Rat Genome Project History

Strategy for the Rat Project

The strategy for the Rat project is a hybrid of the Whole Genome Shotgun (WGS) approach used for the mouse genome and the hierarchical (BAC clone) approach used for the human genome.

The sequencing combines whole-genome shotgun (WGS) reads produced by Celera, Genome Therapeutics Corp., Baylor and UUGC with BAC shotgun reads produced by Baylor from the CHORI-230 library and BAC end reads by TIGR. Fingerprint contigs contributing to BAC selection and genome assembly were generated by the BC Cancer Agency Genome Sciences Center.

The assembly process consists of multiple phases:

Selection of BACs

As mapping data for the Rat has developed, selection criteria have improved. Clones were initially selected at random, then selected based on fingerprint contigs, and finally selected by clone-walking and gap-filling using tiling paths of enriched BAC assemblies and BAC-end sequences.

Enriched BAC assembly

Individual WGS reads are mapped to specific BAC projects on the basis of sequence overlaps and mate pairs. The combined read set for each BAC is assembled with Phrap, followed by contig fixup and scaffolding based on mate pair constraints (mostly 2 kb and 10 kb inserts, but including some 50 kb inserts and BAC ends). These enriched BAC assemblies are available clone-by-clone at NCBI. In total, 20,987 BAC projects were skimmed and assembled, using 36.3 million high-quality reads as input. Additional synthetic projects were generated and assembled from rescued mistracked sequences (these are denoted by Baylor project names beginning with "ky" or "kz").

Bactigging

Sets of contiguous overlapping BAC clones (bactigs) are identified based on sequence overlaps of scaffolds, shared WGS reads, and WGS mate pairs (including BAC end pairs). In total, 20,638 BACs were combined into 1,607 bactigs.
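As a rough illustration of the grouping step, the sketch below treats each pairwise BAC overlap as an edge in a graph and reports each connected group of two or more BACs as a bactig. This is a simplification under stated assumptions: the function name `group_bactigs`, the clone names, and the use of union-find are all hypothetical, and the actual pipeline also weighs shared reads and mate pairs when deciding which overlaps to trust.

```python
from collections import defaultdict


def group_bactigs(bacs, overlaps):
    """Group BACs into connected sets given trusted pairwise overlaps.

    bacs: iterable of BAC names (hypothetical identifiers).
    overlaps: iterable of (bac_a, bac_b) pairs judged to overlap.
    Returns (bactigs, singletons): bactigs is a sorted list of sorted
    name lists with at least two members; singletons is a sorted list
    of BACs that overlap nothing.
    """
    bacs = list(bacs)
    parent = {b: b for b in bacs}

    def find(x):
        # Walk to the set representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    groups = defaultdict(list)
    for b in bacs:
        groups[find(b)].append(b)

    bactigs = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return bactigs, singletons


if __name__ == "__main__":
    names = ["bacA", "bacB", "bacC", "bacD", "bacE"]
    pairs = [("bacA", "bacB"), ("bacB", "bacC")]
    print(group_bactigs(names, pairs))
```

Connected components ignore the linear order of clones along the chromosome; a real tiling path would also need each overlap's position and orientation.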

Bactig reassembly

Each bactig is reassembled using methods similar to those used for the enriched BAC assemblies, applying Phrap in a windowed fashion to minimize repeat interference within the bactig. After contig generation by Phrap, paired-read information is used to identify misassemblies, split misjoined contigs, and order and orient contigs into scaffolds.

Superbactig scaffolding

Bactigs and singleton BACs linked by WGS mate pairs (including BAC-end sequences) were combined into 917 superbactigs and rescaffolded. Of these, 94 superbactigs were singleton BACs.

Ultrabactig construction

Bactig sequences are linked together based on WGS and BAC end pairs and FPC (fingerprint contig) information. The 419 ultrabactigs have an average size of 6.54 Mb (N50 size of 18.5 Mb). The sequence coverage is 93 percent over the entire genome.
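The gap between the average size and the N50 size is expected, because N50 is weighted by sequence length rather than by piece count. The sketch below uses the common definition: the largest length L such that pieces of length L or greater hold at least half of the total bases. The text does not say which exact variant of N50 was used here, so that definition is an assumption, and the input lengths are made-up numbers rather than project data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Pieces are taken from longest to shortest until they account for
    at least half of the summed length; the length of the piece that
    crosses that threshold is the N50.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length


if __name__ == "__main__":
    sizes = [5, 4, 3, 2, 1]  # illustrative lengths only
    print("mean:", sum(sizes) / len(sizes))
    print("N50:", n50(sizes))
```

A few very large pieces pull N50 well above the mean, which is consistent with an assembly that has many small ultrabactigs alongside a smaller number of very large ones.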

Genome mapping

Ultrabactigs are mapped to the genome based on RHMap data from MCW and, in regions of no reliable marker data, using synteny to prior mammalian assemblies. Unmapped ultrabactigs were placed in the chrUn files. Where small scaffolds from an ultrabactig could not be ordered and oriented, they were placed in the "random" file corresponding to the same chromosome where the main scaffold was mapped.
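The placement rules above amount to a small decision procedure, sketched below. The function name `placement_bin` and the exact labels (for example `chr5_random`) are assumptions chosen for illustration; the text says only that there are chrUn files and a "random" file per chromosome.

```python
from typing import Optional


def placement_bin(chromosome: Optional[str], ordered_and_oriented: bool) -> str:
    """Pick the output bin for a scaffold.

    chromosome: chromosome the parent ultrabactig was mapped to,
        or None if it could not be mapped at all.
    ordered_and_oriented: whether this scaffold could be ordered and
        oriented within that chromosome.
    """
    if chromosome is None:
        return "chrUn"  # unmapped ultrabactigs
    if ordered_and_oriented:
        return f"chr{chromosome}"  # placed in the chromosome proper
    return f"chr{chromosome}_random"  # known chromosome, unknown position


if __name__ == "__main__":
    print(placement_bin("5", True))
    print(placement_bin("5", False))
    print(placement_bin(None, False))
```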

Finished sequence substitution

High-quality finished BAC sequences are mapped to the draft genome assembly. The finished sequences were spliced into the assembly, replacing the contigs, or parts of contigs, that were contained within them. Draft contigs that lay in a region replaced by finished sequence, but that carried sequence not contained in the finished sequence, were moved to a different location.

Publications

Havlak P, et al. The Atlas genome assembly system. Genome Res. 2004 Apr;14(4):721-32.

Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature. 2004 Apr 1;428(6982):493-521.