Atlas Whole Genome Assembly Suite


Atlas is a collection of software tools to facilitate the assembly of large genomes from whole genome shotgun reads, or a combination of whole genome shotgun reads and BAC or other localized reads.

This suite of tools has been used in the assemblies of the rat (Rattus norvegicus), fruit fly (D. pseudoobscura), and honey bee (Apis mellifera), and will be used for the assemblies of sea urchin (Strongylocentrotus purpuratus), bovine (Bos taurus), and other species.

This is a preliminary alpha release of this software. Some components of the software used to produce the Rat assembly are not yet packaged for distribution. These will be added shortly. While we anticipate no problems running this software on different platforms, this software has not yet been extensively tested off-site.

What we have in the package

Documents

Binaries:

Binaries are available for linux and Sun Solaris Unix. A Mac OS X version will be available soon. They are packed in separate gzipped tar files atlas-linux.tgz and atlas-solaris.tgz.

The above three are meant to be called by the driver script atlas-generate-scaffold, not to be run as stand-alone programs.

Perl scripts

You need to edit the above four perl scripts to match the location of the perl binary on your system. See the next section, A Quick Tour, for details.

Perl modules

Subroutines for scaffolding.

Reads used for demonstration

A Quick Tour

  1. After downloading the file atlas-linux.tgz, or atlas-solaris.tgz, or atlas-osx.tgz (available soon) for the correct operating system, move it to a directory where you want to install the atlas assembly package, then unpack the file by typing the command:

    tar xvzf atlas-linux.tgz
    or
    tar xvzf atlas-solaris.tgz

    You will get the following subdirectories and files:

    documents/
    documents/readme.html
    documents/steps.html
    documents/graphics/
    bin/
    bin/atlas-overlapper
    bin/atlas-splitbadcontigs
    bin/atlas-screen-window
    bin/atlas-binner
    bin/atlas-trimphraptails
    bin/atlas-linearsequence
    bin/atlas-count-batch
    local/
    perl/
    perl/bin/
    perl/bin/atlas-build-scaffold-file
    perl/bin/atlas-check-contig
    perl/bin/atlas-generate-scaffold
    perl/bin/atlas-demo
    perl/lib/
    perl/lib/Atlas/
    perl/lib/Atlas/Scaffold.pm
    perl/lib/Atlas/Utility/
    perl/lib/Atlas/Utility/ObjectAttribute.pm
    perl/lib/Atlas/Project/
    perl/lib/Atlas/Project/Trace.pm
    perl/lib/Atlas/Project/Contig.pm
    perl/lib/Atlas/ScaffoldHeapEle.pm
    data/
    data/demo.fa
    data/demo.fa.qual
    data/demo.fa.screen

  2. We assume that the perl program in your system is /usr/bin/perl. If it is not, you need to edit the first line (#!/usr/bin/perl) accordingly for all four perl scripts in the perl/bin/ subdirectory.
  3. Set the ATLAS_ROOT environment variable. This should be the directory where you have already unpacked and installed the atlas package.
  4. In bash shell, use the command
    export ATLAS_ROOT=xxxx
    You may want to put this line in the .bashrc file in your home directory.

    In C shell, use the command
    setenv ATLAS_ROOT xxxx
    And the equivalent file is .cshrc.

  5. In the Scaffolder, we need to use perl modules
    Heap and Heap::Fibonacci,
    which are not a part of the standard perl library. Check to make sure you have them in your system. If not, you can get them from CPAN (Comprehensive Perl Archive Network).

    You can either install them in the default perl library directory in your system, or in the ATLAS_ROOT/perl/lib directory. In the latter case, they are:
    perl/lib/Heap.pm, and
    perl/lib/Heap/Fibonacci.pm.

  6. We use phrap as the assembler for single-bin assembly. You need to make a symbolic (soft) link from local/phrap to the phrap executable on your computer. Please do the following:

    cd local
    ln -s <full_path_of_phrap> phrap

    where <full_path_of_phrap> is where phrap is installed in your computer. For example, at HGSC, <full_path_of_phrap> is /home/hgsc/bin/phrap

    If you do not have phrap, please see the webpage http://www.phrap.org for information about how to get a copy.

  7. Now you are ready to run atlas-demo with the data we supplied.

    Change to perl/bin directory, and then run the command

    atlas-demo -d <root_dir> -a <asm_dir> <project_name>
    where <root_dir> is the directory where you will put all your projects in,

    <project_name> is the project name, and
    <asm_dir> is the assembly name.

    For example, if you want to put all your assembly projects in the directory /data/atlas/projects, call the demo project abcd, and run it as asm01, then type
    atlas-demo -d /data/atlas/projects -a asm01 abcd

  8. The results:

    The data files we supplied are copied to the /data/atlas/projects/reads directory, where they are indexed and trimmed. The 32-mer analysis is also done in that directory.

    The assembly results are in the /data/atlas/projects/abcd/asm01 directory. Only the final ace and linearized scaffold files are directly under this directory. Intermediate results are categorized into pre-bin, bin-asm, and post-bin-asm, and stored in the corresponding subdirectories. The scaffold sequence files in the blastz subdirectory are for the same scaffolds, but each scaffold is written to a separate file. The subdirectory is so named because it is easier to run blastz on each scaffold separately.

    For the ace file format, see http://www.phrap.org.

    In the graph file, output is written in the form
    SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...

    For the format of scaffold files, see the notes for atlas-linearsequence.
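
Before running atlas-demo, the setup in steps 3 to 6 above can be checked with a short script. The following is only a sketch under stated assumptions: the function name preflight is hypothetical and not part of the Atlas package, and it only tests what the steps describe (ATLAS_ROOT points at a directory, local/phrap resolves to an executable, and Heap::Fibonacci can be loaded either system-wide or from perl/lib).

```python
import os
import shutil
import subprocess
from pathlib import Path


def preflight(atlas_root, perl="perl"):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper, not shipped with Atlas.
    """
    if not atlas_root:
        return ["ATLAS_ROOT is not set"]
    root = Path(atlas_root)
    if not root.is_dir():
        return [f"ATLAS_ROOT does not point to a directory: {root}"]

    problems = []

    # Step 6: local/phrap should be a link to an executable phrap.
    phrap = root / "local" / "phrap"
    if not (phrap.is_file() and os.access(phrap, os.X_OK)):
        problems.append(f"{phrap} is missing or not executable")

    # Step 5: Heap::Fibonacci must load, from the system or from perl/lib.
    if shutil.which(perl) is None:
        problems.append(f"perl interpreter not found: {perl}")
    else:
        cmd = [perl, f"-I{root / 'perl' / 'lib'}", "-MHeap::Fibonacci", "-e", "1"]
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            problems.append("Perl module Heap::Fibonacci could not be loaded")
    return problems


if __name__ == "__main__":
    for line in preflight(os.environ.get("ATLAS_ROOT")) or ["all checks passed"]:
        print(line)
```

Run it in the same shell in which ATLAS_ROOT was exported; each printed line names one thing to fix before starting the demo.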


Notes:

atlas-overlapper -help

atlas-overlapper -s {SourceFile} -q {QueryFile} -o {OutFile} [{heuristic options}] {query2 query3 query4 ....}

IMPORTANT LIMITATIONS:

Query read sequences cannot be longer than 4095 bp (a 2^12 limit).
There cannot be more than 524,000 source reads (a 2^19 limit).

UNPREDICTABLE behavior will result if these limits are exceeded.

Uses a filtration technique to compare all of the sequences in a FASTA file to each other, identifying all of the overlaps between the sequences. Small samples are taken from each read and compared with the full sequence of all other reads using a fast exact match algorithm (hash table in current version; suffix trees have been used in other versions). Repeats are filtered out, de novo, in order to produce physical overlaps only. A banded alignment of all pairs that contain a common low-copy n-mer is performed to determine true overlap amounts to be reported in a graph file. The output will be in the form:

SourceName QueryName1{QuerySpan,QueryScore,LeftExtension,RightExtension,SeedCopyCount} QueryName2...
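
As an illustration of the graph line layout above, here is a minimal parsing sketch. It assumes that the braces are literal, that the five values are integers, and that names contain no whitespace or braces; none of this is confirmed by the help text. The names Edge and parse_graph_line are hypothetical.

```python
import re
from typing import NamedTuple


class Edge(NamedTuple):
    query: str
    span: int
    score: int
    left_ext: int
    right_ext: int
    seed_copies: int


# One edge: a name immediately followed by {five comma-separated values}.
_EDGE = re.compile(r"(\S+?)\{([^}]*)\}")


def parse_graph_line(line):
    """Split one graph line into (source_name, [Edge, ...])."""
    parts = line.split(None, 1)
    if not parts:
        raise ValueError("empty graph line")
    source = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    edges = []
    for name, fields in _EDGE.findall(rest):
        values = [int(v) for v in fields.split(",")]
        if len(values) != 5:
            raise ValueError(f"expected 5 values for {name}, got {len(values)}")
        edges.append(Edge(name, *values))
    return source, edges
```

If the output was written with -Z, the file would first need to be opened with gzip.open in text mode before feeding lines to this function.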

A single query file can be specified with the -q option (for backward compatibility), but multiple query files can also be added onto the line, with no arbitrary limit.

-s {SourceFile} Name of source FASTA file. Hash will be built from source.
-q {QueryFile} Name of query FASTA file. Query reads will be sampled and samples compared to source hash.
-Q {QueryFOF} A file which lists the query files to use. Can be used in conjunction with -q or with a list of queries as free arguments. (The purpose of this option is to allow very long query lists).
-o {OutFile} Name of output file. This will be an adjacency list graph of the overlaps for each sequence in the source file.
-v Print version and version history.

------ Misc -----------------------------

-b {Begin with} Number of Sequence in source to start with. (0 based).
-e {End with} Number of Sequence in source to end with.
The b and e options are for building the hash from a large FASTA file (e.g. > 50,000 reads), where the file has to be compared in chunks. Since it scans through the source until it reaches read b, this is not as efficient as creating several separate source files and running the program several separate times, but it is often more convenient.
-Z gzip the output graph. Default is not-gzipped. Output graphs can get quite large, especially now that multiple values are being reported for each edge. This option will save the graph output directly to a gzip file. NOTE: Does NOT automatically tack on .gz to file name.
-F Output format: Flip sense of overlap extensions so they are relative to the read that is origin of the edge (default relative to sink of edge).
-f Output format: Flip direction of edges to go from source read to query read (default is from query to source).

----- Heuristics ------------------------------

-P Prime-numbered kill hash size. A good value will be at least 1.2 times the number of kill mers you have. The integer value selects one of the following (5 is the default):

0 357913931 Roughly 3 GB of space.
1 101000777 Roughly 900 MB of space.
2 79999987 Roughly 800 MB of space.
3 59999999 687 MB
4 49999991 429 MB
5 39999943 340 MB
6 29999947 258 MB
-p Directly specify prime numbered kill hash size (overrides -P).
-O {oligo size} Size of k-mer to use for seeding banded alignments (4-32). Allow the sample length to be specified. Default 32bp.
-B {band size} Band size for banded alignments (Default=2, which is +/- 2 from diagonal). Valid range: 1-25.
-R {cutoff mers} This is the repeat cutoff specified as a number of mers, rather than as a number of standard deviations above the mean. Overrides -r option. Default is to use -r.
-r {RepeatThresh} Cutoff for repeats. Samples occurring more than mean+r*sigma times are ignored as repeat samples. This cutoff attempts to be a function of the actual coverage, since the mean number of occurrences of a random sample will be close to the mean coverage, skewed high a bit by the presence of repeats. For low-coverage data (C < 2) r may need to be 2 or 3 since sigma is likely to be small. For high-coverage data, r closer to 0 is probably appropriate. I plan to make this more sophisticated when I get a chance so that the user doesn't have to think about it. Default = 1.
-K {RepeatFileName} When specified a file of known repeat mers is read in and used to filter repeats. File lists mer, a tab, and mer count on each line. K for kill list.
-k {MinRepeatCount} When reading in -K repeat file, save in RAM only those mers with count >= MinRepeatCount.
-Y {MaxOverlapSeed} Max seed. Largest mer count that will be used to seed an overlap. This is analogous to the -R option, except that it is based on the mer frequencies reported in RepeatFile (-K), which are globally derived. This option works before the -R option, so the -R stats will be computed on the mers that remain.
-H {HashMerLimit} This is another form of repeat cutoff. The purpose of this limit is to constrain the size of the hash. Basically, HashMerLimit is a hard limit on the number of locations that will be saved for any mer. It should be set to something just slightly larger than what you expect mean+r*sigma above to be. 100 or 1000 are pretty safe values in most cases. The smaller this number, the more reads can be put into each hash, and so quicker a set of jobs will go. Default value is 1000.
-m {MaxMismatch%} Maximum mismatch percent to accept in the aligned region. Edges which reflect more than this fraction of mismatches will not be reported. This is translated into a score cutoff as follows. Since each match is +1 and each mismatch is -1 (or -2 for an indel, which we ignore here), a mismatch is penalized twice as far as the score is concerned (no +1 for the match, and a -1 penalty). So, for a particular span, the score that reflects an x% fraction of mismatches is score = (1-(x/100))*span - (x/100)*span = (1-(2x/100))*span. So the cutoff is simply cutoff = (1-(2x/100))*span. If score < cutoff, the edge is omitted from the graph. Default value is 2 (i.e. 2%).
-M {Min Read Size} Minimum size of reads to consider. Default is 100bp.
-S {SamplesPerRead} Samples per read. Code tries to take this many samples, but read may be too short in which case it takes as much as it can.
-I {SampleInterior} Samples across entire read. Default is end sampling only, where SamplesPerRead/2 samples are taken from each end of the read. End sampling typically gives you more sensitivity per sample since we are looking for end-to-end matches, but there arise situations where it is necessary to look at the interior of reads also.
-G {SampleGap} Step between samples. Default is same as sample size (-O), for perfectly tiled samples (barring Ns).
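
The arithmetic behind the -P, -r and -m options above can be sketched in a few lines. This is not the overlapper's code: the function names are hypothetical, the -P table is copied from the option description (reading the entry for option 1 as 101000777), and using the population standard deviation for sigma is an assumption.

```python
import statistics

# Table sizes for -P, copied from the option description.
# The entry for option 1 is read as 101000777 (assumption).
KILL_HASH_SIZES = {
    0: 357913931,
    1: 101000777,
    2: 79999987,
    3: 59999999,
    4: 49999991,
    5: 39999943,
    6: 29999947,
}


def pick_kill_hash_option(kill_mers, headroom=1.2):
    """Return the -P value of the smallest table with at least
    headroom * kill_mers slots, or None if no listed table is big enough."""
    needed = headroom * kill_mers
    fitting = [(size, opt) for opt, size in KILL_HASH_SIZES.items() if size >= needed]
    return min(fitting)[1] if fitting else None


def repeat_cutoff(sample_counts, r=1.0):
    """mean + r*sigma over observed sample occurrence counts (the -r rule).
    Population standard deviation is an assumption."""
    return statistics.mean(sample_counts) + r * statistics.pstdev(sample_counts)


def score_cutoff(span, max_mismatch_pct=2.0):
    """cutoff = (1 - 2x/100) * span, the -m rule with x in percent."""
    return (1.0 - 2.0 * max_mismatch_pct / 100.0) * span
```

For example, with the default of 2% an alignment spanning 500 bases needs a score of about 480 to be kept, and 30 million kill mers call for at least 36 million slots, which the default table (-P 5) provides.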

--- Experimental options -----

-x {BucketSize} Set the initial hash bucket size.

Written by: James Durbin (kdurbin@bcm.tmc.edu)
Version 1.70


atlas-binner -help

Reads in an asymmetric directed graph, with each possible source node on a separate line, followed by its whitespace-delimited outedges. Each outedge is represented as a pair, nodename whitespace weight, where the weight is between 1 and 99. Output is one file of read names (one read per line) per bin. Core reads of a bin occur in only one bin, but peripheral reads may occur in multiple bins. Bins are named prefixNNNNsuffix, where prefix0000suffix is the default bin for reads not placed elsewhere. [Defaults are given in square brackets.] [$Revision: #2 $]

Option values are:

-f {GraphFile} [] Name of graph file (empty => cin).
-p {BinPrefix} [bin] Prefix for bin file names.
-s {BinSuffix} [.fon] Suffix for bin file names.
+A [-] Boolean, put all sinks of core edges in one bin if true.
-i {Ignore} [35] Integer, minimum below which edge weight is ignored.
-c {Coverage} [6] Floating-point, estimated coverage of input graph file. This implies -l (3 * coverage) and -r (2 * coverage) unless alternative values for those options are given.
-l {FrontierLmt} [-1] Integer, maximum size of the frontier in BFS.
-L {FrontierLmt} [-1] Integer. If this is different from -l, all values between -l and -L will be used.
-r {RepeatEdge} [-1] Integer, max edges before node treated as repeat. This is used in BAC-fishing mode only.
-k {KillRepeat} [12] Integer, kill overlap edges seeded by nmer with more than this many copies.
+S [-] Boolean, consider extensions relative to the sink (otherwise relative to origin).
+E [-] Boolean, count left and right neighbors separately against RepeatEdge count.
-n {EstNodes} [262144] Integer, estimate max number nodes for efficiency's sake.
-m {MinBinSize} [10] Integer, minimum number of nodes per output bin.
-x {MaxBinSize} [99999] Integer, maximum number of nodes per output bin.
-e {MaxBinDist} [10000] Integer, maximum distance from start of bin in edges.
+w [+] Boolean, use estimated-exact-span weighting.
-d {DebugString} [] Each character indicates a debugging option to turn on;
'+' indicates turn on all debugging.
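
The input format and a few of the defaults above can be made concrete with a small sketch. The function names are hypothetical, treating -1 as "not given" and rounding 3 * coverage and 2 * coverage to integers are assumptions, and nothing here reproduces the binning algorithm itself.

```python
def read_outedges(line, ignore=35):
    """Parse 'source name1 weight1 name2 weight2 ...' and drop edges whose
    weight is below the -i threshold."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty graph line")
    source, rest = tokens[0], tokens[1:]
    if len(rest) % 2:
        raise ValueError("outedges must come in name/weight pairs")
    edges = [(name, int(weight)) for name, weight in zip(rest[0::2], rest[1::2])]
    return source, [(name, weight) for name, weight in edges if weight >= ignore]


def frontier_and_repeat_limits(coverage=6.0, frontier_limit=-1, repeat_edge=-1):
    """Values of -l and -r implied by -c unless given explicitly.
    -1 is taken to mean 'not given' (assumption)."""
    l = frontier_limit if frontier_limit != -1 else int(round(3 * coverage))
    r = repeat_edge if repeat_edge != -1 else int(round(2 * coverage))
    return l, r


def bin_file_name(index, prefix="bin", suffix=".fon"):
    """prefixNNNNsuffix with a four-digit, zero-padded bin number."""
    return f"{prefix}{index:04d}{suffix}"
```

With the defaults, the implied limits are -l 18 and -r 12, and the default bin for unplaced reads is bin0000.fon.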

atlas-count-batch

atlas-count-batch K HASHSIZE SLICING SLICE MINREPORT

Arguments are:

K kmer size, in our demonstration K=32
HASHSIZE A large prime for hash table size (genomesize*1.2)/slicing factor < hash table size < (available RAM in bytes)/10
SLICING hash slicing factor: 1/SLICING of k-mers will be tabulated, thus SLICING number of jobs will be needed to get the complete k-mer table.
SLICE which job this is of several jobs (0, ..., SLICING-1) to get complete table.
MINREPORT print k-mers with frequencies >=MINREPORT; print only distribution if MINREPORT=0
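
The HASHSIZE constraint and the one-job-per-slice scheme above can be sketched as follows. The helper names are hypothetical; picking the first prime above the lower bound is just one reasonable choice, not something the tool requires.

```python
def is_prime(n):
    """Trial division; adequate for hash sizes up to a few billion."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def hash_size_bounds(genome_size, slicing, ram_bytes):
    """(lower, upper) bounds: genome_size*1.2/slicing < HASHSIZE < ram_bytes/10."""
    return genome_size * 1.2 / slicing, ram_bytes / 10


def slice_commands(k, genome_size, slicing, ram_bytes, min_report):
    """One atlas-count-batch command line per slice (0 .. slicing-1)."""
    lower, upper = hash_size_bounds(genome_size, slicing, ram_bytes)
    hashsize = next_prime(int(lower) + 1)
    if hashsize >= upper:
        raise ValueError("no valid HASHSIZE: increase SLICING or use more RAM")
    return [
        f"atlas-count-batch {k} {hashsize} {slicing} {s} {min_report}"
        for s in range(slicing)
    ]
```

Raising SLICING lowers the memory needed per job at the cost of more jobs, since each job tabulates only 1/SLICING of the k-mers.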

atlas-screen-window -help

Reads in two fasta sequence files (masked and unmasked) and a quality file for a set of reads to be trimmed. Also requires a file of vector sequences to be aggressively trimmed (any 10bp match) at the beginnings of reads. Looks for the first and last passing windows and masks any leading or trailing junk. Use with '-s v' to screen for Phrap without quality trimming.

Boolean options are specified with + to turn on, - to turn off (e.g., -x)

Default Option values are:

-O {string} [] suffix to append to arg for opening original (unscreened) sequence file
-S {string} [] suffix to append to arg for opening screened sequence file
-Q {string} [.qual] suffix to append to arg for opening quality file
-p {phredscore} [20] minimum phred quality score for good bases
-l {length} [100] minimum number of good bases in insert for passed read
-w {length} [50] trim read from first passing window of this size to last
-P {length} [50] look extra hard for vector in this many bases at front of read
-e {error_rate} [1.25] maximum expected error rate for a passing window
-k or +k [+] trim to window bases with qual >= 20
-x or +x [-] + means trim/screen by hard masking (Xs) instead of soft (lowercase)
-s {char} [b] v = vector only, q = quality only, b = both
-f or +f [+] create sequence file for processed reads
-q or +q [+] create quality file for processed reads
-o or +o [+] include all reads (not just passed) in output files
-t or +t [-] trim by deleting bases outside windows instead of masking
-V {string} [] path of file to read vector sequence from (for screening first 50 bases)
-d {DebugString} [v] Each character indicates a debugging option to turn on;
'+' indicates turn on all debugging.
'v' gives some statistics
'V' gives lots of statistics
-v or +v Print RCS version number
-h print usage and option descriptions

For example,

atlas-screen-window +t -q -d 'Vo' foo.fa bar.fa
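
The idea of trimming to the first and last passing windows can be illustrated with a short sketch. It rests on assumptions the help text does not confirm: a window is taken to pass when its mean phred-derived error probability, expressed in percent, is at most the -e value, and everything outside the span from the first to the last passing window is soft-masked. The function names are hypothetical and vector screening is not modelled.

```python
def window_passes(quals, max_error_pct=1.25):
    """True if the mean expected error of the window, in percent, is at
    most max_error_pct. Phred score q means error probability 10**(-q/10)."""
    expected_pct = sum(10 ** (-q / 10) for q in quals) / len(quals) * 100
    return expected_pct <= max_error_pct


def passing_span(quals, window=50, max_error_pct=1.25):
    """(start, end) from the first passing window to the end of the last
    passing window, or None if no window passes."""
    starts = [
        i
        for i in range(len(quals) - window + 1)
        if window_passes(quals[i:i + window], max_error_pct)
    ]
    if not starts:
        return None
    return starts[0], starts[-1] + window


def soft_mask(seq, quals, window=50, max_error_pct=1.25):
    """Lowercase the bases outside the passing span (soft masking)."""
    span = passing_span(quals, window, max_error_pct)
    if span is None:
        return seq.lower()
    begin, end = span
    return seq[:begin].lower() + seq[begin:end] + seq[end:].lower()
```

Hard masking (+x) would replace the same outer regions with Xs, and +t would delete them instead.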


atlas-splitbadcontigs -help

atlas-splitbadcontigs -a AceIn -q ReadQualityIn -o AceOut {options}

A moving average of the template coverage is computed starting at the ends and going inward. The window size for this moving average is given by trimWindow, and is expressed in base pairs. The trim cutoff is specified as the first place where the moving average of the template coverage is greater than trimHeight. The contig is split into two contigs around this 'template crash'. Reads which overlap the left and right boundary of the crash region go to the left or right if they are anchored there by a matched mate, but otherwise are simply dropped.

When a ProblemFile is given with the -p option, a totally different set of more subtle split rules are used that aren't just based on template crashes.
-a AceFile
-q Read qualities (for new consensus computation).
-o Ace output file name.
----------------
-r TrimmedReadListOut
-w trimWindow (for default template crash mode, i.e. not used with -p)
-c trimCoverage (for default template crash mode, i.e. not used with -p)
-g graph output which includes Rep_node: specifiers. When given only nodes listed in the Rep_node: list will be trimmed, others will be left alone. This option excludes -p. With this option, splits are on template crashes only.
-p Problem file. This is a file that identifies regions of suspicious patterns of external links. These are probably problems, so if at all possible, a split will be done on each of these problems. This option excludes -g. With this option, a series of rules are applied to determine if the suggested problem area is real and, if so, how to split.
-M Minimum size of zero template coverage contig to keep after splitting.

atlas-trimphraptails -help

atlas-trimphraptails -a AceIn -q ReadQualities -o AceOut {options}

A moving average of the template coverage is computed starting at the ends and going inward. The window size for this moving average is given by trimWindow, and is expressed in base pairs. The trim cutoff is specified as the first place where the moving average of the template coverage is greater than trimHeight.

-a AceFile
-q Read qualities (for new consensus computation).
-o Ace output file name.
----------------
-r TrimmedReadListOut
-w trimWindow
-c trimCoverage
-g graph output which includes Rep_node: specifiers. When given only nodes listed in the Rep_node: list will be trimmed, others will be left alone. Unlike previous versions, starting with version 1.1 'node' is taken to be just one side of a contig. So if Contig63_right is the only rep node, only the right half of Contig63 will be trimmed.
-M Minimum size of trimmed contig to keep. If either (or both) of contig and trim portion of contig are smaller than this, they are dropped.
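
The moving-average rule described above can be sketched as follows. This is an illustration only: the function names are hypothetical, per-base template coverage is assumed to be available as a list of numbers, and placing the cutoff at the start of the first window that exceeds the threshold is an assumption.

```python
def trim_point(coverage, trim_window, trim_height):
    """Start index of the first window, scanning from the left, whose mean
    coverage is greater than trim_height; None if there is no such window."""
    if trim_window <= 0 or len(coverage) < trim_window:
        return None
    total = sum(coverage[:trim_window])
    for start in range(len(coverage) - trim_window + 1):
        if start:
            # Slide the window one base to the right.
            total += coverage[start + trim_window - 1] - coverage[start - 1]
        if total / trim_window > trim_height:
            return start
    return None


def trim_bounds(coverage, trim_window, trim_height):
    """(begin, end) of the region kept after trimming both ends inward,
    or None if nothing passes."""
    left = trim_point(coverage, trim_window, trim_height)
    right = trim_point(coverage[::-1], trim_window, trim_height)
    if left is None or right is None:
        return None
    end = len(coverage) - right
    return (left, end) if left < end else None
```

Scanning the reversed list applies the same rule from the right end, so both tails are trimmed independently.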


atlas-linearsequence -help

atlas-linearsequence -a AceFile -g ScaffoldGraph -o Output {Options}

Reads in an ace file and an associated graph file and produces a reference sequence. Sequence is written out in the relative order and orientation given by the graph file. Gaps between ordered and oriented sequences are padded with Ns. In single-sequence mode, gaps between unordered sequences are padded with Xs:

>Scaffold0
sssssNNNNNsssssNNNNssssXXXXXsssssNNNNsssssXXXssssXXXssssXXX

In multiple-sequence mode, each scaffold is one sequence like:

>Scaffold0
sssssNNNNNsssssNNNNssss
>Scaffold1
sssssNNNNsssss
>UnscaffoldedContig2
ssss
>UnscaffoldedContig3
ssss

If the sequence ends up with blocks of more than 50K Ns, output goes to a file that starts with ERROR.


-a Ace file name.
-g Scaffold .graph file name.
-o Output file name.
-p Project name to append to _Scaffold1, etc.
-q Also write quality file out to Output.qual
-S Enable single sequence mode. Default is multiple sequence.
-m Multiple *file* mode. Each scaffold written to its own file.
-F Use fixed number of Ns to pad between scaffolds rather than based on distance estimate.
-s (15000) Minimum size of scaffold to output.
-U Include unscaffolded contigs (that meet criteria for unscaffolded contigs)
-M Minimum size for BAC only unscaffolded contigs.
-Q (1) Quality trim for contigs.
-X (100) Number of X's to pad between unordered sequences.
-N (50) Minimum number of Ns to pad between ordered seqs.
-G MaxGapSize to allow in output. Reports an error and writes output to .FAIL file if gap greater than this in output. Default is disabled.
-O MaxOutputSize. Reports an error and writes output to .FAIL file if the total sequence size exceeds this amount. Default is disabled.
-D Show debugging output (Contig and gap structure printed).
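
The padding rules above (Ns between ordered contigs, Xs between unordered sequences) can be illustrated with a short sketch. The function names are hypothetical, gap estimates are assumed to be given in bases, and details such as trailing padding or quality trimming are not modelled.

```python
def join_scaffold(contigs, gap_estimates, min_n=50):
    """Join ordered contigs with runs of Ns. Each run is as long as the gap
    estimate, but never shorter than min_n (the -N option)."""
    if not contigs:
        raise ValueError("a scaffold needs at least one contig")
    if len(gap_estimates) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_estimates, contigs[1:]):
        parts.append("N" * max(min_n, gap))
        parts.append(contig)
    return "".join(parts)


def single_sequence(scaffolds, x_pad=100):
    """Single-sequence mode: unordered scaffolds separated by x_pad Xs (-X)."""
    return ("X" * x_pad).join(scaffolds)


def to_fasta(name, seq, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

In multiple-sequence mode each scaffold would be passed to to_fasta on its own; in single-sequence mode the scaffolds would first be combined with single_sequence.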