Manual for ATLAS_GapFill Pipeline (V2.2) ATTENTION: Please understand Section 7. Trouble-shooting before you run ATLAS-GapFill package. 0. Introduction Many gaps in WGS assembled genome are due to repeats near unique regions, and whole genome assembly does not know where to place repetitive reads. We figured out a way to deal with the repetitive reads. In other words, the globally repetitive reads might be locally unique for each gap region. So many gaps within scaffolds might be filled. Based on the above idea, we developped ATLAS_GapFill pipeline to improve an existing reference genome by closing gaps within the scaffolds using Illumina mate reads. The major steps are: (1) Identify gap associated reads from BWA mapping results for each gap, and assemble each gap locally using different available assemblers (Phrap, Newbler, and Velvet), (2) Compare the locally assembled contigs to the corresponding reference scaffold using Crossmatch, and (3) Fill the gaps of the scaffold with the locally assembled contigs that bridge the gaps. 1. New Features This is version 2.2 of the pipeline. In comparison with version 2.1, it improves in the following areas. 1.1 A pretest code is used to check the status of three assemblers (phrap, velevt, and newbler) and the user's cluster. The details are discussed in Section 4 (3). 1.2 Unlike the previous versions, the scaffold fasta file is optional in input_list.txt. 1.3 It is easier to monitor the assembly and crossmatch processes. The details are discussed in Section 6 (3). 1.4 More trouble shootings are discussed in Section 7. 2. Prerequisites (1) System Unix or Linux operating system (tested on Redhat) Cluster queuing software (tested with qsub, bsub and msub) Each node memory more than 16Gb (2) Perl any version (tested on v5.8.8 built for x86_64-linux-thread-multi) (3) Third party tools BWA (test version: 0.5.7) Phrap (test version: 0.990329) Newbler (test version: MapAsmResearch-04/19/2010) Velvet (test version: 1.0.15) Crossmatch (test version: 0.990319) 3. Install Download the file of ATLAS_GapFill_V2.2_release.tar.bz2 and unzip it to your local directory. 4. Before running ATLAS_GapFill (1) Mapping Illumina reads using BWA (a) An input file (-l option) is required for read mapping which lists the location of fastq pairs for each library with the type, size and range information. Example: ./mapping_testdata/data.list Lib_RunId Lib_ID LibSize_mean Libsize_stdv PathtoLane1.fastq,PathtoLane2.fastq Libsize_min Libsize_max Lib_Type L500a L500 500 50 L500.1.fastq,L500.2.fastq 0 1000 frag L2000a L2000 2000 200 L2000.1.fastq,L2000.2.fastq 500 4000 mate ATTENTION: If your reads are paired-end (the library size is less than 1kb usually), the last column should be 'frag'. If your reads are mate-pair (the library size is long than 1kb usually), there are two cases. If your reads are not trimmed and include the junction, the last column should be 'mate'. If your reads are parsed to remove the linker, the last column should be 'frag'. (b) Index the target sequences The mapping process uses the BWA software and the target sequences (usually contigs from an assembly program). The target sequences should be indexed by the BWA index function. Use "bwa index -a is target.contigs" for a small target and "bwa index -a bwtsw target.contigs" for a large target. For details please see the BWA manual (http: //bio-bwa.sourceforge.net/bwa.shtml). (c) Map the reads to the target sequences The "ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl" program contains two steps: the "map" step that maps all the reads and the "merge" step that merges the mapping results. It requires inputs for four options: -t for target sequences, -l for data list, -d for script directory and -s to specify whether it performs "map" or "merge" The script directory contains scripts used by ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl. It should be $install_dir/mapping_perldir after installation. An example to map the reads from a data.list to target.contigs is as follows. First, to perform the "map" step: perl ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl -t target.contigs -d ./mapping_perldir -l data.list -s "map" Second, to perform the "merge" step: perl ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl -t target.contigs -d ./mapping_perldir -l data.list -s "merge" It is better to require 32Gb to run the above command. An example of mapping results for 'frag' reads is ./mapping_testdata/L500.1.fastq.merge. An example of mapping results for 'mate' reads is ./mapping_testdata/L2000.1.fastq.h60_combine.fildup. One of the ouput files (example ./mapping_testdata/data.list.sam.list) will be used for ATLAS-gapfill configure file. A example of small testdata is shown in the folder of ./mapping_testdata. (2) Generate ATLAS-GapFill input_list.txt used for -i option. Example: ./testdata/example_input_list.txt (a) The shell name is defined (example/bin/bash). It is used in the shell scripts for qsub. (b) The scaffold agp file of the reference genome assembly to improve. (example: ./testdata/test.agp) (b) The scaffold fasta formatted sequence file which contains the gaps to fill. (example: ./testdata/test.agp.fa) (d) The sequence fasta formatted sequence file of the contigs in the genome assembly agp file. (example: ./testdata/test.agp.ctgs.fa) (e) The file of data.list.sam.list generated by bwa mapping is used as bamlist. Here is the example used in BCM-HGSC: perl create_ATLAS_GapFill_input_listtxt.pl -shellname /bin/bash -agp /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release/testdata/test.agp [-agpfa /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release/testdata/test.agp.fa] -ctgsfa /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release/testdata/test.agp.ctgs.fa -bamlist /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release/testdata/data.list.sam.list -outprefix test1 -wkrootdir /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release/testdata -GapFillerBindir /stornext/snfs6/assembly/ATLAS_GapFill_V2.1_release -perl /usr/bin/perl -phrap /stornext/snfs6/assembly/local-bin/bin/phrap -newbler /users/yl131317/bin/runAssembly -velveth /users/yl131317/work_dir/velvet_1.0.15/velveth -velvetg /users/yl131317/work_dir/velvet_1.0.15/velvetg -crossmatch /stornext/snfs6/assembly/local-bin/bin/crossmatch > example_input_list.txt ATTENTION: -newbler /users/yl131317/bin/runAssembly <-- correct -newbler /users/yl131317/bin/newbler <-- WRONG!!! (3) Validate input_list.txt cd ./test_input_txt.list perl ../pretest_input_txt.list.pl -i example_input_list.txt -r reads.fa -u qsub [-q your_queue_name] or perl ../pretest_input_txt.list.pl -i example_input_list.txt -r reads.fa -u nc The output is shown in ./test_input_txt.list/example_output. 5. Run ATLAS-GapFill There are three steps to run this package (-s step1, -s step2, and -s step3). You can't run the next step until the current one is done. The default way of queue is msub. The other three options are qsub,bsub, or nc (which means no cluster to use). The default required memory is 16000mb. In BCM-HGSC, our queue name is called 'analysis', so we use option -q analysis. step1: perl ATLAS_GapFill.pl -i your_input_list.txt -s step1 -u qsub [-q your_working_queue -m 16000mb] Example: perl /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/ATLAS_GapFill.pl -i example_input_list.txt -q analysis -s step1 -u qsub -m 16000mb Once you start with the step1, please check the standard error file (example:./testdata/step1.error) to see if something is wrong. If the step1 is done, 20 files starting with step1- will be generated in the folder of (your_prefix)Checkpoints (example:./testdata/run1_Checkpoints). step2: perl ATLAS_GapFill.pl -i your_input_list.txt -s step2 -u qsub [-q your_working_queue -m 16000mb] Example: perl /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/ATLAS_GapFill.pl -i example_input_list.txt -q analysis -s step2 -u qsub -m 16000mb Once you start with the step2, please check the standard error file (example:./testdata/step2.error). If the step2 is done, 12 files starting with step2- will be generated in the folder of (your_prefix)Checkpoints (example:./testdata/run1_Checkpoints). step3: perl ATLAS_GapFill.pl -i your_input_list.txt -s step3 -u qsub [-q your_working_queue -m 16000mb] Example: perl /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/ATLAS_GapFill.pl -i example_input_list.txt -q analysis -s step3 -u qsub -m 16000mb Once you start with the step3, please check the standard error file (example:./testdata/step3.error). If 26 more step3-* files are generated, the whole pipeline is finished successfully. ATTENTION: If possible, please use -u option as qsub, msub, or bsub. The -u nc option is supported but not recommended. 6. Understand ATLAS-GapFill results (1) (your_prefix)Checkpoints step 1: 20 files step 2: 12 files step 3: 26 files (2) (your_prefix)Runningjobs Once a job is submitted to the serve, a tracing file will be generated. After the job is done, the tracing file will be deleted. For example: ./testdata/step3asmphrap.175.job Once you open it, you will see something like: ----- The current running job is: qsub /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/testShellScripts/step3asmphrap.175.job.sh -o /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/testShellScripts/step3asmphrap.175.job.sh.output -e /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/testShellScripts/step3asmphrap.175.job.sh.error -q analysis -l mem=16000mb The current running job ID is: 4411976.sug-moab ----- If interested, please check the standard error in the file after -e option, and check the standard output in the file after -o option. In this example, please check the following two files: /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/testShellScripts/step3asmphrap.175.job.sh.output /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/testShellScripts/step3asmphrap.175.job.sh.error (3) (your_prefix)Msubgaplistdir The *stderr files show the process of assembly and crossmatch for each gap. For example: gap.145.list.newbler.stderr assembly process: RUNNING_JOB: /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501 2-Contig42353-Contig53275 Scaffold501 COMPLETE: The newbler assembly /users/yl131317/bin/runAssembly -o /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/nfa /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/reads.N.fa is done, and the result can be found in /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/nfa. crossmatch process: RUNNING_JOB: /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501 2-Contig42353-Contig53275 Scaffold501 COMPLETE: crossmatch /stornext/snfs6/assembly/local-bin/bin/crossmatch /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/gscaf.fa /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/newbler.contigs.fa -minscore 50 -alignments > /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/newbler.contigs.fa.XM; gzip -f /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/newbler.contigs.fa.XM is done, and the result is saved as /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/nc2_Scafdir/101/Scaffold501/2-Contig42353-Contig53275/newbler.contigs.fa.XM.gz (4) (your_prefix)Results FINAL.contigs.fa FINAL.scaffolds.fa FINAL.scaffolds.agp Note: (a) The gaps are filled by three assemblers. (b) All the contigs are renamed. pnv10000.gapfilled.agp Note: (a) The gaps are filled by three assemblers. If you want to know how many gaps are closed by newbler, try grep -c newbler pnv10000.gapfilled.agp If you want to know how many gaps are closed by phrap, try grep -c phrap pnv10000.gapfilled.agp If you want to know how many gaps are closed by velvet, try grep -c velvet pnv10000.gapfilled.agp (b) The original contig name are not changed. newbler.gapctgs.fa Note: This file includes all the contigs generated by assembling the reads gap by gap using newbler. newbler.gapfilled.1.agp Note: (a) The gaps are filled by newbler only. (b) The original contig name are not changed. phrap.gapctgs.fa Note: This file includes all the contigs generated by assembling the reads gap by gap using phrap. phrap.gapfilled.1.agp Note: (a) The gaps are filled by phrap only. (b) The original contig name are not changed. velvet.gapctgs.fa Note: This file includes all the contigs generated by assembling the reads gap by gap using velevt. velvet.gapfilled.1.agp Note: (a) The gaps are filled by velevt only. (b) The original contig name are not changed. Note: The other folders store the intermediate results. 7. Trouble-shooting (1) runAssembly NOT newbler should be used in input_list.txt. Section 4 (2) perl create_ATLAS_GapFill_input_listtxt.pl -newbler /users/yl131317/bin/runAssembly NOTE: For -newbler option, runAssembly instead of newbler should be used. (2) The contig namesNAMES in contigs.fa and scaffold NAMES should MATCH those in agp file. (3) The option for reads mapping should be set in the right way. Section 4 (1) perl ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl -t target.contigs -d ./mapping_perldir -l data.list -s "map" perl ATLAS_GapFill_0_mapping_frag_and_mate_wrapper.pl -t target.contigs -d ./mapping_perldir -l data.list -s "merge" NOTE: -d ./mapping_perldir/ is wrong. -d ./mapping_perldir is right. (4) If you submit your jobs to your cluster, a job is not finished in the folder of (your_prefix)Runningjobs. For example:step1-8-asmphrap.175.job less step1-8-asmphrap.175.job ---------- The current running job is: echo "/usr/bin/perl /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/ATLAS_GapFill_3_run_localGap_Phrap.pl /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list /stornext/snfs6/assembly/local-bin/bin/phrap /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Runningjobs/step1-8-asmphrap.175.job" | msub -q analysis -l pmem=16000mb -e /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list.phrap.stderr -o /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list.phrap.stdout -d /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata The current running job ID is: 4661628 ---------- check STDERR file (-e option) /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list.phrap.stderr (a) If the STDERR file doesn't exist, it means that the job is not submitte to your cluster. Just rerun the above job. (b) If the STDERR file does exist, find out which gap assembly failed. grep RUNNING_JOB /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list.phrap.stderr RUNNING_JOB: /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Scafdir/101/Scaffold101 8-Contig40625-Contig40626 Scaffold101 grep COMPLETE /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list.phrap.stderr ### got nothing This means the gap 8-Contig40625-Contig40626 assembly got failed, because there is no corresponding mark like "COMPLETE". Edit /stornext/snfs6/assembly/ATLAS_GapFill_V2.2_release/testdata/ZZ_Msubgaplistdir/gap.93.list, and remove the line including the gap 8-Contig40625-Contig40626. Rerun the above job. After the job is done, cd */Checkpoints and manually generate a file called step1-8-asmphrap.Finished. (5) If you submit your jobs to your cluster, it is harder to fix. Here is the way: For the unfinished gap list, use the following command: perl ATLAS-gapfill-fix-newbler.pl input_list.txt -step3Asm -Newbler -gapLoL (full_path/)new.list -overwrite ( see Section 7 (6) ) For the failed gap list, find out which specific gap got trouble, and then remove it ( see Section 7 (4) ). (6) Sometimes newbler may get failed in step1. Remember this care is very rare. Here is the error information: ----- Error: An internal error has occurred in the computation. Assertion: node->getPreviousChordNode() != __null Location: 'void ChordEdge::removeFromList(ChordEdgeNodePtr)' [ChordEdge.cpp:470] Please report this error to your customer support representative. ----- Please try the following way to fix it: (a) cd */Runningjobs delete the running jobs (b) kill the running jobs (c) open a new login node: msub -q analysis -I -V -l pmem=16000mb (d) cd */Msubgaplistdir vi gap.147.list edit gap.147.list to remove the probalemic gap. ls gap.147.list >new.list (e) perl ATLAS-gapfill-fix-newbler.pl input_list.txt -step3Asm -Newbler -gapLoL (full_path/)new.list -overwrite (f) cd */Checkpoints manually generate a file called step1-8-asmnewbler.Finished (g) Keep running step2 (-s step2) 8. Authors Carson Jiaxin Qu (jqu@bcm.edu), Henry Xingzhi Song, and Yue Liu. 05/24/2012 9. License Copyright(c) 2012, by Human Genome Sequencing Center, Baylor College of Medicine. All rights reserved. 10. References BWA http: //bio-bwa.sourceforge.net/bwa.shtml Phrap http: //www.phrap.org/phredphrap/phrap.html Newbler http://my454.com/products/analysis-software/index.asp Velvet http://www.ebi.ac.uk/~zerbino/velvet/ Crossmatch http://www.phrap.org/phredphrap/general.html