AGILE: AliGnIng Long rEads

Overview

Recent advances in next-generation sequencing (NGS) technology have led to affordable, desktop-sized sequencers with low running costs and high throughput. These sequencers produce short fragments (reads) of the genome being sequenced. By mapping these reads to a reference genome, we can reconstruct the DNA sequence of a new individual. NGS sequencers make it possible to conduct such studies at a mass scale. This is expected to usher in an era of personal genomics, in which each individual can have his or her DNA sequenced and studied to find more personalized ways of anticipating, diagnosing and treating diseases.

Studies of this nature have already begun. Two recent examples follow. James Lupski, a physician-scientist who suffers from a neurological disorder called Charcot-Marie-Tooth disease, found the genetic cause of his disease by sequencing his entire genome (late 2009). Another study, the first to describe the genomes of an entire family of four, confirmed the genetic root of a rare disease called Miller syndrome, which afflicts both children (March 2010).

Several companies build sequencers: 454, Illumina and Applied Biosystems, to name a few. The throughput and read lengths of these sequencers are increasing at a pace that puts even Moore's law to shame. Hence there is a growing need for tools that can handle longer reads and still keep pace with the sequencers.

AGILE

AGILE is a sequence mapping tool specifically designed to map longer reads (read length > 200) to a given reference genome. Currently it works for 454 reads, but efforts are being made to make it suitable for all sequencers that produce longer reads. Given the current trend of increasing read lengths, most sequencers will soon produce reads longer than 200. In comparison with existing tools, the most significant features of AGILE are:

High flexibility. It allows a large number of mismatches and insertions/deletions in mapping. The current version of AGILE has been tested to work with up to 10% differences in mapping.
High sensitivity. AGILE correctly maps about 99.8% of reads.
Ability to handle large datasets. We have successfully tested AGILE with the human genome and a batch of a million queries.
Speed. Using AGILE, we can map approximately 1.1 million reads of length 500 each to a reference human genome per hour, which is about 550 million bases per hour. At this rate, AGILE needs only about 6 hours for 1X coverage of the human genome.
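
The throughput figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the read rate and read length are the figures quoted above, while the genome size of roughly 3.1 billion bases is an assumption that is not stated on this page.

```python
# Figures quoted in the feature list above.
READS_PER_HOUR = 1_100_000  # reads mapped per hour
READ_LENGTH = 500           # bases per read

# Assumption (not from this page): approximate size of the human genome in bases.
GENOME_SIZE = 3_100_000_000


def bases_per_hour(reads_per_hour: int, read_length: int) -> int:
    """Return the number of bases mapped per hour."""
    return reads_per_hour * read_length


def hours_for_coverage(genome_size: int, rate: int, coverage: float = 1.0) -> float:
    """Return the hours needed to map enough bases for the given coverage."""
    return coverage * genome_size / rate


rate = bases_per_hour(READS_PER_HOUR, READ_LENGTH)
hours = hours_for_coverage(GENOME_SIZE, rate)

print(f"{rate:,} bases per hour")
print(f"{hours:.1f} hours for 1X coverage")
```

Under these assumptions the result comes out a little under 6 hours, in line with the figure quoted above.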

Download AGILE

Download AGILE here: AGILE_Linux_x86_64_0.4.0

This version runs on 64-bit Linux operating systems.

Download Data

The FASTA version of the human genome hg19 can be downloaded here: hg19_unmasked

A sample FASTA file containing real 454 reads can be downloaded here: SRR005010_15.fa

Third party Galaxy Wrapper for AGILE

A third-party Python script and XML wrapper can be downloaded here (thanks to Simon Lank from the O'Connor Lab, WNPRC, Madison, WI, for writing them): AGILE_Galaxy_wrapper

So, how do I use it?

Usage:
agile database query [options] output_file > mapping_quality_file
where:
database and query are each either a .fa, .nib or .2bit file,
or a list of these files, one file name per line.
output_file is the file for the mapped result.
Options:
-tileSize=k	sets the length of the k-tuples used to build the hash table. Usually between 11 and 20 (default 16)
-maxSIMs=n	sets the maximum number of SIMs (single imperfect matches) allowed. These include mismatches and indels (default 5 with the -all option and 100 without it)
-maxFreq=F	sets the maximum number of occurrences allowed for a pattern (k-tuple). k-tuples that occur more than F times are marked as overused and ignored. The default value depends on the read length (for example, F = 8 for a read length of 500).
-all	If this is used, the program outputs all alignments that satisfy maxSIMs=n. If it is not used, the program outputs the best alignment it can find, along with all other alignments that have the same score as the best one.
-out=type	sets the output file format. Type is one of:
	psl - Default. Tab-separated format, no sequence
	pslx - Tab-separated format with sequence
	axt - blastz-associated axt format
	maf - multiz-associated maf format
	sim4 - similar to sim4 format
	wublast - similar to wublast format
	blast - similar to NCBI blast format
	blast8 - NCBI blast tabular format
	blast9 - NCBI blast tabular format with comments
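
As an illustration of the usage line above, the following sketch assembles an agile command line from the documented options. The helper name build_agile_command and the file names hg19.fa, mapped.psl and mapping_quality.txt are hypothetical, and the executable is assumed to be on the PATH as agile. The code only builds and prints the command; it does not run it.

```python
import shlex
from typing import List, Optional


def build_agile_command(
    database: str,
    query: str,
    output_file: str,
    tile_size: Optional[int] = None,
    max_sims: Optional[int] = None,
    max_freq: Optional[int] = None,
    report_all: bool = False,
    out_format: Optional[str] = None,
    executable: str = "agile",  # assumption: the binary is on PATH under this name
) -> List[str]:
    """Build an argument list following: agile database query [options] output_file."""
    cmd = [executable, database, query]
    if tile_size is not None:
        cmd.append(f"-tileSize={tile_size}")
    if max_sims is not None:
        cmd.append(f"-maxSIMs={max_sims}")
    if max_freq is not None:
        cmd.append(f"-maxFreq={max_freq}")
    if report_all:
        cmd.append("-all")
    if out_format is not None:
        cmd.append(f"-out={out_format}")
    cmd.append(output_file)
    return cmd


# Hypothetical file names, chosen only for this example.
cmd = build_agile_command(
    "hg19.fa",
    "SRR005010_15.fa",
    "mapped.psl",
    tile_size=16,
    max_sims=50,
    report_all=True,
    out_format="psl",
)

# Standard output carries the mapping quality information, so redirect it to a file.
print(shlex.join(cmd) + " > mapping_quality.txt")
```

To actually run the command from Python, the list can be passed to subprocess.run with stdout set to an open file handle.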

Publications

Sanchit Misra, Ankit Agrawal, Wei-keng Liao, Alok Choudhary. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing. Bioinformatics 2010; doi: 10.1093/bioinformatics/btq648.
Sanchit Misra, Ramanathan Narayanan, Wei-keng Liao, Alok Choudhary and Simon Lin. pFANGS: Parallel High Speed Sequence Mapping for Next Generation 454-Roche Sequencing Reads. In Proc. Ninth IEEE International Workshop on High Performance Computational Biology (IPDPS 2010), April, 2010, Atlanta, GA.
Sanchit Misra, Ramanathan Narayanan, Simon Lin and Alok Choudhary. FANGS: High Speed Sequence Mapping for Next Generation Sequencing Reads. In Proceedings of ACM Symposium of Applied Computing (ACM SAC), March 22-26, 2010, Sierre, Switzerland.

Current Status

The current sequential version of AGILE (AGILE_0.4.0) has been tested to handle up to 10% error. It has a throughput of about 0.5 gigabytes per hour.
The latest version comes with two modes: 1) output the best alignment and all alignments with the same score as the best one, and 2) output all alignments that satisfy a minimum similarity criterion.
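
The difference between the two modes can be illustrated with a minimal sketch. AGILE's actual scoring function is not described on this page, so the code assumes, purely for illustration, that an alignment's score is its number of SIMs (fewer is better). The Alignment type, the sample data and both function names are hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    target: str  # name of the reference sequence
    start: int   # start position on the reference
    sims: int    # number of single imperfect matches (stand-in for the score)


def best_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Mode 1: keep the best alignment and every alignment tied with it."""
    if not alignments:
        return []
    best = min(a.sims for a in alignments)
    return [a for a in alignments if a.sims == best]


def all_within(alignments: List[Alignment], max_sims: int) -> List[Alignment]:
    """Mode 2: keep every alignment with at most max_sims SIMs."""
    return [a for a in alignments if a.sims <= max_sims]


# Made-up candidate alignments for a single read.
candidates = [
    Alignment("chr1", 1000, 3),
    Alignment("chr2", 5000, 3),
    Alignment("chr7", 42, 5),
    Alignment("chrX", 900, 12),
]

print(best_alignments(candidates))         # the two alignments with 3 SIMs
print(all_within(candidates, max_sims=5))  # the three alignments with at most 5 SIMs
```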