|
Overview
|
Recent advances in Next Generation Sequencing (NGS) technology have led to
affordable desktop-sized sequencers with low running costs and high
throughput. These sequencers produce small fragments (reads) of the genome
being sequenced. By mapping these reads to a reference genome, we can
reconstruct the DNA sequence of a new individual. NGS machines make it
possible to conduct such studies at a mass scale. This is expected to usher
in an era of personal genomics, in which each individual can have their DNA
sequenced and studied to develop more personalized ways of anticipating,
diagnosing and treating diseases.
Studies of this nature have already begun. Two recent examples follow.
James Lupski, a physician-scientist who suffers from a neurological
disorder called Charcot-Marie-Tooth, found the genetic cause of his disease
by sequencing his entire genome (late 2009). Another study, the first to
describe the genomes of an entire family of four, confirmed the genetic
root of a rare disease, Miller syndrome, afflicting both children
(March 2010).
A number of companies build sequencers: 454, Illumina and Applied
Biosystems, to name a few. Both the throughput and the read lengths of
these sequencers are increasing at a pace that puts even Moore's law to
shame. Hence there is a growing need for tools that can handle longer
reads and still keep pace with the sequencers.
|
AGILE
|
AGILE is a sequence mapping tool specifically designed to map longer reads
(read length > 200) to a given reference genome. Currently it works for
454 reads, but efforts are under way to make it suitable for all
sequencers that produce longer reads. Given the current trend of
increasing read lengths, most sequencers will soon produce reads longer
than 200. Compared with existing tools, the most significant features of
AGILE are:
- High flexibility. AGILE allows a large number of mismatches and
insertions/deletions in mapping. The current version has been tested with
up to 10% differences in mapping.
- High sensitivity. AGILE correctly maps about 99.8% of reads.
- Ability to handle large datasets. We have successfully tested AGILE with
the human genome and a batch of one million queries.
- Speed. AGILE can map approximately 1.1 million reads of length 500 each
to a reference human genome per hour, which is about 550 million bases per
hour. At this rate, AGILE needs only about 6 hours for 1X coverage of the
human genome.
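The speed figures above can be checked with a few lines of arithmetic. The sketch below uses the reads-per-hour and read-length values quoted in the list; the genome size is an assumption (roughly 3.1 billion bases for the human genome) and is not stated on this page.

```python
# Figures quoted in the feature list above.
READS_PER_HOUR = 1_100_000
READ_LENGTH = 500

# Assumption: approximate human genome size in bases (not given on this page).
GENOME_SIZE = 3_100_000_000


def bases_per_hour(reads_per_hour: int, read_length: int) -> int:
    """Throughput in bases per hour for fixed-length reads."""
    return reads_per_hour * read_length


def hours_for_coverage(genome_size: int, throughput: int, coverage: float = 1.0) -> float:
    """Hours needed to map enough bases for the requested coverage."""
    return coverage * genome_size / throughput


throughput = bases_per_hour(READS_PER_HOUR, READ_LENGTH)
hours = hours_for_coverage(GENOME_SIZE, throughput)
print(f"{throughput:,} bases per hour, about {hours:.1f} hours for 1X coverage")
```

Under this assumed genome size the estimate comes out slightly below the 6 hours quoted above, which is consistent with a rounded figure.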
|
Download AGILE
|
Please download AGILE here: AGILE_Linux_x86_64_0.4.0
This version works on 64-bit Linux operating systems.
|
Download Data
|
The FASTA version of the human genome hg19 can be downloaded here: hg19_unmasked
|
A sample FASTA file containing real 454 reads can be downloaded here: SRR005010_15.fa
|
Third party Galaxy Wrapper for AGILE
|
A third-party Python script and XML wrapper can be downloaded here (thanks to Simon Lank from O'Connor Lab, WNPRC, Madison WI for writing them): AGILE_Galaxy_wrapper
|
So, how do I use it?
|
Usage:
|
agile database query [options] output_file > mapping_quality_file
|
where:
|
database and query are each either a .fa, .nib or .2bit file,
|
or a list of these files, one file name per line.
|
output_file is the file for the mapped result.
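As an illustration of the usage line above, the following sketch assembles an AGILE command line from Python without running it. The helper `build_agile_command` and the file names `hg19.fa`, `mapped.psl` and `mapping_quality.txt` are hypothetical; only the `agile` argument order and the option names listed in the Options section come from this page.

```python
import shlex


def build_agile_command(database, query, output_file, tile_size=None,
                        max_sims=None, max_freq=None,
                        all_alignments=False, out_format=None):
    """Return the argument list: agile database query [options] output_file."""
    args = ["agile", database, query]
    if tile_size is not None:
        args.append(f"-tileSize={tile_size}")
    if max_sims is not None:
        args.append(f"-maxSIMs={max_sims}")
    if max_freq is not None:
        args.append(f"-maxFreq={max_freq}")
    if all_alignments:
        args.append("-all")
    if out_format is not None:
        args.append(f"-out={out_format}")
    args.append(output_file)
    return args


# Hypothetical file names, used only for illustration.
cmd = build_agile_command("hg19.fa", "SRR005010_15.fa", "mapped.psl",
                          max_sims=5, all_alignments=True, out_format="psl")

# The mapping quality information goes to standard output, so it is
# redirected to a file in the shell form of the command.
print(shlex.join(cmd) + " > mapping_quality.txt")
```

Options left as `None` are omitted, so AGILE's own defaults apply.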
|
Options:
|
-tileSize=k
|
sets the length of the tuples used to create the hash table.
Usually between 11 and 20 (default 16).
|
-maxSIMs=n
|
sets the maximum number of SIMs (single imperfect matches) allowed. These
include mismatches and indels
(default 5 with the -all option and 100 without it).
|
-maxFreq=F
|
sets the maximum number of occurrences allowed for a pattern (k-tuple).
k-tuples that occur more than F times are marked as overused and ignored.
The default value depends on the read length (for example, F = 8 for a
read length of 500).
|
-all
|
If this option is used, the program outputs all the alignments that satisfy
the -maxSIMs=n limit. Otherwise, the program tries to find the best
alignment and outputs it along with all other alignments that have the same
score as the best one.
|
-out=type
|
sets the output file format. Type is one of:
psl - Default. Tab-separated format, no sequence
pslx - Tab-separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
|
|
Publications
|
- Sanchit Misra, Ankit Agrawal, Wei-keng Liao, Alok Choudhary. Anatomy of a
Hash-based Long Read Sequence Mapping Algorithm for Next Generation
DNA Sequencing. Bioinformatics 2010; doi:
10.1093/bioinformatics/btq648.
- Sanchit Misra, Ramanathan Narayanan, Wei-keng Liao, Alok Choudhary and
Simon Lin. pFANGS: Parallel High Speed Sequence Mapping for Next
Generation 454-Roche Sequencing Reads. In Proc. Ninth IEEE
International Workshop on High Performance Computational Biology
(IPDPS 2010), April, 2010, Atlanta, GA.
- Sanchit Misra, Ramanathan Narayanan, Simon Lin and Alok Choudhary. FANGS:
High Speed Sequence Mapping for Next Generation Sequencing Reads.
In Proceedings of ACM Symposium of Applied Computing (ACM SAC),
March 22-26, 2010, Sierre, Switzerland.
|
Current Status
|
- The current sequential version of AGILE (AGILE_0.4.0) has been tested to
handle up to 10% error. It has a throughput of about 0.5 gigabytes per hour.
- The latest version comes with two modes: 1) output the best alignment and
all the alignments with the same score as the best one, and
2) output all alignments that satisfy a "minimum similarity criterion".
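The difference between the two modes can be sketched as a selection over candidate alignments. This is an illustration only: the `Alignment` record, its `score` field and the `select_alignments` helper are hypothetical, and the sketch assumes that the similarity criterion is a cap on the number of SIMs, as the -maxSIMs and -all options suggest.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    target: str   # hypothetical label for the mapped location
    score: int    # hypothetical alignment score, higher is better
    sims: int     # number of single imperfect matches


def select_alignments(candidates: List[Alignment], max_sims: int,
                      report_all: bool = False) -> List[Alignment]:
    """Mode 2 (report_all) keeps every alignment within max_sims;
    mode 1 keeps the best-scoring alignment and any ties."""
    within = [a for a in candidates if a.sims <= max_sims]
    if report_all or not within:
        return within
    best = max(a.score for a in within)
    return [a for a in within if a.score == best]


candidates = [
    Alignment("chr1:100", 480, 4),
    Alignment("chr2:500", 480, 4),
    Alignment("chr3:900", 470, 5),
    Alignment("chr4:50", 300, 40),
]

best_mode = select_alignments(candidates, max_sims=100)
all_mode = select_alignments(candidates, max_sims=5, report_all=True)
print([a.target for a in best_mode])
print([a.target for a in all_mode])
```

The two calls use the documented default limits: 100 SIMs without -all and 5 SIMs with it.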
|
|