This web page contains the following sections, each with a short
description of how the programs work:
Assemblers
EST Assemblers
Scaffolders
Other Programs
Links
PHRAP assembles only complete reads, that is, it does not trim the reads prior to assembly, so problematic reads, such as vector-contaminated reads, must be dealt with before assembly (e.g. by cross_match, a part of the phrap package). PHRAP uses read quality data (from phred, also part of the phrap package) to assign ``quality'' values (Log Likelihood Ratios, LLRs) to matches, and then assembles the matching reads into contigs with a greedy algorithm based on the LLR values. A script, phredPHRAP, is available that automates the process.
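As a rough illustration only (not PHRAP's actual code), the sketch below greedily merges reads into contigs in order of decreasing overlap score; the overlap detection and the LLR scoring themselves are assumed to be given.

```python
# Sketch of greedy assembly driven by scored overlaps; the scores stand in for
# PHRAP's LLR values and are assumed to be computed elsewhere.
def greedy_assembly(reads, overlaps):
    """reads: read ids; overlaps: (score, read_a, read_b) tuples."""
    contig_of = {r: {r} for r in reads}        # each read starts as its own contig
    for score, a, b in sorted(overlaps, reverse=True):
        ca, cb = contig_of[a], contig_of[b]
        if ca is cb:                           # the reads are already in one contig
            continue
        merged = ca | cb                       # greedily join the two contigs
        for r in merged:
            contig_of[r] = merged
    return {frozenset(c) for c in contig_of.values()}

print(greedy_assembly(["r1", "r2", "r3", "r4"],
                      [(35.0, "r1", "r2"), (20.5, "r2", "r3")]))
```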
PHRAP was used to assemble C. elegans [CESC1998],
and in the preassembly of the human genome [IHGSC2001].
[PHRAP Homepage]
After this, the repeats are unmasked and the contigs are processed locally with a (100 bp) sliding window. Discrepancies between the contigs and the repeats are reassembled within the window only, which produces a more accurate local assembly. Clone-end-pairing information is then used to fill gaps due to masking and to build scaffolds.
The RePS assembler was applied to the 4.2× coverage WGS sequence data of the rice genome [Wang2002]. RePS has a small computational load and minimizes the chance of making false joins [Yu2002]. It leaves large repeat clusters unassembled, and can be an appropriate approach if detailed knowledge of the structure of the repeats is not a high priority [Wang2002].
The Phusion assembler was applied to assemble the mouse genome from whole-genome shotgun sequences [Mullikin2003].
First, vector and low-quality sequences are identified and discarded. The reads are initially associated with each other by using a table to find a minimum of 10 exactly matching 16-mers, followed by a banded Smith-Waterman alignment. Sequences flanking gaps are locally assembled with PHRAP, permitting relatively short or low-quality/repetitive sequences if they are supported by mate-pair constraints [Taylor2002]. In the malign module of the JAZZ assembler, the consistency of sequence overlaps and mate-pair constraints is maximized by iteratively building and breaking sequence contigs and scaffolds and progressively including lower-quality data. Read layout and contig scaffolding are performed by a module called Graphy, using an approach similar to the ARACHNE and Celera assemblers. Consensus generation from the read layout applies base quality values from the reads, resulting in contig assemblies that include quality values.
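As a rough sketch of this kind of k-mer prefiltering (not the assembler's actual code), read pairs sharing at least a threshold number of identical 16-mers can be collected from a k-mer table and passed on to a banded Smith-Waterman alignment, which is omitted here:

```python
from collections import defaultdict
from itertools import combinations

K, MIN_SHARED = 16, 10   # values taken from the description above

# Sketch: collect read pairs that share at least MIN_SHARED identical 16-mers.
def candidate_pairs(reads):
    kmer_table = defaultdict(set)              # 16-mer -> set of read ids
    for rid, seq in reads.items():
        for i in range(len(seq) - K + 1):
            kmer_table[seq[i:i + K]].add(rid)
    shared = defaultdict(int)                  # (read_a, read_b) -> shared 16-mer count
    for rids in kmer_table.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1
    return [pair for pair, n in shared.items() if n >= MIN_SHARED]

common = "ACGTACGTTGCAGTCAGGATCCAAGTTCAC"       # a shared 30 bp stretch
print(candidate_pairs({"r1": common + "GGGG", "r2": "TTTT" + common}))
```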
A presentation of JAZZ is available at the [Jazz
presentation]. Apparently it is not possible to download JAZZ;
instead, JGI offers to sequence genomic regions of strong
scientific value that are provided by others.
[Homepage].
CAP3 produces fewer errors than PHRAP, its scaffold construction is
easier [Huang1999],
and the program can be used within GAP4 from the Staden package.
[CAP3 Homepage].
The TIGR Assembler 2.0 uses a BLAST-like method to compare
fragments. It divides the dataset into repeat-containing and
non-repeat-containing fragments, can change the match criteria for
merging fragments, and can handle alternative splicing. The TIGR
Assembler 2.0 also gives additional information about pairs of sequences
that are probably chimeric, repeats or splice variants.
[TIGR Homepage].
[Helpfile].
The paired-read information is then used to validate the merging of reads into contigs. Read pairs that do not cross a marked repeat boundary are merged. In theory, repeat contigs will be created in this step; after marking these, the remaining contigs will be unique contigs [Batzoglou2002] (unitigs in the terminology of [Myers2000]). By using forward-reverse links from plasmid reads, ARACHNE orders and orients the contigs into longer layouts called supercontigs (scaffolds). The consensus sequence with quality scores is then created by converting pairwise alignments of reads into multiple alignments.
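A minimal, generic sketch of the scaffolding idea (not ARACHNE's actual algorithm): contigs connected by forward-reverse links are grouped into supercontigs, and a relative orientation is propagated along the links; link bundling, gap-size estimation and conflict resolution are omitted.

```python
from collections import defaultdict, deque

# Sketch: group linked contigs into scaffolds and propagate a relative
# orientation ("+" or "-") from an arbitrary starting contig.
# Each link is (contig_a, contig_b, same_orientation).
def build_scaffolds(contigs, links):
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))
    orientation, scaffolds = {}, []
    for start in contigs:
        if start in orientation:
            continue
        orientation[start] = "+"
        scaffold, queue = [start], deque([start])
        while queue:                              # breadth-first walk over the links
            cur = queue.popleft()
            for nxt, same in graph[cur]:
                if nxt in orientation:
                    continue                      # conflicting links are ignored here
                flipped = "-" if orientation[cur] == "+" else "+"
                orientation[nxt] = orientation[cur] if same else flipped
                scaffold.append(nxt)
                queue.append(nxt)
        scaffolds.append([(c, orientation[c]) for c in scaffold])
    return scaffolds

print(build_scaffolds(["c1", "c2", "c3"],
                      [("c1", "c2", True), ("c2", "c3", False)]))
```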
[ARACHNE2
Homepage].
The Euler assembler creates a virtual Sequencing By Hybridization (SBH) problem by breaking the reads into overlapping n-mers. A de Bruijn graph is built, in which each edge corresponds to an n-mer from one of the original sequence reads. The source and destination nodes correspond, respectively, to the (n-1)-prefix and the (n-1)-suffix of the corresponding n-mer. The original DNA sequence is reconstructed by finding a path that uses all the edges exactly once - an Eulerian path [Pop2002], [Pevzner2001a], [Pevzner2001b]. In case of multiple available paths, read-pair information from double-barreled DNA sequence data can be used to find the correct paths. The Euler assembler outputs a graph that satisfies all read-pair information and lets one filter out the wrong read-pairs [Pevzner2001b].
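A minimal sketch of the de Bruijn construction described above, with a toy n and toy reads: every n-mer in a read becomes an edge from its (n-1)-prefix to its (n-1)-suffix.

```python
from collections import defaultdict

# Sketch: build the de Bruijn graph in which each n-mer of a read contributes
# an edge from its (n-1)-prefix node to its (n-1)-suffix node.
def de_bruijn_graph(reads, n):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - n + 1):
            nmer = read[i:i + n]
            graph[nmer[:-1]].append(nmer[1:])
    return graph

# Toy example: two overlapping reads; the chain of nodes
# ATG -> TGG -> GGC -> GCA spells out the combined sequence ATGGCA.
print(dict(de_bruijn_graph(["ATGGC", "TGGCA"], 4)))
```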
[Euler Homepage]
It is divided into 8 steps. First the reads are trimmed, so that the remaining reads have sufficiently long non-contaminant regions of high quality. In the overlap step, candidate overlaps are found by comparing reads sharing a rare k-mer; the overlaps are evaluated by performing a banded alignment, repeated sequences are rejected, and details are recorded that help sort out borderline cases [Havlak2004]. Next, by combining WGS data with the light sequence coverage of individual BACs (BAC skims), ``enriched BACs'' (eBACs) are created. Read-pair mates are added to the WGS reads with the best overlap to BAC skim reads [Rat Genome Sequencing Project Consortium, 2004]. The WGS and skim reads are assembled with PHRAP. Overlapping eBACs are found based on information about shared reads, and the overlaps are confirmed by using BLASTZ to align the eBACs, creating BAC contigs (bactigs). The reads and the contigs in the bactigs are assembled with PHRAP. Bactigs are linked by read pairs and the BAC skim read distribution to create superbactigs. Finally, ultrabactigs are built and mapped to the chromosomes with the use of map and synteny data [Havlak2004], which were also used to verify the assembly.
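As a small illustration of the shared-read criterion only (the threshold below is a made-up placeholder, and the BLASTZ confirmation step is omitted), pairs of eBACs can be flagged as candidate overlaps when they contain enough reads in common:

```python
from itertools import combinations

# Sketch: eBACs that share at least `min_shared` read ids are treated as
# candidate overlaps; the threshold is a made-up placeholder.
def candidate_ebac_overlaps(ebac_reads, min_shared=5):
    """ebac_reads maps an eBAC id to the set of read ids it contains."""
    pairs = []
    for a, b in combinations(sorted(ebac_reads), 2):
        if len(ebac_reads[a] & ebac_reads[b]) >= min_shared:
            pairs.append((a, b))
    return pairs

print(candidate_ebac_overlaps({"eBAC1": set(range(100)),
                               "eBAC2": set(range(95, 200)),
                               "eBAC3": set(range(300, 400))}))
```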
GAP4 provides a range of tools to aid in assembly; besides the internal assembler it can also use PHRAP, CAP2, CAP3 or FAKII. The internal assembler works by aligning each read to previously processed reads or contigs: if a read aligns well to one contig it is merged into that contig, two contigs are merged if the read aligns consistently with both of them, and a new ``contig'' is formed by the read if no alignment can be found.
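A minimal sketch of that three-way placement decision (not GAP4's actual code; find_matching_contigs is a hypothetical helper standing in for the alignment step):

```python
# Sketch of the placement rule described above; contigs are lists of reads and
# `find_matching_contigs` is a hypothetical helper returning the contigs that
# a read aligns well to.
def place_read(read, contigs, find_matching_contigs):
    hits = find_matching_contigs(read, contigs)
    if not hits:
        contigs.append([read])        # no alignment: the read starts a new contig
    elif len(hits) == 1:
        hits[0].append(read)          # one good hit: merge the read into that contig
    else:
        merged = hits[0]
        for other in hits[1:]:        # consistent hits to several contigs: join them
            merged.extend(other)
            contigs.remove(other)
        merged.append(read)
    return contigs
```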
In addition to assembly, GAP4 has many interactive tools to view and modify the assembly. It is possible to manually find contigs or repeats and to join contigs. Read-pair and template information can be used in assembly of the contigs, and the reads can be compared with the consensus segments they overlap. Several plots of the assembly coverage and quality can be displayed. Some of the steps in pregap4 can use programs other than those provided in the Staden package. A newer version, gap4.new, uses a new probabilistic consensus algorithm that utilizes the base quality information, cutting down on manual editing.
GAP was originally described in [Bonfield1995].
[GAP4 Homepage]
The next part of the algorithm attempts to identify sequence positions where the observed base differences cannot be explained by erroneous basecalls alone; these are marked as defined nucleotide positions (DNPs). The DNPs are found by multiply aligning all the reads overlapping a position and comparing the observed base differences with those expected from the quality values; if a significant difference is found, the reads are likely to be from different repeat regions of the genome.
Using this information TRAP can assemble even highly similar repetitive regions into contigs, which is done by a greedy algorithm. However, TRAP needs considerable coverage to be able to distinguish between the different reads [Tammi2003].
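A minimal sketch of the underlying comparison (not TRAP's actual statistical test): the phred quality values of an alignment column give an expected number of erroneous bases, and a column with clearly more disagreements than expected is flagged as a candidate DNP.

```python
# Sketch of quality-based detection of candidate DNP columns (not TRAP's
# actual statistics). Each column is a list of (base, phred_quality) pairs,
# and the thresholds are made-up placeholders.
def is_candidate_dnp(column, factor=3.0, min_diffs=2):
    bases = [b for b, _ in column]
    consensus = max(set(bases), key=bases.count)
    observed = sum(1 for b in bases if b != consensus)
    # phred Q -> error probability 10^(-Q/10); summing gives the expected
    # number of erroneous bases in the column
    expected = sum(10 ** (-q / 10) for _, q in column)
    return observed >= min_diffs and observed > factor * expected

column = [("A", 30), ("A", 25), ("C", 35), ("C", 30), ("A", 20)]
print(is_candidate_dnp(column))  # two high-quality disagreements -> True
```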
After sequence cleaning and trimming, potential overlaps are found by creating a histogram of distances between 8-mers, and the potential overlaps are examined with a banded Smith-Waterman algorithm. Overlaps are scored and contigs are built iteratively. Mira tries to resolve ambiguous bases in the multiple alignment process by examining the trace files of the involved reads. From these it attempts to distinguish true SNPs from faulty basecalls, and uses the SNP information to differentiate between repetitive regions [Chevreux2004].
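A minimal sketch of a banded Smith-Waterman score (the match, mismatch and gap values are arbitrary placeholders, not those used by any particular assembler): the dynamic programming matrix is only evaluated near the main diagonal, which keeps the overlap check fast.

```python
# Sketch of a banded Smith-Waterman-style local alignment score: only cells
# within `band` of the main diagonal are computed (scores are placeholders).
def banded_sw_score(a, b, band=10, match=1, mismatch=-2, gap=-3):
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(banded_sw_score("ACGTACGT", "TTACGTTCGTAA"))
```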
[mira
Assembler Homepage]
In the clustering step, a stringent fast pairwise alignment between sequences sharing significant regions of near identity is performed. This is done by using mgblast (a modified version of megablast) [Zhang2000], [Pertea2003].
To avoid chimeric assemblies and to help create smaller and better partitioned clusters, known full-length transcripts can be used for 'seeded clustering'. In the seeded clustering it is assumed that a complete gene transcript has nearly perfect identity with all ESTs from that gene; lateral extension of seeds is limited to nearly perfect alignments [Pertea2003].
In the assembly phase each cluster is assembled using the CAP3
assembly program [Huang1999]. Assembling each cluster individually
has the advantage of producing larger, more complete
consensus sequences while eliminating potentially misclustered
sequences [Pertea2003].
[TGICL Homepage]
The program builds a distributed representation of a generalized suffix tree data structure based on the EST sequences. In the beginning, each cluster consists of one EST. The program then performs an early identification of EST pairs that are likely to merge clusters; this is done in parallel, by generating pairs ranked by their maximal common substring length. If an EST pair shows a significant pairwise alignment, the clusters from which the ESTs originate are merged. This process is continued until no further merges are possible; in this way it is not necessary to make pairwise alignments of all EST combinations. PaCE can be extended to predict alternative splicing sites, but does not use quality values [Kalyanaraman2003].
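A minimal sketch of the cluster-merging bookkeeping (union-find); the ranked candidate pairs and the pairwise alignment test are assumed to be given, and the suffix-tree machinery and the parallelization are omitted.

```python
# Sketch of merging EST clusters with union-find; `ranked_pairs` is assumed to
# be ordered by maximal common substring length and `aligns_well` stands in
# for the pairwise alignment test.
def cluster_ests(est_ids, ranked_pairs, aligns_well):
    parent = {e: e for e in est_ids}           # each EST starts in its own cluster

    def find(x):                               # find the cluster representative
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in ranked_pairs:
        if find(a) != find(b) and aligns_well(a, b):
            parent[find(a)] = find(b)          # only merge clusters that differ
    clusters = {}
    for e in est_ids:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())

print(cluster_ests(["e1", "e2", "e3"], [("e1", "e2")], lambda a, b: True))
```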
One advantage of this approach, as opposed to the traditional
consensus motif approach, is that no information is lost when the
graph is constructed. Another benefit is that the computational time
is reduced for similar sequences. POA has been used to investigate
similarities in protein domain and EST sequences [Lee2002].
[POA Homepage]
Due to its more stringent clustering method, distiller can have problems clustering ESTs from weakly expressed (low-coverage) genes. It was used to analyze EST sequences from the Xenopus tropicalis project [Gilchrist2004].
After the primary assembly into contigs has been performed, larger structures (scaffolds) can be constructed from the individual contigs by linking the contigs together using information other than the sequence reads.
As Bambus is designed to work independently of the assembler, it is possible for the user to influence the scaffolding algorithm and the parameters used therein. Bambus typically uses clone mate information to infer links between contigs, but other sources of information can be used in scaffolding. It uses a greedy algorithm to assign contigs to a scaffold, but the order in which the contigs are processed is decided by parameters provided by the user.
A novel feature of Bambus is that it does not necessarily create a linear scaffold at all costs, but instead marks ambiguous parts of the scaffold for further analysis. These ambiguities (tangles) can be a result of repeats or of different haplotypes being sequenced, and as such the tangles can give further information. A part of the Bambus package, Untangle, is able to process the ambiguous parts into linear scaffolds [Pop2004]. Bambus is now part of the AMOS package.
First the sequence is decontaminated and repeat masked using RepeatMasker. Then mRNA, EST, BAC end and paired plasmid reads are aligned against the initial sequence contigs, by building a list of 10-mers, and then aligning likely candidates. In the third step an input directory structure is created by using the Washington University map and other data. After this, the initial sequence contigs within each fingerprint clone contig are aligned against each other. Then GigAssembler merges the overlapping initial sequence contigs within each fingerprint clone contig. The resulting contigs are then ordered and oriented into scaffolds. Finally, the contig assemblies are combined into full chromosome assemblies. The program is unable to detect misassemblies or chimerism in the initial sequence contigs [Kent2001].
First the input sequences are aligned against the genome by using BLAT C/S to find suspicious alignments; suspicious alignments are corrected by using SIM4. Next the EST clustering is performed. Only sequences with introns are considered, since most contaminated ESTs are expected to be unspliced. Sequences sharing a splice site are grouped together to produce primary clusters, allowing variations within +/- 6 bp. The connectivity of exons in each primary cluster is represented as a directed acyclic graph. All possible paths along exons are found using the depth-first search method; each path represents a potential splice variant. Exons without clone coverage are trimmed away from the model, and the results are compared with other gene models as a redundancy check. The presence of polyA tails and the GT-AG consensus in the intron sequences is used to determine the gene boundaries and the direction of the gene. Unspliced sequences with correct annotation are then added, without changing the exon-intron boundaries of existing gene models. The primary clusters obtained up to this stage correspond to multi-exon genes, whose subclusters represent splice variants. The remaining unspliced sequences are further clustered according to overlap in the genomic loci. The resulting clusters, representing single-exon genes, are added to the list of primary clusters [Kim2004].
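A minimal sketch of the path enumeration: all paths from a start exon to an end exon in the directed acyclic exon graph are listed by depth-first search, each path being a candidate splice variant (the construction of the exon graph itself is not shown, and real clusters can have several start and end exons).

```python
# Sketch: enumerate all paths through a directed acyclic exon graph by
# depth-first search; each path is a candidate splice variant.
def all_paths(graph, start, end, path=None):
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        paths.extend(all_paths(graph, nxt, end, path))
    return paths

# Toy exon graph: exon2 can be skipped, giving two splice variants.
exon_graph = {"exon1": ["exon2", "exon3"], "exon2": ["exon3"], "exon3": []}
print(all_paths(exon_graph, "exon1", "exon3"))
```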
[ASmodeler Homepage]
The algorithm uses POA (Partial Order Alignment) to align the EST (and mRNA) data to the genomic sequence, using full dynamic programming with gap penalties that allow intron sequences. After this, adjacent intervals that appear to be exons are merged, and a splice graph, with exons as nodes, is constructed [Heber2002]. The splice graph is then processed with the Heaviest Bundling (HB) algorithm to find the most likely spliceform (the isoform).
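The details of Heaviest Bundling are given in the cited papers; as a rough, hedged illustration of the underlying idea only, the sketch below finds a heaviest path through a small exon DAG (given in topological order) by dynamic programming, with made-up read-support weights.

```python
# Rough sketch (not the actual Heaviest Bundling algorithm): find a heaviest
# path through an exon DAG whose nodes are listed in topological order and
# weighted by hypothetical read-support counts.
def heaviest_path(topo_order, edges, weight):
    best = {n: weight[n] for n in topo_order}   # best score of a path ending at n
    back = {n: None for n in topo_order}
    for n in topo_order:                        # relax edges in topological order
        for succ in edges.get(n, []):
            if best[n] + weight[succ] > best[succ]:
                best[succ] = best[n] + weight[succ]
                back[succ] = n
    end = max(best, key=best.get)               # trace back the heaviest path
    path = []
    while end is not None:
        path.append(end)
        end = back[end]
    return list(reversed(path))

weights = {"e1": 10, "e2": 2, "e3": 8}
print(heaviest_path(["e1", "e2", "e3"], {"e1": ["e2", "e3"], "e2": ["e3"]}, weights))
```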
The transcript sequence of each isoform is constructed by assembling
the genomic sequence intervals that constitute its exons. The protein
sequences are generated by searching for the longest ORFs within each
transcript [Xing2004].
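A minimal sketch of a longest-ORF search on the forward strand of a transcript (a real implementation would also consider the reverse strand and handle ORFs lacking a stop codon):

```python
import re

# Sketch: find the longest open reading frame (ATG ... in-frame stop codon) on
# the forward strand; the lookahead makes overlapping candidates visible.
def longest_orf(transcript):
    orfs = re.findall(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))", transcript)
    return max(orfs, key=len, default="")

print(longest_orf("CCATGAAATTTGGGTAACC"))   # -> ATGAAATTTGGGTAA
```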
[ASP database Homepage]