This web page contains the following sections, each with a short
description of how the programs work:
Assemblers
EST Assemblers
Scaffolders
Other Programs
Links
PHRAP assembles only complete reads, that is, it does not trim the reads prior to assembly, so problematic reads, such as vector-contaminated reads, must be dealt with before assembly (e.g. by cross_match, a part of the phrap package). PHRAP uses read quality data (from phred, also part of the phrap package) to assign ``quality'' values (Log Likelihood Ratios, LLRs) to matches, and then assembles the matching reads into contigs with a greedy algorithm based on the LLR values. A script, phredPHRAP, is available that automates the process.
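As a rough illustration only (not PHRAP's actual code), the sketch below greedily merges reads into contigs in order of decreasing overlap score; the overlap detection and the LLR scoring themselves are assumed to be given.

```python
# Sketch of greedy assembly driven by scored overlaps; the scores stand in for
# PHRAP's LLR values and are assumed to be computed elsewhere.
def greedy_assembly(reads, overlaps):
    """reads: read ids; overlaps: (score, read_a, read_b) tuples."""
    contig_of = {r: {r} for r in reads}        # each read starts as its own contig
    for score, a, b in sorted(overlaps, reverse=True):
        ca, cb = contig_of[a], contig_of[b]
        if ca is cb:                           # the reads are already in one contig
            continue
        merged = ca | cb                       # greedily join the two contigs
        for r in merged:
            contig_of[r] = merged
    return {frozenset(c) for c in contig_of.values()}

print(greedy_assembly(["r1", "r2", "r3", "r4"],
                      [(35.0, "r1", "r2"), (20.5, "r2", "r3")]))
```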
PHRAP was used to assemble C. elegans [CESC1998],
and in the preassembly of the human genome [IHGSC2001].
[PHRAP Homepage]
After this, the repeats are unmasked and the contigs are processed locally with a (100 bp) sliding window. Discrepancies between the contigs and the repeats are reassembled within the window only, which produces a more accurate local assembly. Clone-end-pairing information is then used to fill gaps due to masking and to build scaffolds.
The RePS assembler was applied to the 4.2× coverage WGS sequence data of the rice genome [Wang2002]. RePS has a small computational load and minimizes the chance of making false joins [Yu2002]. It leaves large repeat clusters unassembled, and can be an appropriate approach if detailed knowledge of the structure of the repeats is not a high priority [Wang2002].
The Phusion assembler was applied to assemble the mouse genome from whole-genome shotgun sequences [Mullikin2003].
First, vector and low-quality sequences are identified and discarded. The reads are initially associated with each other by using a table to find a minimum of 10 exactly matching 16-mers, followed by a banded Smith-Waterman alignment. Sequences flanking gaps are locally assembled with PHRAP, permitting relatively short or low-quality/repetitive sequences if they are supported by mate-pair constraints [Taylor2002]. In the malign module of the JAZZ assembler, the consistency of sequence overlaps and mate-pair constraints is maximized by iteratively building and breaking sequence contigs and scaffolds and progressively including lower-quality data. Read layout and contig scaffolding are performed by a module called Graphy, using an approach similar to the ARACHNE and Celera assemblers. Consensus generation from the read layout applies base quality values from the reads, resulting in contig assemblies that include quality values.
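As a rough sketch of this kind of k-mer prefiltering (not the assembler's actual code), read pairs sharing at least a threshold number of identical 16-mers can be collected from a k-mer table and passed on to a banded Smith-Waterman alignment, which is omitted here:

```python
from collections import defaultdict
from itertools import combinations

K, MIN_SHARED = 16, 10   # values taken from the description above

# Sketch: collect read pairs that share at least MIN_SHARED identical 16-mers.
def candidate_pairs(reads):
    kmer_table = defaultdict(set)              # 16-mer -> set of read ids
    for rid, seq in reads.items():
        for i in range(len(seq) - K + 1):
            kmer_table[seq[i:i + K]].add(rid)
    shared = defaultdict(int)                  # (read_a, read_b) -> shared 16-mer count
    for rids in kmer_table.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1
    return [pair for pair, n in shared.items() if n >= MIN_SHARED]

common = "ACGTACGTTGCAGTCAGGATCCAAGTTCAC"       # a shared 30 bp stretch
print(candidate_pairs({"r1": common + "GGGG", "r2": "TTTT" + common}))
```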
A presentation of JAZZ is available at the [Jazz
presentation]. Apparently it is not possible to download JAZZ;
instead, JGI offers to sequence genomic regions of strong
scientific value that are provided by others.
[Homepage].
CAP3 produces fewer errors than PHRAP, its scaffold construction is
easier [Huang1999],
and the program can be used within GAP4 from the Staden package.
[CAP3 Homepage].
The TIGR Assembler 2.0 uses a BLAST-like method to compare
fragments. It divides the dataset into repeat-containing and
non-repeat-containing fragments, can change the match criteria for
merging fragments, and can handle alternative splicing. The TIGR
Assembler 2.0 also gives additional information about pairs of sequences
that are probably chimeric, repeats or splice variants.
[TIGR Homepage].
[Helpfile].
The paired-read information is then used to validate the merging of reads into contigs. Read pairs that do not cross a marked repeat boundary are merged. In theory, repeat contigs will be created in this step; after marking these, the remaining contigs will be unique contigs [Batzoglou2002] (unitigs in the terminology of [Myers2000]). By using forward-reverse links from plasmid reads, ARACHNE orders and orients the contigs into longer layouts called supercontigs (scaffolds). The consensus sequence with quality scores is then created by converting pairwise alignments of reads into multiple alignments.
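A minimal, generic sketch of the scaffolding idea (not ARACHNE's actual algorithm): contigs connected by forward-reverse links are grouped into supercontigs, and a relative orientation is propagated along the links; link bundling, gap-size estimation and conflict resolution are omitted.

```python
from collections import defaultdict, deque

# Sketch: group linked contigs into scaffolds and propagate a relative
# orientation ("+" or "-") from an arbitrary starting contig.
# Each link is (contig_a, contig_b, same_orientation).
def build_scaffolds(contigs, links):
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))
    orientation, scaffolds = {}, []
    for start in contigs:
        if start in orientation:
            continue
        orientation[start] = "+"
        scaffold, queue = [start], deque([start])
        while queue:                              # breadth-first walk over the links
            cur = queue.popleft()
            for nxt, same in graph[cur]:
                if nxt in orientation:
                    continue                      # conflicting links are ignored here
                flipped = "-" if orientation[cur] == "+" else "+"
                orientation[nxt] = orientation[cur] if same else flipped
                scaffold.append(nxt)
                queue.append(nxt)
        scaffolds.append([(c, orientation[c]) for c in scaffold])
    return scaffolds

print(build_scaffolds(["c1", "c2", "c3"],
                      [("c1", "c2", True), ("c2", "c3", False)]))
```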
[ARACHNE2
Homepage].
The Euler assembler creates a virtual Sequencing By Hybridization (SBH) problem by breaking the reads into overlapping n-mers. A de Bruijn graph is built, in which each edge corresponds to an n-mer from one of the original sequence reads. The source and destination nodes correspond, respectively, to the (n-1)-prefix and the (n-1)-suffix of the corresponding n-mer. The original DNA sequence is reconstructed by finding a path that uses all the edges exactly once - an Eulerian path [Pop2002], [Pevzner2001a], [Pevzner2001b]. In case of multiple available paths, read-pair information from double-barreled DNA sequence data can be used to find the correct paths. The Euler assembler outputs a graph that satisfies all read-pair information and lets one filter out the wrong read-pairs [Pevzner2001b].
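A minimal sketch of the de Bruijn construction described above, with a toy n and toy reads: every n-mer in a read becomes an edge from its (n-1)-prefix to its (n-1)-suffix.

```python
from collections import defaultdict

# Sketch: build the de Bruijn graph in which each n-mer of a read contributes
# an edge from its (n-1)-prefix node to its (n-1)-suffix node.
def de_bruijn_graph(reads, n):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - n + 1):
            nmer = read[i:i + n]
            graph[nmer[:-1]].append(nmer[1:])
    return graph

# Toy example: two overlapping reads; the chain of nodes
# ATG -> TGG -> GGC -> GCA spells out the combined sequence ATGGCA.
print(dict(de_bruijn_graph(["ATGGC", "TGGCA"], 4)))
```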
[Euler Homepage]
It is divided into 8 steps. First the reads are trimmed, so that the remaining reads have sufficiently long non-contaminant regions of high quality. In the overlap step, candidate overlaps are found by comparing reads sharing a rare k-mer; the overlaps are evaluated by performing a banded alignment, repeated sequences are rejected, and details are recorded that help sort out borderline cases [Havlak2004]. Next, by combining WGS data with the light sequence coverage of individual BACs (BAC skims), ``enriched BACs'' (eBACs) are created. Read-pair mates are added to the WGS reads with the best overlap to BAC skim reads [Rat Genome Sequencing Project Consortium, 2004]. The WGS and skim reads are assembled with PHRAP. Overlapping eBACs are found based on information about shared reads, and the overlaps are confirmed by using BLASTZ to align the eBACs, creating BAC contigs (bactigs). The reads and the contigs in the bactigs are assembled with PHRAP. Bactigs are linked by read pairs and the BAC skim read distribution to create superbactigs. Finally, ultrabactigs are built and mapped to the chromosomes with the use of map and synteny data [Havlak2004], which were also used to verify the assembly.
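As a small illustration of the shared-read criterion only (the threshold below is a made-up placeholder, and the BLASTZ confirmation step is omitted), pairs of eBACs can be flagged as candidate overlaps when they contain enough reads in common:

```python
from itertools import combinations

# Sketch: eBACs that share at least `min_shared` read ids are treated as
# candidate overlaps; the threshold is a made-up placeholder.
def candidate_ebac_overlaps(ebac_reads, min_shared=5):
    """ebac_reads maps an eBAC id to the set of read ids it contains."""
    pairs = []
    for a, b in combinations(sorted(ebac_reads), 2):
        if len(ebac_reads[a] & ebac_reads[b]) >= min_shared:
            pairs.append((a, b))
    return pairs

print(candidate_ebac_overlaps({"eBAC1": set(range(100)),
                               "eBAC2": set(range(95, 200)),
                               "eBAC3": set(range(300, 400))}))
```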
GAP4 provides a range of tools to aid in assembly; besides the internal assembler it can also use PHRAP, CAP2, CAP3 or FAKII. The internal assembler works by aligning each read to previously processed reads or contigs: if a read aligns well to one contig it is merged into that contig, two contigs are merged if the read aligns consistently with both of them, and a new ``contig'' is formed by the read if no alignment can be found.
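A minimal sketch of that three-way placement decision (not GAP4's actual code; find_matching_contigs is a hypothetical helper standing in for the alignment step):

```python
# Sketch of the placement rule described above; contigs are lists of reads and
# `find_matching_contigs` is a hypothetical helper returning the contigs that
# a read aligns well to.
def place_read(read, contigs, find_matching_contigs):
    hits = find_matching_contigs(read, contigs)
    if not hits:
        contigs.append([read])        # no alignment: the read starts a new contig
    elif len(hits) == 1:
        hits[0].append(read)          # one good hit: merge the read into that contig
    else:
        merged = hits[0]
        for other in hits[1:]:        # consistent hits to several contigs: join them
            merged.extend(other)
            contigs.remove(other)
        merged.append(read)
    return contigs
```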
In addition to assembly, GAP4 has many interactive tools to view and modify the assembly. It is possible to manually find contigs or repeats and to join contigs. Read-pair and template information can be used in assembly of the contigs, and the reads can be compared with the consensus segments they overlap. Several plots of the assembly coverage and quality can be displayed. Some of the steps in pregap4 can use programs other than those provided in the Staden package. A newer version, gap4.new, uses a new probabilistic consensus algorithm that utilizes the base quality information, cutting down on manual editing.
GAP was originally described in [Bonfield1995].
[GAP4 Homepage]
The next part of the algorithm attempts to identify sequence positions where the observed base differences cannot be explained by erroneous basecalls alone; these are marked as defined nucleotide positions (DNPs). The DNPs are found by multiply aligning all the reads overlapping a position and comparing the observed base differences with those expected from the quality values; if a significant difference is found, the reads are likely to be from different repeat regions of the genome.
Using this information TRAP can assemble even highly similar repetitive regions into contigs, which is done by a greedy algorithm. However, TRAP needs considerable coverage to be able to distinguish between the different reads [Tammi2003].
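A minimal sketch of the underlying comparison (not TRAP's actual statistical test): the phred quality values of an alignment column give an expected number of erroneous bases, and a column with clearly more disagreements than expected is flagged as a candidate DNP.

```python
# Sketch of quality-based detection of candidate DNP columns (not TRAP's
# actual statistics). Each column is a list of (base, phred_quality) pairs,
# and the thresholds are made-up placeholders.
def is_candidate_dnp(column, factor=3.0, min_diffs=2):
    bases = [b for b, _ in column]
    consensus = max(set(bases), key=bases.count)
    observed = sum(1 for b in bases if b != consensus)
    # phred Q -> error probability 10^(-Q/10); summing gives the expected
    # number of erroneous bases in the column
    expected = sum(10 ** (-q / 10) for _, q in column)
    return observed >= min_diffs and observed > factor * expected

column = [("A", 30), ("A", 25), ("C", 35), ("C", 30), ("A", 20)]
print(is_candidate_dnp(column))  # two high-quality disagreements -> True
```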
After sequence cleaning and trimming, potential overlaps are found by creating a histogram of distances between 8-mers, and the potential overlaps are examined with a banded Smith-Waterman algorithm. Overlaps are scored and contigs are built iteratively. Mira tries to resolve ambiguous bases in the multiple alignment process by examining the trace files of the involved reads. From these it attempts to distinguish true SNPs from faulty basecalls, and uses the SNP information to differentiate between repetitive regions [Chevreux2004].
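A minimal sketch of a banded Smith-Waterman score (the match, mismatch and gap values are arbitrary placeholders, not those used by any particular assembler): the dynamic programming matrix is only evaluated near the main diagonal, which keeps the overlap check fast.

```python
# Sketch of a banded Smith-Waterman-style local alignment score: only cells
# within `band` of the main diagonal are computed (scores are placeholders).
def banded_sw_score(a, b, band=10, match=1, mismatch=-2, gap=-3):
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        lo, hi = max(1, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(banded_sw_score("ACGTACGT", "TTACGTTCGTAA"))
```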
[mira
Assembler Homepage]
In the clustering step, a stringent fast pairwise alignment between sequences sharing significant regions of near identity is performed. This is done by using mgblast (a modified version of megablast) [Zhang2000], [Pertea2003].
To avoid chimeric assemblies and to help create smaller and better partitioned clusters, known full-length transcripts can be used for 'seeded clustering'. In the seeded clustering it is assumed that a complete gene transcript has nearly perfect identity with all ESTs from that gene; lateral extension of seeds is limited to nearly perfect alignments [Pertea2003].
In the assembly phase each cluster is assembled using the CAP3
assembly program [Huang1999]. Assembling each cluster individually
has the advantage of producing larger, more complete
consensus sequences while eliminating potentially misclustered
sequences [Pertea2003].
[TGICL Homepage]
The program builds a distributed representation of a generalized suffix tree data structure based on the EST sequences. In the beginning, each cluster consists of one EST. The program then performs an early identification of EST pairs that are likely to merge clusters; this is done in parallel, by generating pairs ranked by their maximal common substring length. If an EST pair shows a significant pairwise alignment, the clusters from which the ESTs originate are merged. This process is continued until no further merges are possible; in this way it is not necessary to make pairwise alignments of all EST combinations. PaCE can be extended to predict alternative splicing sites, but does not use quality values [Kalyanaraman2003].
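A minimal sketch of the cluster-merging bookkeeping (union-find); the ranked candidate pairs and the pairwise alignment test are assumed to be given, and the suffix-tree machinery and the parallelization are omitted.

```python
# Sketch of merging EST clusters with union-find; `ranked_pairs` is assumed to
# be ordered by maximal common substring length and `aligns_well` stands in
# for the pairwise alignment test.
def cluster_ests(est_ids, ranked_pairs, aligns_well):
    parent = {e: e for e in est_ids}           # each EST starts in its own cluster

    def find(x):                               # find the cluster representative
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in ranked_pairs:
        if find(a) != find(b) and aligns_well(a, b):
            parent[find(a)] = find(b)          # only merge clusters that differ
    clusters = {}
    for e in est_ids:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())

print(cluster_ests(["e1", "e2", "e3"], [("e1", "e2")], lambda a, b: True))
```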
One advantage of this approach, as opposed to the traditional
consensus motif approach, is that no information is lost when the
graph is constructed. Another benefit is that the computational time
is reduced for similar sequences. POA has been used to investigate
similarities in protein domain and EST sequences [Lee2002].
[POA Homepage]
Due to its more stringent clustering method, distiller can have problems clustering ESTs from weakly expressed (low-coverage) genes. It was used to analyze EST sequences from the Xenopus tropicalis project [Gilchrist2004].
After the primary assembly into contigs has been performed, larger structures (scaffolds) can be constructed from the individual contigs by linking the contigs together using information other than the sequence reads.
As Bambus is designed to work independently of the assembler, it is possible for the user to influence the scaffolding algorithm and the parameters used therein. Bambus typically uses clone mate information to infer links between contigs, but other sources of information can be used in scaffolding. It uses a greedy algorithm to assign contigs to a scaffold, but the order in which the contigs are processed is decided by parameters provided by the user.
A novel feature of Bambus is that it does not necessarily create a linear scaffold at all costs, but instead marks ambiguous parts of the scaffold for further analysis. These ambiguities (tangles) can be a result of repeats or of different haplotypes being sequenced, and as such the tangles can give further information. A part of the Bambus package, Untangle, is able to process the ambiguous parts into linear scaffolds [Pop2004]. Bambus is now part of the AMOS package.
First the sequence is decontaminated and repeat masked using RepeatMasker. Then mRNA, EST, BAC end and paired plasmid reads are aligned against the initial sequence contigs, by building a list of 10-mers, and then aligning likely candidates. In the third step an input directory structure is created by using the Washington University map and other data. After this, the initial sequence contigs within each fingerprint clone contig are aligned against each other. Then GigAssembler merges the overlapping initial sequence contigs within each fingerprint clone contig. The resulting contigs are then ordered and oriented into scaffolds. Finally, the contig assemblies are combined into full chromosome assemblies. The program is unable to detect misassemblies or chimerism in the initial sequence contigs [Kent2001].
First the input sequences are aligned against the genome by using BLAT C/S to find suspicious alignments; suspicious alignments are corrected by using SIM4. Next the EST clustering is performed. Only sequences with introns are considered, since most contaminated ESTs are expected to be unspliced. Sequences sharing a splice site are grouped together to produce primary clusters, allowing variations within +/- 6 bp. The connectivity of exons in each primary cluster is represented as a directed acyclic graph. All possible paths along exons are found using the depth-first search method; each path represents a potential splice variant. Exons without clone coverage are trimmed away from the model, and the results are compared with other gene models as a redundancy check. The presence of polyA tails and the GT-AG consensus in the intron sequences is used to determine the gene boundaries and the direction of the gene. Unspliced sequences with correct annotation are then added, without changing the exon-intron boundaries of existing gene models. The primary clusters obtained up to this stage correspond to multi-exon genes, whose subclusters represent splice variants. The remaining unspliced sequences are further clustered according to overlap in the genomic loci. The resulting clusters, representing single-exon genes, are added to the list of primary clusters [Kim2004].
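A minimal sketch of the path enumeration: all paths from a start exon to an end exon in the directed acyclic exon graph are listed by depth-first search, each path being a candidate splice variant (the construction of the exon graph itself is not shown, and real clusters can have several start and end exons).

```python
# Sketch: enumerate all paths through a directed acyclic exon graph by
# depth-first search; each path is a candidate splice variant.
def all_paths(graph, start, end, path=None):
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        paths.extend(all_paths(graph, nxt, end, path))
    return paths

# Toy exon graph: exon2 can be skipped, giving two splice variants.
exon_graph = {"exon1": ["exon2", "exon3"], "exon2": ["exon3"], "exon3": []}
print(all_paths(exon_graph, "exon1", "exon3"))
```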
[ASmodeler Homepage]
The algorithm uses POA (Partial Order Alignment) to align the EST (and mRNA) data to the genomic sequence, using full dynamic programming with gap penalties that allow intron sequences. After this, adjacent intervals that appear to be exons are merged, and a splice graph, with exons as nodes, is constructed [Heber2002]. The splice graph is then processed with the Heaviest Bundling (HB) algorithm to find the most likely spliceform (the isoform).
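The details of Heaviest Bundling are given in the cited papers; as a rough, hedged illustration of the underlying idea only, the sketch below finds a heaviest path through a small exon DAG (given in topological order) by dynamic programming, with made-up read-support weights.

```python
# Rough sketch (not the actual Heaviest Bundling algorithm): find a heaviest
# path through an exon DAG whose nodes are listed in topological order and
# weighted by hypothetical read-support counts.
def heaviest_path(topo_order, edges, weight):
    best = {n: weight[n] for n in topo_order}   # best score of a path ending at n
    back = {n: None for n in topo_order}
    for n in topo_order:                        # relax edges in topological order
        for succ in edges.get(n, []):
            if best[n] + weight[succ] > best[succ]:
                best[succ] = best[n] + weight[succ]
                back[succ] = n
    end = max(best, key=best.get)               # trace back the heaviest path
    path = []
    while end is not None:
        path.append(end)
        end = back[end]
    return list(reversed(path))

weights = {"e1": 10, "e2": 2, "e3": 8}
print(heaviest_path(["e1", "e2", "e3"], {"e1": ["e2", "e3"], "e2": ["e3"]}, weights))
```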
The transcript sequence of each isoform is constructed by assembling
the genomic sequence intervals that constitute its exons. The protein
sequences are generated by searching for the longest ORFs within each
transcript [Xing2004].
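A minimal sketch of a longest-ORF search on the forward strand of a transcript (a real implementation would also consider the reverse strand and handle ORFs lacking a stop codon):

```python
import re

# Sketch: find the longest open reading frame (ATG ... in-frame stop codon) on
# the forward strand; the lookahead makes overlapping candidates visible.
def longest_orf(transcript):
    orfs = re.findall(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))", transcript)
    return max(orfs, key=len, default="")

print(longest_orf("CCATGAAATTTGGGTAACC"))   # -> ATGAAATTTGGGTAA
```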
[ASP database Homepage]