Please contact Stefan Seemann (seemann@rth.dk) for any technical issues or general questions about the software.
PETcofold is an integrated framework with PETfold to fold and search for RNA-RNA interactions between two multiple alignments of RNA sequences.
PETcofold predicts the joint secondary structure of two RNA alignments including RNA-RNA interactions with maximum expected accuracy (MEA). PETcofold is an extension of PETfold which is the first tool integrating the duality of energy-based and evolution-based approaches into a single optimization problem for folding of aligned RNA sequences.
PETcofold, like PETfold, applies Pfold which identifies base pairs that are most conserved and energetically most favorable using a maximum expected accuracy scoring. The free energies of single stranded and base paired positions are taken from RNAfold. Inter-molecular thermodynamic folding probabilities are calculated by RNAcofold.
The PETcofold pipeline consists of two steps: (1) intra-molecular folding of both alignments by PETfold and selection of a set of highly reliable base pairs (partial structure); (2) inter-molecular folding of the concatenated alignments by an adapted version of PETfold using constraints from step 1. In the end, partial structures from step 1 and constrained inter-molecular structures from step 2 are combined to the RNA-RNA joint structure including pseudoknots.
The PETcofold algorithm is described in:
Seemann SE, Richter AS, Gorodkin J, Backofen R "Hierarchical folding of multiple sequence alignments for the prediction of structures and RNA-RNA interactions", Algorithms Mol Biol., 5:22, 2010.
and the application on known RNA-RNA interactions is presented in:
Seemann SE, Richter AS, Gesell T, Backofen R, Gorodkin J "PETcofold: Predicting conserved interactions and structures of two multiple alignments of RNA sequences", Bioinformatics, 27(2):211-9, 2011.
PETcofold reads two RNA sequence alignments in FASTA format. Only identifiers that exist in both alignments are considered. At least three unique identifiers are necessary.
>gca_bovine
AGCCCUGUGGUGAAUUUACACGUUGAAUUGGGGGCUU
>gca_chicken
GACUCUGUAGUGAAGU-UCAUAAUGAGUUGGGGGUCU
>gca_mouse
GGUCUUAAGGUGAUA-UUCAUGUCGAAUUGGAGACUU
>gca_rat
AGCCUUAAGGUGAUU-AUCAUGUCGAAUUGAGGGCUU
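The identifier rule above can be checked before submitting a job. The following is a minimal sketch, not part of PETcofold itself: the function names `read_fasta` and `shared_records` are hypothetical, and it assumes that the minimum of three identifiers applies to the identifiers shared by both alignments.

```python
def read_fasta(text):
    """Parse FASTA text into a dict {identifier: aligned sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def shared_records(aln1, aln2, minimum=3):
    """Keep only identifiers present in both alignments (order of aln1).

    Assumption: the required minimum refers to shared identifiers.
    """
    common = [name for name in aln1 if name in aln2]
    if len(common) < minimum:
        raise ValueError(
            f"only {len(common)} shared identifiers, need at least {minimum}"
        )
    return ({n: aln1[n] for n in common}, {n: aln2[n] for n in common})
```

Calling `shared_records(read_fasta(first_text), read_fasta(second_text))` returns both alignments reduced to the common identifiers, or raises `ValueError` if fewer than three remain.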
The user can type the alignments into the text areas, upload FASTA files, select a MAF block, and/or select an Rfam family from a drop down list.
The 17way MULTIZ multiple alignments of the human genome (hg18, Mar. 2006) are accessible by chromosome, start position, end position, and strand. After pressing the Update button, either a drop down list appears below with all alignment regions (MAF-blocks) in human that are covered by the query, or the following text area shows the alignment if only one MAF-block is covered by the query. In the latter case all MAF-block identifiers are offered as input for the 2nd alignment in another drop down list. If the query covers several MAF-blocks, the user has to select one MAF-block from the drop down list and confirm it by pressing the Update button again. The text area with the MAF-block alignment and the drop down list with all its identifiers should then appear. The user can choose multiple identifiers, which will appear in the text area of the 2nd alignment. Afterwards the sequences for the 2nd alignment can be added by hand. The settings are reset by choosing "---" from the chromosome drop down list.
The seed alignment of one Rfam 10.0
family can be chosen from a drop down list as 2nd multiple sequence alignment.
If the seed contains paralogs then we choose the sequence that has a distance
closest to the mean distance in the phylogenetic tree. After pressing the
Update button, the following text area shows the alignment and all its
identifiers are offered as input for the 1st alignment in another drop down list
below. The user can choose multiple identifiers which will appear in the text
area of the 1st alignment. Afterwards the sequences for the 1st alignment can
be added by hand.
If the user already selected a MULTIZ alignment as 1st
alignment then the drop down list offers only Rfam seed alignments which contain
at least 1 species from the selected MAF-block. The settings are reset by
choosing "---" from the Rfam family drop down list.
By default, a phylogenetic tree is calculated in step 1 for both alignments and in step 2 for the concatenated alignment from pairwise distances using the neighbour joining (NJ) algorithm. However, the user can specify their own tree, which will then be used in both steps. The tree must be in Newick format, which is described here. The node names must be the same as the sequence names in the alignment files. If the branch lengths are not given, they are estimated by maximum likelihood.
((gca_rat:0.61783,gca_mouse:0.070947):0.29012,gca_chicken:0.372963,gca_bovine:0.159582):0.001
Tree with 4 species and without branch lengths:
((gca_rat,gca_mouse),gca_chicken,gca_bovine)
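Because the node names must match the sequence names, it can help to compare them before submission. The sketch below is an illustration only: `newick_leaves` and `compare_names` are hypothetical helper names, and the regular expression assumes simple unquoted leaf names in a tree with at least one pair of parentheses.

```python
import re


def newick_leaves(newick):
    """Return leaf names of a Newick string, in order of appearance.

    A leaf name is taken to be any label that directly follows '(' or ','.
    Labels after ')' (internal nodes) and numbers after ':' (branch
    lengths) are therefore ignored. Quoted names are not supported.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def compare_names(newick, identifiers):
    """Compare tree leaves with alignment identifiers.

    Returns two sorted lists:
    (identifiers missing from the tree, leaves missing from the alignment).
    """
    leaves = set(newick_leaves(newick))
    ids = set(identifiers)
    return sorted(ids - leaves), sorted(leaves - ids)
```

Both returned lists are empty when the tree and the alignment use exactly the same names, as is the case for the two example trees and the example alignment given above.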
If the pseudoknot-free consensus secondary structure of the first and/or second RNA alignment is already known, then step 1 of the PETcofold algorithm looks only for highly reliable base pairs in these structures. The structure must be in dot bracket notation, which is described here.
((((((....((......))..........)))))).
If the secondary structure of the RNA duplex including the RNA-RNA interaction is already known, then its PETcofold score and base pair reliabilities can be calculated too. Step 1 is constrained to the intra-molecular structures marked as '(' and ')', and step 2 to the inter-molecular structure marked as '[' and ']'. Both structures have to be concatenated by '&'.
.((((.................[[[[[[[[.))))..&..((((((.]].]]]]]].......)))))).
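A joint structure string like the one above can be validated before it is submitted. The following sketch is not taken from PETcofold: `parse_joint_structure` is a hypothetical helper, and positions are 0-based indices into the two structures joined without the '&'.

```python
def parse_joint_structure(joint):
    """Split a joint structure at '&' and collect its base pairs.

    Returns (length of first structure, intra-molecular pairs from '()',
    inter-molecular pairs from '[]'). Raises ValueError if brackets are
    unbalanced, if a '()' pair spans both molecules, or if a '[]' pair
    does not connect the first molecule with the second.
    """
    if joint.count("&") != 1:
        raise ValueError("expected exactly one '&' separator")
    left, right = joint.split("&")
    structure = left + right
    n1 = len(left)

    opening = {"(": "round", "[": "square"}
    closing = {")": "round", "]": "square"}
    stacks = {"round": [], "square": []}
    pairs = {"round": [], "square": []}
    for i, ch in enumerate(structure):
        if ch in opening:
            stacks[opening[ch]].append(i)
        elif ch in closing:
            kind = closing[ch]
            if not stacks[kind]:
                raise ValueError(f"unmatched '{ch}' at position {i}")
            pairs[kind].append((stacks[kind].pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character '{ch}' at position {i}")
    if stacks["round"] or stacks["square"]:
        raise ValueError("unmatched opening bracket")

    for i, j in pairs["round"]:
        if (i < n1) != (j < n1):
            raise ValueError("'()' pair spans both molecules")
    for i, j in pairs["square"]:
        if not (i < n1 <= j):
            raise ValueError("'[]' pair does not connect the two molecules")
    return n1, sorted(pairs["round"]), sorted(pairs["square"])
```

For the example structure this yields ten intra-molecular and eight inter-molecular base pairs.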
The PETcofold algorithm is optimized by several parameters. The advanced user has the possibility to change them (values between 0 and 1 for all parameters except extstem which is boolean):
The following parameters are PETfold specific. They influence the scoring scheme and the impact of the evolutionary model on the structure prediction.
If the phylogenetic tree was not given, then the phylogeny of the sequences in the RNA alignment is calculated using the neighbour joining approach. The branch lengths are then estimated by a maximum likelihood approach. The latter is also done if the tree was submitted without distances between the nodes. The output presents the joint phylogenetic tree of the concatenated RNA multiple alignments. It consists of a PNG image and alternative download links to a PS file, a PDF file, and the tree in Newick format.
The plain text output shows the command-line output of step 1 and step 2 of the program. The output of step 1 shows the partial structure with its probabilities in the thermodynamic and evolutionary model for both RNA alignments for different values of the parameter Delta. Remember that step 1 is repeated for increasing values of Delta until the partial structure probability is greater than Gamma in the thermodynamic model, the evolutionary model, or both. The first line of the output of step 2 shows the intra-molecular consensus RNA secondary structures of both RNA alignments predicted by PETfold. The second line shows the joint RNA secondary structure of the concatenated alignments predicted by PETcofold. The constrained base pairs from step 1 (partial structures) are indicated by pairs of “{” and “}”, intra-molecular base pairs predicted in the second step by pairs of “(” and “)”, and interaction sites (inter-molecular base pairs) by pairs of “[” and “]”. The third line again shows the interaction sites. The last line shows the score of the joint RNA secondary structure with maximal expected accuracy (the score is normalized by the alignment length).
The concatenated alignments are shown as a PNG figure annotated with the sequence position, the consensus structure, base pairs colored using the Vienna RNA conservation coloring scheme, the reliability of each nucleotide to be base paired (reliab_paired), and the sequence conservation. Compensatory mutations supporting the consensus structure are marked by color. The color scheme is the same as that employed by RNAalifold and alidot: red marks pairs with no sequence variation; ochre, green, turquoise, blue, and violet mark pairs with 2, 3, 4, 5, and 6 different types of pairs, respectively. Paired reliability and conservation are drawn as barplots with values from 0 to 1. The figure is created by an adapted version of the Vienna RNA utility coloraln.pl. In addition, the consensus secondary structure is shown as a PNG figure in which red lines are inter-molecular bindings and blue arcs are intra-molecular bindings. The sequence data of both alignments is presented as a sequence logo in which the size of each nucleotide is proportional to its relative occurrence in the alignment column. This figure was generated with RILogo. Alternatively, the figures can be downloaded as SVG, PS, or PDF files.
The PETcofold reliabilities of all possible base pairs are illustrated as rectangles in the upper triangle of the PNG figure. The base pairs which are part of the predicted joint consensus RNA secondary structure are illustrated as rectangles in the lower triangle. The two bold lines separate intra-molecular from inter-molecular base pairs. The latter are shown in the upper right as well as the lower left rectangle. The single-stranded reliabilities calculated by the PETcofold scoring scheme are also shown as rectangles along the axes. The size of each rectangle is proportional to its reliability. Alternatively, the figure can be downloaded as a PS or PDF file, or the reliabilities as a plain text file.
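The color assignment described above can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the logic of coloraln.pl: the names `pair_types` and `pair_color` are hypothetical, the six pair types are assumed to be the canonical and wobble pairs AU, UA, GC, CG, GU, and UG, and any further shading the real script may apply is ignored.

```python
# Assumption: the six countable pair types are the canonical and wobble pairs.
PAIR_TYPES = {"AU", "UA", "GC", "CG", "GU", "UG"}

# Number of different pair types observed -> color name from the text above.
COLORS = {1: "red", 2: "ochre", 3: "green", 4: "turquoise", 5: "blue", 6: "violet"}


def pair_types(column_i, column_j):
    """Return the set of pair types formed by two alignment columns.

    Each column is a string with one character per sequence. Gaps and
    non-pairing combinations are skipped; T is treated as U.
    """
    found = set()
    for a, b in zip(column_i.upper(), column_j.upper()):
        pair = (a + b).replace("T", "U")
        if pair in PAIR_TYPES:
            found.add(pair)
    return found


def pair_color(column_i, column_j):
    """Return the color name for a paired column, or None if nothing pairs."""
    return COLORS.get(len(pair_types(column_i, column_j)))
```

Two columns that read G and C in every sequence give one pair type and are therefore colored red, while columns containing GC, AU, and UA pairs give three types and are colored green.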