metaRNAmodules

Automated RNA 3D module extraction and modeling.

Introduction

Recent progress in RNA structure prediction aims not only at explicit prediction of RNA 3D structure, but also at filling the 'gap' in 2D RNA structure prediction, where, for example, predicted internal loops can often adopt a structure based on non-canonical base pairs. This is increasingly recognized as the number of known RNA 3D modules steadily grows. There is general interest in matching modules from one molecule to other molecules for which the 3D structure is not known. However, a major challenge is to determine whether a module is trustworthy in the first place. Another challenge is that module recognition and modeling require time-consuming manual intervention. We have created a pipeline, metaRNAmodules, which completely automates the extraction of putative FR3D modules and the mapping of such modules to Rfam alignments to obtain comparative evidence. In a subsequent step, a module represented as a two-dimensional graph is fed into the RMDetect program to test its discriminative power on real and randomized Rfam alignments. An initial extraction of 22495 3D modules from all PDB files results in 977 internal loop and 17 hairpin loop modules with clear discriminative power. Many of these modules describe only minor variants of each other. Indeed, mapping the modules onto Rfam families results in 35 unique locations in 11 different families.

Download

The standalone version of the metaRNAmodules pipeline and data is available for download here.

metaRNAmodules-1.0.2.tar.gz (released August 2013)

metaRNAmodules-0.1.1.0.tar.gz (released January 2013, outdated)

Installation

After downloading the metaRNAmodules package, unpack the package and change into the main metaRNAmodules directory by following the steps given below.

tar -xzvf metaRNAmodules-1.0.2.tar.gz
cd metaRNAmodules-1.0.2

Requirements

Tools
Before you can start the pipeline, it is mandatory to install the following tools. Note that it is important to install the indicated version of a tool to make sure that the input is processed correctly.

- Infernal-1.0.2  ("INFERence of RNA ALignment") for the use of cmalign [1]
- CLUSTAL W-2.1   ("Multiple Sequence Alignment") [2]
- RMDetect-0.0.3  ("3D RNA modules detection on genomic sequences") [3]
- Perl v5.8.8     ("Programming language")
- MULTIPERM-0.9.3 ("Shuffling multiple sequence alignments") [4]

metaRNAmodules requires the environment variables listed below. Each variable must be set to the location of the corresponding tool's binary directory. In a bash terminal, for example, they can be set as follows:

export CMALIGNPATH='<PATH>'
export CLUSTALWPATH='<PATH>'
export RMDETECTPATH='<PATH>'
export MULTIPERMPATH='<PATH>'

You can add these lines to the .bashrc file in your home directory to avoid re-entering the commands every time you start metaRNAmodules in a new terminal.
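Before starting a run, it can be useful to confirm that all four variables are set. The following is a minimal Python 3 sketch, not part of the metaRNAmodules package: the variable names come from the export lines above, while the helper name missing_tool_paths is hypothetical and the check only tests that each value names an existing directory.

```python
import os

# Variable names are taken from the export lines above.
REQUIRED_VARS = ("CMALIGNPATH", "CLUSTALWPATH", "RMDETECTPATH", "MULTIPERMPATH")


def missing_tool_paths(env=None):
    """Return the required variables that are unset or do not name a directory.

    Hypothetical helper for illustration; it does not verify tool versions.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        path = env.get(name, "")
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


if __name__ == "__main__":
    for name in missing_tool_paths():
        print(f"{name} is not set or is not a directory")
```

If the script prints nothing, all four variables point to existing directories. Whether the expected binaries are actually inside them is not checked.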

For the usage of PutativeModules from the RNAmodules package, which is part of the pipeline, read the README file in the RNAmodules folder.

RMDetect requires some further tools:

- Vienna Package version 1.8.4 or greater for the use of RNAfold and RNAalifold
- Python 2.5 or greater

and optionally:

- Psyco
- Numpy
- Matplotlib

Please follow the installation instructions of RMDetect.

Usage

All parameters of metaRNAmodules-1.0.2 are optional. Without any options, the pipeline runs in example mode: it skips PutativeModules and uses the small example files (see Examples).
With the option -o (--outdir) you can specify the path to the main output directory where all output files are stored. You can also give the new models a name (-n, --name) and a version number (-x, --x). Both are used by RMDetect when scanning an alignment.
If you have problems running the PutativeModules tool, you can run the pipeline without it on an example output file: set the option -p 'n' or omit the -p option.
You can run metaRNAmodules either on a small example (24 FR3D files, -e 1) or on a big example (11680 FR3D files, -e 2).
WARNING: running the big example takes several days to produce the output.
Furthermore, you can provide an alignment in STOCKHOLM format as input (-a, --ali2scan). The alignment is scanned with the new model and then shuffled (default: 25 times). The shuffled alignments are scanned with the same model, and the resulting RMDetect score distribution serves as the null model. The output is the delta_sco discrimination measurement for these two distributions.

Syntax

perl metaRNAmodules-1.0.2.pl [options]

Available options

-f, --fr3d      <str>      Directory containing the FR3D files [default: ./FR3D_examples/]
-o, --outdir    <str>      Directory containing the output of metaRNAmodules 
                           [default: ./metaRNAmodules_Out/]
-n, --name      <str>      Name of the new model [default: MYMODEL_fr3d]
-x, --x         <float>    Version number of the new model [default: 1.0]
-a, --ali2scan  <str>      File containing an alignment [mandatory format: STOCKHOLM]
-h, --help                 Print help file and exit
-V, --version              Print version and exit

Example command line call on the small example file

perl metaRNAmodules-1.0.2.pl -p n -e 1
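When the pipeline is driven from a script, the call can be assembled from the options in the table above. The following Python 3 sketch is an illustration only: the helper build_command and the paths ./my_out/ and RF00015.stk are hypothetical, and only flags listed under "Available options" are used.

```python
import shlex

SCRIPT = "metaRNAmodules-1.0.2.pl"


def build_command(fr3d=None, outdir=None, name=None, version=None, ali2scan=None):
    """Assemble the argument list for one pipeline run.

    Options left as None are omitted, so the pipeline falls back to its
    documented defaults. Hypothetical helper, not part of the package.
    """
    cmd = ["perl", SCRIPT]
    options = [
        ("-f", fr3d),      # directory containing the FR3D files
        ("-o", outdir),    # main output directory
        ("-n", name),      # name of the new model
        ("-x", version),   # version number of the new model
        ("-a", ali2scan),  # alignment in STOCKHOLM format
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Hypothetical output directory and alignment file.
    cmd = build_command(outdir="./my_out/", name="MYMODEL_fr3d",
                        version=1.0, ali2scan="RF00015.stk")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from within the metaRNAmodules directory. The -p and -e options mentioned in the usage text are left out here because the option table does not list them.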

Input formats
All example files and input files must be gzip-compressed (gzip -9) and carry the suffix .gz.
Alignments need to be in STOCKHOLM format with the file extension .sto or .stk. Below is an example STOCKHOLM file.

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
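To check an alignment before handing it to the pipeline, the format can be read with a few lines of code. The following Python 3 sketch is a minimal reader written for illustration; it is not the parser used by metaRNAmodules. It collects sequence lines and #=GC column annotations, skips all other markup, and is demonstrated on a two-sequence excerpt of the example whose SS_cons line spans all 23 alignment columns.

```python
def parse_stockholm(text):
    """Return (sequences, column_annotations) for one Stockholm alignment.

    Minimal sketch: #=GF, #=GS and #=GR lines are skipped, and interleaved
    blocks are concatenated per sequence name.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM' header")
    sequences, gc = {}, {}
    for line in lines[1:]:
        line = line.rstrip()
        if not line:
            continue
        if line == "//":          # end of alignment
            break
        if line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            gc[tag] = gc.get(tag, "") + value
        elif line.startswith("#"):
            continue              # other markup is ignored in this sketch
        else:
            parts = line.split()
            if len(parts) != 2:
                raise ValueError("malformed sequence line: %r" % line)
            name, seq = parts
            sequences[name] = sequences.get(name, "") + seq
    return sequences, gc


SAMPLE = """# STOCKHOLM 1.0
#=GF ID    UPSK
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
"""

if __name__ == "__main__":
    seqs, gc = parse_stockholm(SAMPLE)
    width = len(gc["SS_cons"])
    for name, seq in seqs.items():
        status = "ok" if len(seq) == width else "length mismatch"
        print(name, len(seq), status)
```

Comparing each sequence length with the length of the SS_cons line is a quick way to spot truncated or damaged alignment files.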

Output

If no output directory is given, metaRNAmodules creates an output directory called 'metaRNAmodules_Out'. It contains a directory called 'Tmp_Modules' with files named, for example, 'moduleClass_758.mod.gz'. The PutativeModules tool extracts all possible modules from the FR3D files and saves them in one output file [default name: PutativeModules.txt]. metaRNAmodules splits this file into separate files according to the structures of the modules. Each file contains one or several modules that share the same dot-bracket notation. See the example file PutativeModules_example.txt.gz in the source directory for an example PutativeModules output and compare it to the example files ./metaRNAmodules_Out/Tmp_Modules/moduleClass_1222.mod.gz and ./metaRNAmodules_Out/Tmp_Modules/moduleClass_758.mod.gz. The number in the moduleClass file name (1222 or 758) denotes the class to which a module belongs, i.e. it is valid for all modules within a moduleClass_*.mod file.
The output directory furthermore contains one directory for each module. Each is named after the Rfam family the module maps to, the structure class number, the PDB id and the positions in the ungapped seed alignment that the module maps to (for example RF00015_1222_2OZB_28_34_42_45). Each of these directories contains a set of files produced by metaRNAmodules.

- MYMODEL_fr3d_1.0.data (data file for Bayesian network training)
- MYMODEL_fr3d_1.0.def (contains all nodes and edges for the Bayesian network)
- mymodel_fr3d_1.0.model (new model file)
- RF00015_mod_clean_refseq.stk.gz (cleaned, modified Rfam alignment containing additionally a 
reference sequence for the module)
- RF00015_mod_clean.stk.gz (the basic alignment where the module maps on)
- RF00015_1222_2OZB_28_34_42_45.module.gz (file with the details of the module, see below)
- test.rmdout.gz (RMDetect output for the scanned alignment)
- test.summary.rmcout.gz (RMCluster output for the scanned alignment, overview of clusters)
- cluster_test.summary.gz (RMCluster output for the scanned alignment)
- shuffled.rmdout.gz (RMDetect output for the shuffled alignment)
- shuffled.summary.rmcout.gz (RMCluster output for the shuffled alignment, overview of clusters)
- cluster_shuffled.summary.gz (RMCluster output for the shuffled alignment)
- ShuffledAlignments (directory with the shuffled alignments, default shuffling number is 25)
- shuffled_rmdout_fr3d.scores (contains all scores for the shuffled alignments, extracted from 
shuffled.rmdout.gz)
- test_rmdout_fr3d.scores (contains all scores for the scanned alignment, extracted from 
test.rmdout.gz)
- RF00015_1222_2OZB_28_34_42_45.stats (contains several details for each quantile value
p=0.80, p=0.85, p=0.90 and p=0.95, see below)
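
The naming scheme of the module directories described above can be split back into its parts when post-processing many results. The following Python 3 sketch is an illustration, not code from the package. It assumes that the name always has the form family_class_PDBid_positions, that the PDB id has four characters, and that the positions come in start/end pairs; the names ModuleDir and parse_module_dir are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class ModuleDir(NamedTuple):
    rfam: str                       # Rfam family accession, e.g. RF00015
    module_class: int               # structure class number
    pdb_id: str                     # PDB identifier
    ranges: List[Tuple[int, int]]   # (start, end) positions of each strand


# Assumed pattern: RFxxxxx_<class>_<4-character PDB id>_<positions...>
_NAME = re.compile(r"^(RF\d{5})_(\d+)_([0-9A-Za-z]{4})_(\d+(?:_\d+)*)$")


def parse_module_dir(name):
    """Split a module directory name into family, class, PDB id and ranges."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError("unexpected directory name: %r" % name)
    positions = [int(p) for p in match.group(4).split("_")]
    if len(positions) % 2:
        raise ValueError("positions must come in start/end pairs")
    ranges = list(zip(positions[0::2], positions[1::2]))
    return ModuleDir(match.group(1), int(match.group(2)), match.group(3), ranges)


if __name__ == "__main__":
    print(parse_module_dir("RF00015_1222_2OZB_28_34_42_45"))
```

For the example name, the two ranges are 28-34 and 42-45, which are the two strands of the internal loop.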

Example module detail file: ('.module.gz')
Below you see an example module detail file. It shows consecutively
- PDB id
- mapped Rfam family
- structure class
- organism of the (full-) reference sequence
- the position of the putative module in the ungapped reference sequence
- the position of the putative module in the gapped seed alignment plus additionally aligned reference sequence
- the position of the putative module in the gapped modified, cleaned seed alignment
- the position of the putative module in the gapped original seed alignment without the reference sequence
- PDB positions of the putative module
- PDB residue numbers of the putative module
- putative module sequence ('&' is the separator of the two putative module parts)
- putative module in dot-bracket notation ('&' is the separator of the two putative module parts)
- base pairs and base pair types of the putative module (NOTE: base numbering starts at 0 and refers to the putative module sequence, e.g. 'BP 4: 10 0' means that base 10 pairs with base 0)

******************************

PDB: 2OZB
RFAM: RF00015
CLASS: 1222
ORG: AABR04000514/122021-122162
Ungapped reference sequence position: 28-34/42-45
Gapped seed alignment position (+full): 30-36/46-49
Gapped cleaned seed alignment position (-full): 30-36/46-49
Gapped original seed alignment position (-full): 30-36/46-49
PDB position: 9-15/23-26
PDB residues: 28-34/42-45 
Putative Module: CAAUGAG&CGAG
Dot-Bracket notation: (...(((&))))
BP 1: 7 6       (cWW)
BP 2: 8 5       (tHS)
BP 3: 9 4       (tSH)
BP 4: 10        0       (cWW)

********************************
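
The base pair list in the module file can be cross-checked against the dot-bracket string. The following Python 3 sketch uses the standard stack-based bracket matching; it is an illustration rather than the pipeline's own code, and the function name base_pairs is hypothetical. The '&' separator is dropped before indexing, so positions are 0-based indices into the concatenated module sequence.

```python
def base_pairs(dot_bracket):
    """Return 0-based (opening, closing) index pairs of a dot-bracket string.

    The '&' strand separator is removed first, so indices refer to the
    concatenated putative module sequence.
    """
    structure = dot_bracket.replace("&", "")
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at index %d" % i)
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r at index %d" % (ch, i))
    if stack:
        raise ValueError("unbalanced '(' at index %d" % stack[-1])
    return pairs


if __name__ == "__main__":
    for opening, closing in base_pairs("(...(((&))))"):
        print(closing, opening)
```

For the example module this yields the pairs 7/6, 8/5, 9/4 and 10/0, the same as the BP lines above. The base pair type (cWW, tHS, tSH) cannot be recovered from the dot-bracket string and has to be read from the file.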

Example statistic file line: ('.stats')
Below you can see an example line of the statistics file. The explanation of each column is given below the line.

80 RF00015_1222_2OZB_28_34_42_45 11 5 .45454545454545454545 7.611224 33 34

1. Quantile value p
2. Module name
3. Module length
4. # base pairs
5. Complexity (#base pairs / module length)
6. Delta_sco (discrimination measurement)
7. # of datapoints in 1-Q_p (right tail of distribution) for scanned alignment
8. # of datapoints in 1-Q_p (right tail of distribution) for shuffled alignment
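
The eight columns can be read into a small record for further analysis. The following Python 3 sketch is an illustration only. It assumes whitespace-separated columns and that the first column gives the quantile in percent (80 for p=0.80); the names StatsRecord and parse_stats_line are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StatsRecord:
    quantile: float      # quantile value p (assumed to be stored in percent)
    module: str          # module name
    length: int          # module length
    base_pairs: int      # number of base pairs
    complexity: float    # base pairs / module length
    delta_sco: float     # discrimination measurement
    tail_scanned: int    # datapoints in the right tail, scanned alignment
    tail_shuffled: int   # datapoints in the right tail, shuffled alignment


def parse_stats_line(line):
    """Parse one whitespace-separated line of a .stats file."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    return StatsRecord(
        quantile=float(fields[0]) / 100.0,
        module=fields[1],
        length=int(fields[2]),
        base_pairs=int(fields[3]),
        complexity=float(fields[4]),
        delta_sco=float(fields[5]),
        tail_scanned=int(fields[6]),
        tail_shuffled=int(fields[7]),
    )


LINE = "80 RF00015_1222_2OZB_28_34_42_45 11 5 .45454545454545454545 7.611224 33 34"

if __name__ == "__main__":
    record = parse_stats_line(LINE)
    # Column 5 should equal column 4 divided by column 3.
    print(record.module, record.delta_sco, record.base_pairs / record.length)
```

In the example line the complexity column matches 5 / 11, which confirms the definition given for column 5.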

Examples

You can run metaRNAmodules on example files for demonstration. The pipeline extracts modules from FR3D files, maps them onto cleaned, modified alignments and builds the model. After the model has been built, the pipeline scans an alignment, shuffles this alignment and scans the shuffled data, resulting in the discrimination measurement delta_sco.
Example files are located in the following directories:

- ./FR3D_examples_all
- ./FR3D_examples_small

./FR3D_examples_all contains 11680 FR3D database files. They are scanned by PutativeModules to extract putative modules. Eight files always belong to one PDB identifier, each containing different information (see [5]). WARNING: running the big example can take several days to produce the output. ./FR3D_examples_small contains 24 FR3D database files for the PDB identifiers 1L9A, 2GIS and 2OZB, used by PutativeModules to scan for putative modules. For demonstration purposes we recommend running the pipeline on the small example files, which takes only a few minutes. Further example files:

- ./PutativeModules_small.txt.gz (small example)
- ./PutativeModules_all.txt.gz (big example)

If you want to run the pipeline without using PutativeModules, metaRNAmodules takes either a small or a big example output file of PutativeModules as input. Both files are located in the main metaRNAmodules directory. PutativeModules_small.txt.gz contains putative modules for the same 24 FR3D files located in ./FR3D_examples_small and PutativeModules_all.txt.gz contains putative modules of 11680 FR3D files.

Other files and directories in the top-level directory:

- lib (contains additional source packages)
- metaRNAmodules_mapping (contains the source code for the mapping part)
- Rfam_clean (contains structure tables in txt format and the rfam_keywords.txt provided by Rfam.)
- RFAMfull10.1_STOCKHOLM (contains full Rfam alignment in Stockholm format)
- RFAMseed10.1_STOCKHOLM (contains seed Rfam alignment in Stockholm format)
- RFAMseed10.1CleanedAlignments (contains cleaned Rfam seed alignments. 'Cleaned' denotes alignments
which are 95% redundancy reduced and have a minimum of 30 sequences.)
- RFAMseed10.1CovarianceModels (contains covariance models v_1.0.2 for cleaned modified Rfam families
used by cmalign.)
- RFAMseed10.1CovarianceModels_orig (contains original covariance models v_1.0.2
used by cmalign)
- README file (how to run metaRNAmodules)
- NewModels (see below)

New Models

A directory called 'NewModels_representatives' with 28 new models is available in the metaRNAmodules directory. These models map to 28 different locations in 10 Rfam families. Each is the representative with maximal delta_sco for its location. These models can be used with RMDetect to scan single RNA sequences and multiple alignments for further occurrences of the module.

If you are interested in all 1982 newly generated models, you can download them here.

Supplementary material

Click here for the supplementary material referred to in the original paper.

References

[1] Nawrocki, E.P. and Kolbe, D.L. and Eddy, S.R. (2009). Infernal 1.0: Inference of RNA alignments, Bioinformatics 25:1335-1337.
[2] Thompson, J.D. and Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research 22(22):4673-4680.
[3] Cruz, J.A. and Westhof, E. (2011). Sequence-based identification of 3D structural modules in RNA with RMDetect, Nature Methods 8(6):513-519.
[4] Anandam, P. and Torarinsson, E. and Ruzzo, W.L. (2009). Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies, Bioinformatics 25(5):668-669.
[5] Sarver, M. and Zirbel, C.L. and Stombaugh, J. and Mokdad, A. and Leontis, N.B. (2008). FR3D: Finding Local and Composite Recurrent Structural Motifs in RNA 3D Structures, Journal of Mathematical Biology 56:215-252.

If you find this software useful for your research, please cite the following work:

Theis C, Höner zu Siederdissen C, Hofacker IL, Gorodkin J (2013). Automated identification of RNA 3D modules with discriminative power in RNA structural alignments. Nucleic Acids Research 41(22):9999-10009.

Contact

For any comments or bug reports please contact the authors. Email: corinna@rth.dk, choener@tbi.univie.ac.at