metaRNAmodules - Automated RNA 3D module extraction and modeling
http://rth.dk/resources/mrm/
Version 0.1.1.0; January 2013
------------------------------------------------------------------



SUMMARY
=======

This constitutes the mapping and modeling part of metaRNAmodules.
The 'metaRNAmodules' executable starts by extracting putative modules from 
the FR3D database. For details see the RNAmodules/README.
Putative modules are mapped onto modified, cleaned Rfam alignments, which serve as 
the basis for training the Bayesian network models. Subsequently, RMDetect is 
applied with the new model to an alignment and to a shuffled alignment to obtain the
discrimination measurement value delta_sco.

metaRNAmodules is written in Perl and has been tested on Linux (Ubuntu).



LICENSE
=======

metaRNAmodules libraries and executables are available under the GNU General Public License (GPLv3).



AVAILABILITY
============

metaRNAmodules is available from 'http://rth.dk/resources/mrm/'.
To run the pipeline, follow the steps below.
metaRNAmodules assumes that the tools listed under 'REQUIREMENTS' are available.




INSTALL
=======

To run the pipeline, extract the tar file and change into the main 
metaRNAmodules directory.

1. Extract metaRNAmodules: "tar xf metaRNAmodules-0.1.1.0.tar.gz"
2. Change into the directory: "cd metaRNAmodules-0.1.1.0"





REQUIREMENTS
============

Mandatory requirements:

The following tools must be installed to run the pipeline. Note that it is 
important to install the listed version of each tool to make sure that the 
input is processed correctly. 


- Infernal-1.0.2  ("INFERence of RNA ALignment") [1]
- CLUSTAL W-2.1   ("Multiple Sequence Alignment") [2]
- RMDetect-0.0.3 ("3D RNA modules detection on genomic sequences") [3]
- Perl v5.8.8    ("Programming language")
- MULTIPERM-0.9.3 ("Shuffling multiple sequence alignments while approximately 
                    preserving dinucleotide frequencies") [4]

metaRNAmodules requires the environment variables listed below. Each variable must be set to the location of the binary directory of the corresponding tool. For example, in a bash terminal they can be set as follows:


export CMALIGNPATH='<PATH>'
export CLUSTALWPATH='<PATH>'
export RMDETECTPATH='<PATH>'
export MULTIPERMPATH='<PATH>'


You can add these lines to the .bashrc file in your home directory to avoid executing the above commands every time you start metaRNAmodules in a new terminal.
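
Before starting the pipeline it can be useful to verify that the four variables are set. The following is a minimal sketch in Python and not part of metaRNAmodules; the function name 'missing_tool_paths' is hypothetical, and the check only tests that each variable points to an existing directory, not that the expected binaries are inside it.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("CMALIGNPATH", "CLUSTALWPATH", "RMDETECTPATH", "MULTIPERMPATH")


def missing_tool_paths(env=None):
    """Return the required variables that are unset or not a directory."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        path = env.get(name, "")
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_tool_paths():
        print("not set or not a directory:", name)
```

An empty result means that all four variables point to existing directories.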

For the usage of RNAmodules, which is part of the pipeline, read the README file in the RNAmodules folder.

RMDetect requires some further tools:
- Vienna Package version 1.8.4 or greater for the use of RNAfold and RNAalifold
- Python 2.5 or greater
and optionally
- Psyco
- Numpy
- Matplotlib
Please follow the installation instructions of RMDetect.





USAGE
=====

perl metaRNAmodules.pl [options]


options:

-o, --outdir           Directory containing the output of metaRNAmodules [default: ./metaRNAmodules_Out]
-n, --name             Name of the new model [default: MYMODEL_fr3d]
-x, --x                Version number of the new model [default: 1.0]
-a, --ali2scan         File containing an alignment [mandatory format: STOCKHOLM]
-p, --putativeModule   Run PutativeModule y/n [default: n]
-e, --exam             Use example [1=small, 2=all, default=1]
-h, --help             Print help file and exit
-V, --version          Print version and exit


All parameters are optional. Without any options, the pipeline runs in example mode on the small example files and without PutativeModules (see 'EXAMPLES' below).
With option (-o, --outdir) you can specify the path to the main output directory where all output files are stored.
You can also give the new model a name (-n, --name) and a version number (-x, --x). These are used by RMDetect when scanning an alignment.
If you have problems running the PutativeModules tool, you can run the pipeline without it on an example output file. Set option -p 'n' or omit the -p option.
With option (-e, --exam) you can choose whether to run metaRNAmodules on the small example (24 FR3D files for 3 PDB identifiers) or on the big example (11680 FR3D files for 1460 PDB identifiers).
WARNING: running the big example takes several days to produce the output.
Furthermore, you can give an alignment in STOCKHOLM format as input (-a, --ali2scan). The alignment is scanned with the new model. Subsequently, the alignment is shuffled (default: 25x) and the shuffled data are scanned with the new model; the resulting RMDetect score distribution is used as the null model. The output is the delta_sco measurement value for these two distributions.
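
If the pipeline is started from a script, the options above can be assembled programmatically. The following Python sketch only builds the argument list; the helper name 'build_command' and its keyword names are hypothetical, and it assumes that each long option takes its value as a separate argument.

```python
def build_command(outdir=None, name=None, version=None, ali2scan=None,
                  putative_module=None, exam=None):
    """Assemble an argument list for metaRNAmodules.pl from the documented options.

    Options left at None are omitted, so the pipeline defaults apply.
    """
    cmd = ["perl", "metaRNAmodules.pl"]
    options = [
        ("--outdir", outdir),
        ("--name", name),
        ("--x", version),
        ("--ali2scan", ali2scan),
        ("--putativeModule", putative_module),
        ("--exam", exam),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # 'my_alignment.stk.gz' is a placeholder file name, not shipped with the package.
    print(" ".join(build_command(outdir="./metaRNAmodules_Out",
                                 ali2scan="my_alignment.stk.gz",
                                 putative_module="n", exam=1)))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from within the main metaRNAmodules directory.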








INPUT FORMATS
=============

All example files and input files should be compressed with gzip (gzip -9 <file>) and carry the suffix .gz.
The alignment file that is scanned with the new model must be in STOCKHOLM format.
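
Where the gzip command line tool is not at hand, the same compression can be sketched with the Python standard library. The helper names 'gzip_max' and 'read_gz_text' are hypothetical. Note that, unlike 'gzip -9 <file>', this sketch keeps the uncompressed original.

```python
import gzip
import shutil


def gzip_max(src_path):
    """Write src_path + '.gz' at compression level 9 and return the new path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, \
            gzip.open(dst_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def read_gz_text(path):
    """Read a gzip-compressed text file back into a string."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```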

This is an example Stockholm file:
# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal 
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
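
To illustrate the layout of such a file, the following Python sketch reads the sequence lines and the '#=GC' annotation lines of a Stockholm alignment. It is a simplified reader written for this README, not the parser used by metaRNAmodules; it ignores '#=GF', '#=GS' and '#=GR' lines, and the function name 'parse_stockholm' is hypothetical.

```python
def parse_stockholm(text):
    """Return (sequences, gc) dictionaries for one Stockholm alignment.

    Sequence lines with the same name are concatenated, so interleaved
    blocks are joined in order of appearance.
    """
    sequences = {}
    gc = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            gc[tag] = gc.get(tag, "") + value
        elif line.startswith("#"):
            continue  # header and other annotation lines
        else:
            name, seq = line.split(None, 1)
            sequences[name] = sequences.get(name, "") + seq.strip()
    return sequences, gc


# Shortened copy of the example file above.
EXAMPLE = """# STOCKHOLM 1.0
#=GF ID    UPSK
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
"""

if __name__ == "__main__":
    seqs, gc = parse_stockholm(EXAMPLE)
    for name, seq in seqs.items():
        print(name, len(seq))
    print("SS_cons", gc["SS_cons"])
```

For a compressed alignment the text can be obtained with gzip.open(path, "rt").read() before it is passed to the function.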





OUTPUT
======


You can define an output directory with option (-o, --outdir). If no directory is given, metaRNAmodules 
creates an output directory called 'metaRNAmodules_Out'. This directory contains 
a directory called 'Tmp_Modules', which holds files named, for example, 'moduleClass_758.mod.gz'.
The PutativeModule tool extracts all possible modules from the FR3D files and saves them in
one output file [default name: PutativeModules.txt]. metaRNAmodules splits this file into several 
files according to the structures of the modules. Each file contains one or several putative modules 
which have the same dot-bracket notation. 
See the example file PutativeModules_small.txt.gz in the source directory for an example 
PutativeModule output and compare it to the example files  
./metaRNAmodules_Out/Tmp_Modules/moduleClass_1222.mod.gz and 
./metaRNAmodules_Out/Tmp_Modules/moduleClass_758.mod.gz
The number in the moduleClass file name (1222 or 758) denotes the class to which a module belongs, 
i.e. it is valid for all modules within a moduleClass_*.mod file.

The output directory furthermore contains one directory for each putative module. Each is named after the 
Rfam family the module maps to, the structure class number, the PDB id and the positions in the 
ungapped seed alignment that the module maps to (for example RF00015_1222_2OZB_28_34_42_45).
Each of these directories contains the following files produced by metaRNAmodules:


- MYMODEL_fr3d_1.0.data (data file for Bayesian network training)
- MYMODEL_fr3d_1.0.def (contains all nodes and edges for the Bayesian network)
- mymodel_fr3d_1.0.model (trained model)
- RF00015_mod_clean_refseq.stk.gz (cleaned, modified Rfam alignment containing additionally a 
reference sequence for the module)
- RF00015_mod_clean.stk.gz (the basic modified, cleaned alignment where the module maps on)
- RF00015_1222_2OZB_28_34_42_45.module.gz (file with the details of the module, see below)
- test.rmdout.gz (RMDetect output for the scanned alignment)
- test.summary.rmcout.gz (RMCluster output for the scanned alignment, overview of clusters)
- cluster_test.summary.gz (RMCluster output for the scanned alignment)
- shuffled.rmdout.gz (RMDetect output for the shuffled alignment)
- shuffled.summary.rmcout.gz (RMCluster output for the shuffled alignment, overview of clusters)
- cluster_shuffled.summary.gz (RMCluster output for the shuffled alignment)
- ShuffledAlignments (directory with the shuffled alignments, default shuffling number is 25)
- shuffled_rmdout_fr3d.scores (contains all scores for the shuffled alignments, extracted from 
shuffled.rmdout.gz)
- test_rmdout_fr3d.scores (contains all scores for the scanned alignment, extracted from 
test.rmdout.gz)
- RF00015_1222_2OZB_28_34_42_45.stats (contains for each quantile value p=0.80, p=0.85,
p=0.90, p=0.95 several details, see below)
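
The names of the module directories and of the '.module.gz' and '.stats' files follow the scheme <Rfam family>_<structure class>_<PDB id>_<positions>. The following Python sketch splits such a name into its parts. It assumes that the positions always come as start/end pairs, as in the example name; the function name 'parse_module_name' is hypothetical.

```python
def parse_module_name(name):
    """Split a name such as 'RF00015_1222_2OZB_28_34_42_45' into its parts."""
    parts = name.split("_")
    if len(parts) < 5:
        raise ValueError("unexpected module name: %r" % name)
    numbers = [int(p) for p in parts[3:]]
    if len(numbers) % 2:
        raise ValueError("positions must come in start/end pairs: %r" % name)
    return {
        "family": parts[0],
        "structure_class": int(parts[1]),
        "pdb_id": parts[2],
        # One (start, end) tuple per module strand, in ungapped seed
        # alignment coordinates.
        "strands": list(zip(numbers[0::2], numbers[1::2])),
    }


if __name__ == "__main__":
    print(parse_module_name("RF00015_1222_2OZB_28_34_42_45"))
```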



Example module detail file ('.module.gz')
Below you see an example module detail file. It shows, in this order:
- PDB id
- mapped Rfam family
- structure class
- organism of the (full-) reference sequence
- the position of the putative module in the ungapped seed alignment
- the position of the putative module in the gapped seed alignment
- the position of the putative module in the PDB sequence
- the position of the putative module in the modified, cleaned gapped seed alignment
- putative module sequence ('&' is the separator of the two putative module parts)
- putative module in dot-bracket notation ('&' is the separator of the two putative module parts)
- base pairs and base pair types of the putative module (NOTE: base indices start at '0' and refer to the putative module sequence without the '&' separator, e.g. base 0 pairs with base 10)

******************************

PDB: 2OZB
RFAM: RF00015
CLASS: 1222
ORG: AABR04000514/122021-122162
Ungapped seed alignment position: 28-34/42-45
Gapped seed alignment position: 30-36/46-49
PDB position: 9-15/23-26
Gapped modified seed alignment position: 30-36/46-49
Putative Module: CAAUGAG&CGAG
Dot-Bracket notation: ((..(((&))>)
BP 1: 7 6       (cWW)
BP 2: 8 5       (tHS)
BP 3: 9 4       (tSH)
BP 4: 9 1       (tSS)
BP 5: 10        0       (cWW)

********************************
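
A module detail file can be read with a few lines of Python. The sketch below is an illustration based only on the example above, not the reader used by metaRNAmodules; it collects the 'key: value' lines in a dictionary and the 'BP' lines in a list. The function name 'parse_module_details' is hypothetical.

```python
import re

# Matches lines such as "BP 5: 10        0       (cWW)".
BP_LINE = re.compile(r"^BP\s+(\d+):\s+(\d+)\s+(\d+)\s+\((\w+)\)$")


def parse_module_details(text):
    """Return (fields, base_pairs) for the text of a module detail file."""
    fields = {}
    base_pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"*"}:
            continue  # blank lines and asterisk separators
        match = BP_LINE.match(line)
        if match:
            _, i, j, kind = match.groups()
            base_pairs.append((int(i), int(j), kind))
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields, base_pairs


# Excerpt of the example file above.
EXAMPLE_MODULE = """
PDB: 2OZB
RFAM: RF00015
CLASS: 1222
Putative Module: CAAUGAG&CGAG
Dot-Bracket notation: ((..(((&))>)
BP 1: 7 6       (cWW)
BP 2: 8 5       (tHS)
BP 3: 9 4       (tSH)
BP 4: 9 1       (tSS)
BP 5: 10        0       (cWW)
"""

if __name__ == "__main__":
    fields, base_pairs = parse_module_details(EXAMPLE_MODULE)
    sequence = fields["Putative Module"].replace("&", "")
    for i, j, kind in base_pairs:
        # Indices are 0-based on the sequence without the '&' separator.
        print(sequence[i], sequence[j], kind)
```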


Example statistics file line ('.stats')
Below you can see an example line of the statistics file. Each column
is explained below the line.
80 RF00015_1222_2OZB_28_34_42_45 11 5 .45454545454545454545 7.611224 33 34

1. quantile value p in percent (80 corresponds to p=0.80)
2. module name
3. module length
4. # base pairs
5. Complexity (#base pairs / module length)
6. delta_sco (discrimination measurement)
7. # of datapoints in 1-Q_p (right tail of distribution) for scanned alignment
8. # of datapoints in 1-Q_p (right tail of distribution) for shuffled alignment
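
The eight columns can be read into named fields as sketched below in Python. The field names are chosen for this illustration only and do not appear in the '.stats' file. The example also shows that the complexity in column 5 equals column 4 divided by column 3 (5 / 11).

```python
def parse_stats_line(line):
    """Split one whitespace-separated '.stats' line into named fields."""
    (quantile, module, length, n_base_pairs,
     complexity, delta_sco, n_scanned, n_shuffled) = line.split()
    return {
        "quantile": int(quantile),          # column 1
        "module": module,                   # column 2
        "length": int(length),              # column 3
        "n_base_pairs": int(n_base_pairs),  # column 4
        "complexity": float(complexity),    # column 5
        "delta_sco": float(delta_sco),      # column 6
        "n_scanned": int(n_scanned),        # column 7
        "n_shuffled": int(n_shuffled),      # column 8
    }


if __name__ == "__main__":
    row = parse_stats_line(
        "80 RF00015_1222_2OZB_28_34_42_45 11 5 "
        ".45454545454545454545 7.611224 33 34"
    )
    print(row["module"], row["delta_sco"])
    print(row["n_base_pairs"] / float(row["length"]), row["complexity"])
```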







EXAMPLES
========

You can run metaRNAmodules for demonstration on example files.
The pipeline will extract modules from FR3D files, map them onto cleaned, modified alignments and 
build the model. After the model has been built, the pipeline scans an alignment, shuffles this 
alignment (25x) and scans the shuffled data, resulting in the discrimination measurement delta_sco.



Example files are located in the following directories:
- ./FR3D_examples_all
- ./FR3D_examples_small

./FR3D_examples_all contains 11680 FR3D database files. They are scanned by PutativeModules to extract putative modules. There are always 8 files that belong to one PDB identifier. They contain different information (see [5]).

WARNING: if you run on the big example it could take several days to produce the output.

./FR3D_examples_small contains only 24 FR3D database files for the PDB identifiers 1L9A, 2GIS and 2OZB, which PutativeModules scans for putative modules. For demonstration purposes it is advisable to run the pipeline on the small example files. It takes only a few minutes to get the results.


Further example files:
- ./PutativeModules_small.txt.gz (small example)
- ./PutativeModules_all.txt.gz (big example)

If you want to run the pipeline without using PutativeModules, metaRNAmodules takes either a small or a big example output file of PutativeModules as input. Both files are located in the main metaRNAmodules directory. PutativeModules_small.txt.gz contains putative modules for the same 24 FR3D files located in ./FR3D_examples_small and PutativeModules_all.txt.gz contains putative modules of 11680 FR3D files.




Other files and directories in the top-level source directory:

- lib (contains additional source packages)
- metaRNAmodules_mapping (contains the source code for the mapping part)
- metaRNAmodules_model (contains the source code for the modeling part)
- Rfam_clean (contains structure tables in txt format and the rfam_keywords.txt provided by Rfam.)
- RFAMfull10.1_STOCKHOLM (contains full Rfam alignment in Stockholm format)
- RFAMseed10.1_STOCKHOLM (contains seed Rfam alignment in Stockholm format)
- RFAMseed10.1CleanedAlignments (contains cleaned Rfam seed alignments. 'Cleaned' denotes alignments
which are 95% redundancy reduced and have a minimum of 30 sequences.)
- RFAMseed10.1CovarianceModels (contains covariance models provided by Rfam used by cmalign.)
- README file

- NewModels_representatives (see below)



NEW MODELS
==========

A directory called 'NewModels_representatives' with 28 new models is available 
in the metaRNAmodules directory. These models map to 28 different locations in 10 Rfam families. They are
the representatives of the locations with maximal delta_sco. They can be used with RMDetect to scan single RNA 
sequences and multiple alignments for further occurrences of the module.



SUPPLEMENT
==========

Check 'http://rth.dk/resources/mrm/' for the supplementary material referred to in the original paper. 


REFERENCES
==========


[1] Nawrocki, E.P. and Kolbe, D.L. and Eddy, S.R. (2009). Infernal 1.0: Inference of RNA alignments, Bioinformatics 25:1335-1337.

[2] Thompson, J.D. and Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research 22(22):4673-4680.

[3] Cruz, J.A. and Westhof, E. (2011). Sequence-based identification of 3D structural modules in RNA with RMDetect, Nature Methods 8(6):513-519.

[4] Anandam, P. and Torarinsson, E. and Ruzzo, W.L. (2009). Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies, Bioinformatics 25(5): 668-669.

[5] Sarver, M. and Zirbel, C.L. and Stombaugh, J. and Mokdad, A. and Leontis, N.B. (2008). FR3D: Finding Local and Composite Recurrent Structural Motifs in RNA 3D Structures, Journal of Mathematical Biology 56:215-252.






