RNAsnp

Efficient detection of local RNA secondary structure changes induced by SNPs

Download



Installation

After downloading the RNAsnp package, you can follow the installation steps given below,
tar -xzvf RNAsnp-1.2.tar.gz
cd RNAsnp-1.2
./configure
make
make install
Note the step make install is optional. The RNAsnp can also be executed from
Progs/RNAsnp
RNAsnp requires an environment variable named RNASNPPATH to run. The location of the RNAsnp-1.2 directory needs to be assigned to the RNASNPPATH variable. For example in the bash terminal, this can be assigned as,
export RNASNPPATH='<PATH>/RNAsnp-1.2' 
You can add this line to .bashrc file available in the home directory to avoid executing the above command every time to start with RNAsnp on new terminal.

Usage


Summary
RNAsnp requires an RNA sequence and optionally a list of SNPs to be analyzed. The effect of SNPs on local RNA secondary structure can be detected in three possible modes,

The programs RNAfold and RNAplfold are components of the Vienna RNA Package [1,2]. Strictly, the Mode 1 and 2 requires both RNA sequence and SNP as input, whereas Mode 3 requires only the RNA sequence. In Mode 3, the program starts with Mode 2 to find the effect of all possible substitutions at each nucleotide position. The SNPs with p-value less than 0.4 are further evaluated using Mode 1 and report the SNPs which have p-value less than 0.1. By default the program uses a window of 400nts, +/- 200nts around the SNP position, to compute the base pairing probability in all the three Modes. This default window length 200 can be changed between 100 and 800 in multiples of 50 for Mode 1, and between 200 and 800 in multiples of 50 for Mode 2 and 3. This restriction is necessary to keep size of parameter tables for the p-values calculations manageable. Please see below for more details about the parameters and their usage.

Syntax
RNAsnp -f <seq_file> -s <snp_file> [options]
General options:
Help:
-h, --help
	Print help and exit

--detailed-help
	Print help, including all details and hidden options, and exit

--full-help           
	Print help, including hidden options, and exit

-V, --version                 
	Print version and exit

Input Options:

-f, --seq=STRING

	File containing the input sequence

	The single input sequence can be provided either in fasta format or 
	linear sequence without any gaps

-s, --snp=STRING

	File containing the list of SNP
	
	The list of SNPs to be tested have to be provided in separate lines, see
 	README file for more description about the input format

-m, --mode=INT

	Select the mode of operation (default=`1')
                                  
        1 - perform global folding by using RNAfold and compute the difference in 
	base pair probabilities for all sequence intervals
                                    
        2 - perform local folding by using RNAplfold and compute the difference in
	base pair probabilities for all sequence intervals of fixed length
                                  
        3 - screen putative structure-disruptive SNPs in an RNA sequence

	 Mode 1 is designed to predict the effect of SNPs on short RNA sequences
	 (i.e., -w parameter is less than or equal to 500), where the base pair
	 probabilities of the wild-type and mutant RNA sequences are calculated using
	 the global folding method RNAfold. The structural difference between
	 wild-type and mutant is computed using Euclidean distance and Pearson
	 correlation measures for all sequence intervals (with minimum size of 50,
	 -l). Finally, the interval with maximum base pair distance or minimum
	 correlation coefficient and the corresponding p-value is reported.

	 Mode 2 is designed to predict the effect of SNPs on large RNA sequence.
	 Here, the base pair probabilities are calculated using the local folding
	 method RNAplfold (with -W 200 and -L 120 options). As a first step, the
	 structural difference is calculated using the Euclidean distance measure for
	 all sequence intervals of fixed window length (default: 20, -X) and allowing
	 the bases within the window can pair up to a distance of 120 (i.e. the
	 maximal span of a base pair, -Y). In the second step, the sequence interval
	 [u, v] with maximum base pair distance is selected to re-compute the
	 difference for all internal local intervals that starting at u. Finally, the
	 interval with maximum base pair distance and the corresponding p-value is
	 reported.

	 Mode 3, the combination of modes 1 and 2, is designed to screen all possible
	 structure-disruptive SNPs in an input sequence using a brute-force approach.
	 First, Mode 2 is applied to evaluate the SNP effect for all possible
	 substitutions at every nucleotide position. Second, the SNPs with p-value
	 less than 0.4 (--pvalue1) are subjected to Mode 1 to re-compute the structure
	 effect using a global folding approach. The SNPs that have significant local
	 structural effect (p-value less than 0.1, --pvalue2) are finally reported.


-w, --winsizeFold=INT
	
	length of flanking sequence on either side of SNP considered for folding (default=`200')

	By default the program uses +/- 200nts around the SNP position to compute the
  	base pair probabilities in all the three modes. This default value can be
  	changed between 100 and 800 (inclusive) in multiples of 50 for Mode 1, and
  	between 200 and 800 (inclusive) in multiples of 50 for Mode 2 and 3. In order
  	to achieve this, however, please make sure that the input sequence is at
  	least twice the size of chosen flanking. This restriction is necessary to
  	keep the size of parameter tables for the p-value calculations manageable.

   	In case the input sequence is less than twice the size of chosen flanking,
  	the RNAsnp takes the nts up to the start and end position of the given
  	sequence from the SNP position and perform the analysis. However, in this
  	case the reporting p-value is not accurate since the input sequence length
  	does not match the sequence length available in the pre-computed parameter
  	tables.


Additonal parameters:
The following optional paramaters can be provided as input together with the above general options. However, it is important to note that the precomputed background scores, which RNAsnp uses to estimate p-value, are based on the default value assigned to the following parameters. Thus, if the default value is changed for any of the following parameters (except --pvalue1 and --pvalue2), then the reporting p-value is not accurate.

-c, --cutoff=FLOAT
	cut-off for the base pair probabilities. 
	This parameter is applicable to both Mode 1 and 2 (default=`0.01')
  
	Base pair probabilities that are above this cut-off are only considered to
  	compute the Euclidean distance or correlation coefficient between wild-type
  	and mutant.

Parameters associated with mode -M 1:

-l, --minLen=INT
	minimum length of the sequence interval (default=`50')

	The structural difference between wild-type and mutant is computed for all
	sequence intervals with the selected minimum length

Parameters associated with mode -M 2:

-W, --winsize=INT
	Average the pair probabilities over windows of given size (default=`200')

-L, --span=INT
	Set the maximum allowed separation of a base pair to span. 
	i.e. no pairs (i,j) with j-i > L will be allowed. (default=`120')

-X, --regionX=INT
	Length of the local structural element that we expect to have
	an effect (default=`20')

-Y, --regionY=INT
	Length of the interval over which the local structural changes are evaluated, 
	i.e., the maximal span of a base pair (default=`120')
  
  	The functions of each of these parameters are mentioned in the description of
  	mode 2 shown above

Parameters associated with mode -M 3:

--pvalue1=FLOAT
	p-value threshold to filter SNPs that are predicted using Mode 2 (default=`0.4')

--pvalue2=FLOAT
	p-value threshold to filter SNPs that are predicted using Mode 1 (default=`0.1')

-e, --winsizeExt=INT
	size of the flanking region on either side of SNP that includes the local window
	returned by Mode 2. This subsequence is then passed to Mode 1 for re-computation
	(default=`200')

Addition option to compute edist:

-E, --edist=INT
	compute ensemble Euclidean distance between the distribution of structures between
	two sequences (default=`0')

-C, --boltzmannPreFactor=DOUBLE
	Multiply the bolztmann factor with a prefactor alpha (default=`1')

Input formats
Sequence file must contain one sequence (preferably in FASTA format). A sequence of length minimum 200 nts is required to run RNAsnp mode 1, and a minimum length of 400 nts is required to run RNAsnp mode 2 and 3.

SNP file must contain the list of SNPs that are given in separate lines. The SNPs are described as, wild-type nucletodie followed by nucleotide position followed by mutant nucleotide. In case of multiple SNPs, the SNPs are delimited by the special character "-".
Example SNP formats:
for single SNP: A201G
where, A is the wild-type nucleotide in the given sequence, 201 is the sequence position of wild-type nucleotide and G is the mutant (or SNP).
for multiple SNPs: A201G-U257A-C260G
The multiple SNPs (which occurs together) are defined next to each other with the delimiter "-" between them.

Examples

The sequence and SNP files used for the demonstration here are present in the directory 'examples/'
RNAsnp mode 1
1) Test for the effect of single SNP with RNAsnp default mode -m 1
$ RNAsnp -f examples/seq1.txt -s examples/snp1.txt
SNP     W   Slen   GC   interval d_max  p-value interval r_min  p-value
U1013C  200 3344 0.5411 975-1025 0.2432 0.0724  998-1052 0.0615 0.0932
2) Test for the effect of mutiple SNPs with RNAsnp default mode -m 1
$ RNAsnp -f examples/seq2.txt -s examples/snp2.txt
SNP             W   Slen   GC   interval  d_max  p-value interval  r_min  p-value
C9294A-U9296G   200 9605 0.4814 9261-9310 0.1951 0.0749  9268-9317 0.2345 0.1213

RNAsnp mode 2
1) Test for the effect of single SNP with RNAsnp mode 2
$ RNAsnp -f examples/seq1.txt -s examples/snp1.txt -m 2
SNP     w   Slen   GC   max_k d_max  p-value interval   d       p-value
U1013C  200 3344 0.5411 994   4.3961 0.2176  994-1019   0.1265  0.1232
2) Test for the effect of single SNP with RNAsnp mode 2
$ RNAsnp -f examples/seq2.txt -s examples/snp2.txt -m 2
SNP             w   Slen   GC   max_k d_max  p-value interval   d       p-value
C9294A-U9296G   200 9605 0.4814 9270  7.0487 0.0624  9270-9298  0.2463  0.0099

RNAsnp mode 3
1) Screen possible structure-disruptive SNPs in a sequence with default p-value thresholds (pvalue1<0.4 and pvalue2<0.1)
$ RNAsnp -f examples/seq1.txt -m 3
SNP  w  Slen   GC   interval d_max  pvalue1 ewin interval d_max  pvalue2
G1A 200 3344 0.5522  1-39    0.0185 0.2024  200   1-50    0.0961 0.0467
G1C 200 3344 0.5522  1-46    0.0421 0.0755  200   1-50    0.1581 0.0183
....
....
2) Screen putative structure-disruptive SNPs in a sequence with different p-value thresholds (pvalue1<0.1 and pvalue2<0.1)
$ RNAsnp -f examples/seq1.txt -m 3 --pvalue1 0.1 --pvalue2 0.1
SNP  w  Slen   GC  interval d_max  pvalue1 ewin interval d_max  pvalue2
G1C 200 3344 0.5522 1-46    0.0421 0.0755  200   1-50   0.1581  0.0183
G7A 200 3344 0.5556 1-43    0.2236 0.0207  200   1-50   0.1570  0.0996
....
....
Please refer to the REAMDE file from the RNAsnp package to get more details about the ouput.

Datasets


The three different SNP datasets used for the RNAsnp analysis can be downloaded from here. It contains the details of the SNPs, mapped sequences and the RNAsnp output for each dataset.

Changelog

RNAsnp software release and changes,
RNAsnp-1.2 RNAsnp-1.1

References

  1. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. (1994) Fast Folding and Comparison of RNA Secondary Structures. Monatshefte f. Chemie 125: 167-188
  2. Lorenz R, Bernhart SH, Honer zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL (2011) ViennaRNA Package 2.0. Alg. Mol. Biol. 6:26.
If you find this software useful for your research, please cite the following work:

Contact

For any comments or bug reports please contact the authors. Email: sabari@rth.dk, htafer@bioinf.uni-leipzig.de