RNAsnp
Efficient detection of local RNA secondary structure changes induced by SNPs
Download
- RNAsnp-1.2.tar.gz, last modified on Feb 24, 2016. See changelog for details
Installation
After downloading the RNAsnp package, follow the installation steps below:

    tar -xzvf RNAsnp-1.2.tar.gz
    cd RNAsnp-1.2
    ./configure
    make
    make install

Note that the step make install is optional. RNAsnp can also be executed from Progs/RNAsnp.

RNAsnp requires an environment variable named RNASNPPATH to run. The location of the RNAsnp-1.2 directory needs to be assigned to the RNASNPPATH variable. For example, in a bash terminal this can be assigned as:

    export RNASNPPATH='<PATH>/RNAsnp-1.2'

You can add this line to the .bashrc file in your home directory to avoid running the above command in every new terminal before using RNAsnp.
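If RNAsnp is driven from a script, it can help to check RNASNPPATH before launching it. The following is a minimal sketch in Python; the function name resolve_rnasnp_path is hypothetical and not part of the RNAsnp package, and the check only confirms that the variable is set and points to an existing directory.

```python
import os


def resolve_rnasnp_path(env=None):
    """Return the RNASNPPATH directory, or raise if it is unset or missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    (Hypothetical helper, not shipped with RNAsnp.)
    """
    env = os.environ if env is None else env
    path = env.get("RNASNPPATH")
    if not path:
        raise RuntimeError("RNASNPPATH is not set; export it before running RNAsnp")
    if not os.path.isdir(path):
        raise RuntimeError(f"RNASNPPATH points to a missing directory: {path}")
    return path
```

Calling resolve_rnasnp_path() without arguments inspects the current environment.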
Usage
Summary
RNAsnp requires an RNA sequence and, optionally, a list of SNPs to be analyzed. The effect of SNPs on local RNA secondary structure can be detected in three modes:
- Mode 1: The first mode is designed to compute the effect of SNPs by using global folding. This option should be used only for short input sequences, since the base pair probabilities are calculated using RNAfold.
- Mode 2: The second mode is designed to compute the effect of SNPs on large sequences. Here the local base pair probabilities are calculated using RNAplfold (with the parameters -W 200 and -L 120).
- Mode 3: The third mode is the combination of the above two. It is intended to determine the positions of putative structure-disruptive SNPs using either transcript or genome sequence.
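To give a sense of how many candidates Mode 3 has to screen, the sketch below enumerates every single-nucleotide substitution of a sequence in the wild-type/position/mutant notation described under Input formats. It is only an illustration: the function name and the choice of the alphabet ACGU are assumptions, and RNAsnp performs this enumeration internally.

```python
def enumerate_substitutions(sequence, alphabet="ACGU"):
    """Yield every single-nucleotide substitution as <wild-type><position><mutant>.

    Positions are 1-based. The ACGU alphabet is an assumption for illustration.
    """
    for position, wild_type in enumerate(sequence.upper(), start=1):
        for mutant in alphabet:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


snps = list(enumerate_substitutions("GAU"))
print(snps[:3])   # ['G1A', 'G1C', 'G1U']
print(len(snps))  # 9, i.e. three substitutions per position
```

A sequence of length n therefore yields 3n candidates.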
Syntax
    RNAsnp -f <seq_file> -s <snp_file> [options]

General options:

Help:
  -h, --help            Print help and exit
      --detailed-help   Print help, including all details and hidden options, and exit
      --full-help       Print help, including hidden options, and exit
  -V, --version         Print version and exit

Input Options:
  -f, --seq=STRING      File containing the input sequence
      The single input sequence can be provided either in FASTA format or as a linear sequence without any gaps.

  -s, --snp=STRING      File containing the list of SNPs
      The SNPs to be tested have to be provided on separate lines; see the README file for more details about the input format.

  -m, --mode=INT        Select the mode of operation (default=`1')
      1 - perform global folding by using RNAfold and compute the difference in base pair probabilities for all sequence intervals
      2 - perform local folding by using RNAplfold and compute the difference in base pair probabilities for all sequence intervals of fixed length
      3 - screen putative structure-disruptive SNPs in an RNA sequence

      Mode 1 is designed to predict the effect of SNPs on short RNA sequences (i.e., -w parameter is less than or equal to 500), where the base pair probabilities of the wild-type and mutant RNA sequences are calculated using the global folding method RNAfold. The structural difference between wild-type and mutant is computed using Euclidean distance and Pearson correlation measures for all sequence intervals (with minimum size of 50, -l). Finally, the interval with maximum base pair distance or minimum correlation coefficient and the corresponding p-value is reported.

      Mode 2 is designed to predict the effect of SNPs on large RNA sequences. Here, the base pair probabilities are calculated using the local folding method RNAplfold (with -W 200 and -L 120 options). As a first step, the structural difference is calculated using the Euclidean distance measure for all sequence intervals of fixed window length (default: 20, -X), allowing the bases within the window to pair up to a distance of 120 (i.e., the maximal span of a base pair, -Y). In the second step, the sequence interval [u, v] with maximum base pair distance is selected to re-compute the difference for all internal local intervals that start at u. Finally, the interval with maximum base pair distance and the corresponding p-value is reported.

      Mode 3, the combination of modes 1 and 2, is designed to screen all possible structure-disruptive SNPs in an input sequence using a brute-force approach. First, Mode 2 is applied to evaluate the SNP effect for all possible substitutions at every nucleotide position. Second, the SNPs with p-value less than 0.4 (--pvalue1) are subjected to Mode 1 to re-compute the structure effect using a global folding approach. The SNPs that have a significant local structural effect (p-value less than 0.1, --pvalue2) are finally reported.

  -w, --winsizeFold=INT length of flanking sequence on either side of SNP considered for folding (default=`200')
      By default the program uses +/- 200 nts around the SNP position to compute the base pair probabilities in all three modes. This default value can be changed between 100 and 800 (inclusive) in multiples of 50 for Mode 1, and between 200 and 800 (inclusive) in multiples of 50 for Modes 2 and 3. To use a different value, however, please make sure that the input sequence is at least twice the size of the chosen flanking. This restriction is necessary to keep the size of the parameter tables for the p-value calculations manageable. If the input sequence is less than twice the size of the chosen flanking, RNAsnp takes the nts from the SNP position up to the start and end of the given sequence and performs the analysis. However, in this case the reported p-value is not accurate, since the input sequence length does not match the sequence lengths available in the pre-computed parameter tables.
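The constraints on -w described above can be checked before a run. The following Python sketch encodes only the ranges stated in the help text; the function name check_flank_size and the wording of the warning are hypothetical, and RNAsnp itself may report these conditions differently.

```python
def check_flank_size(winsize_fold, mode, sequence_length):
    """Validate a -w value against the documented ranges.

    Returns a list of warnings (empty if everything is fine) and raises
    ValueError for values outside the documented range.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Mode 1 accepts 100-800, Modes 2 and 3 accept 200-800, in steps of 50.
    lower = 100 if mode == 1 else 200
    if not (lower <= winsize_fold <= 800) or winsize_fold % 50 != 0:
        raise ValueError(
            f"-w must be between {lower} and 800 in multiples of 50 for mode {mode}"
        )
    warnings = []
    # The input should be at least twice the flanking size for accurate p-values.
    if sequence_length < 2 * winsize_fold:
        warnings.append(
            "sequence is shorter than twice the flanking size; "
            "the reported p-value may not be accurate"
        )
    return warnings
```

For example, check_flank_size(200, 1, 3344) returns an empty list, while a 300 nt input with -w 200 yields one warning.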
Additional parameters:
The following optional parameters can be provided together with the general options above. However, note that the precomputed background scores, which RNAsnp uses to estimate the p-value, are based on the default values of these parameters. Thus, if the default value is changed for any of the following parameters (except --pvalue1 and --pvalue2), the reported p-value is not accurate.
  -c, --cutoff=FLOAT    cut-off for the base pair probabilities. This parameter is applicable to both Mode 1 and 2 (default=`0.01')
      Only base pair probabilities above this cut-off are considered when computing the Euclidean distance or correlation coefficient between wild-type and mutant.

Parameters associated with mode -m 1:
  -l, --minLen=INT      minimum length of the sequence interval (default=`50')
      The structural difference between wild-type and mutant is computed for all sequence intervals with the selected minimum length.

Parameters associated with mode -m 2:
  -W, --winsize=INT     Average the pair probabilities over windows of given size (default=`200')
  -L, --span=INT        Set the maximum allowed separation of a base pair to span, i.e., no pairs (i,j) with j-i > L will be allowed (default=`120')
  -X, --regionX=INT     Length of the local structural element that we expect to have an effect (default=`20')
  -Y, --regionY=INT     Length of the interval over which the local structural changes are evaluated, i.e., the maximal span of a base pair (default=`120')
      The function of each of these parameters is explained in the description of Mode 2 above.

Parameters associated with mode -m 3:
      --pvalue1=FLOAT   p-value threshold to filter SNPs that are predicted using Mode 2 (default=`0.4')
      --pvalue2=FLOAT   p-value threshold to filter SNPs that are predicted using Mode 1 (default=`0.1')
  -e, --winsizeExt=INT  size of the flanking region on either side of the SNP that includes the local window returned by Mode 2. This subsequence is then passed to Mode 1 for re-computation (default=`200')

Additional options to compute edist:
  -E, --edist=INT       compute ensemble Euclidean distance between the distribution of structures between two sequences (default=`0')
  -C, --boltzmannPreFactor=DOUBLE
                        Multiply the Boltzmann factor with a prefactor alpha (default=`1')
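To illustrate how a probability cut-off and the two difference measures interact, here is a simplified Python sketch. It is not RNAsnp's implementation: the per-position pairing profile, the use of the square-rooted Euclidean distance, and all function names are assumptions made for illustration, and RNAsnp's interval selection and p-value estimation are not reproduced.

```python
import math


def pairing_profile(pair_probs, length, cutoff=0.01):
    """Sum base pair probabilities above `cutoff` into a per-position profile.

    `pair_probs` maps 1-based (i, j) pairs to probabilities (assumed input shape).
    """
    profile = [0.0] * length
    for (i, j), p in pair_probs.items():
        if p > cutoff:  # pairs at or below the cut-off are ignored
            profile[i - 1] += p
            profile[j - 1] += p
    return profile


def euclidean_distance(a, b):
    """Standard Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient; NaN if either profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


# Toy wild-type and mutant ensembles on a 4 nt sequence.
wild = pairing_profile({(1, 4): 0.9, (2, 3): 0.005}, length=4)
mutant = pairing_profile({(1, 4): 0.1}, length=4)
print(euclidean_distance(wild, mutant))
print(pearson_correlation(wild, mutant))
```

In this toy case the (2, 3) pair falls below the cut-off and does not contribute, so the two profiles differ only in magnitude: the distance is large while the correlation stays close to 1.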
Input formats
The sequence file must contain exactly one sequence (preferably in FASTA format). A sequence of at least 200 nts is required to run RNAsnp Mode 1, and at least 400 nts are required to run RNAsnp Modes 2 and 3.
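The length requirements above can be verified before running RNAsnp. The sketch below reads a single sequence from FASTA or plain text and compares its length with the stated minimum for each mode; the helper names are hypothetical and the parsing is deliberately simple.

```python
MIN_LENGTH = {1: 200, 2: 400, 3: 400}  # minimum lengths stated above, per mode


def read_single_sequence(text):
    """Return the sequence from FASTA or plain-text input as one upper-case string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        raise ValueError("input must contain exactly one sequence")
    sequence = "".join(line for line in lines if not line.startswith(">")).upper()
    if not sequence:
        raise ValueError("no sequence found")
    return sequence


def long_enough(sequence, mode):
    """True if the sequence meets the minimum length for the given mode."""
    return len(sequence) >= MIN_LENGTH[mode]


sequence = read_single_sequence(">example\nacgu\nACGU\n")
print(sequence)                  # ACGUACGU
print(long_enough(sequence, 1))  # False, since 8 nts is below the 200 nt minimum
```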
The SNP file must contain the list of SNPs, given on separate lines. Each SNP is described as the wild-type nucleotide, followed by the nucleotide position, followed by the mutant nucleotide. Multiple SNPs are delimited by the special character "-".
Example SNP formats:
for single SNP: A201G
where, A is the wild-type nucleotide in the given sequence, 201 is the sequence position of wild-type nucleotide and G is the mutant (or SNP).
for multiple SNPs: A201G-U257A-C260G
Multiple SNPs (which occur together) are written next to each other with the delimiter "-" between them.
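The SNP notation can be parsed and sanity-checked against the input sequence before a run. The following Python sketch is an illustration only: the function names are hypothetical, and accepting T in addition to U is an assumption rather than documented RNAsnp behaviour.

```python
import re

# One SNP: wild-type nucleotide, 1-based position, mutant nucleotide.
# Accepting T as well as U is an assumption made for this sketch.
SNP_PATTERN = re.compile(r"^([ACGUT])(\d+)([ACGUT])$")


def parse_snps(entry):
    """Split an entry such as 'A201G-U257A-C260G' into (wild, position, mutant) tuples."""
    snps = []
    for token in entry.strip().upper().split("-"):
        match = SNP_PATTERN.match(token)
        if match is None:
            raise ValueError(f"malformed SNP: {token!r}")
        wild, position, mutant = match.groups()
        snps.append((wild, int(position), mutant))
    return snps


def apply_snps(sequence, snps):
    """Return the mutant sequence after checking each wild-type nucleotide."""
    bases = list(sequence.upper())
    for wild, position, mutant in snps:
        if not 1 <= position <= len(bases):
            raise ValueError(f"position {position} is outside the sequence")
        if bases[position - 1] != wild:
            raise ValueError(
                f"expected {wild} at position {position}, found {bases[position - 1]}"
            )
        bases[position - 1] = mutant
    return "".join(bases)


print(parse_snps("A201G-U257A-C260G"))
# [('A', 201, 'G'), ('U', 257, 'A'), ('C', 260, 'G')]
print(apply_snps("GACU", parse_snps("A2G-U4A")))  # GGCA
```

Checking the wild-type nucleotide catches off-by-one errors in the position before any folding is done.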
Examples
The sequence and SNP files used for the demonstration here are present in the directory 'examples/'.
RNAsnp mode 1
1) Test for the effect of single SNP with RNAsnp default mode -m 1
    $ RNAsnp -f examples/seq1.txt -s examples/snp1.txt
    SNP     W    Slen  GC      interval  d_max   p-value  interval  r_min   p-value
    U1013C  200  3344  0.5411  975-1025  0.2432  0.0724   998-1052  0.0615  0.0932

2) Test for the effect of multiple SNPs with RNAsnp default mode -m 1

    $ RNAsnp -f examples/seq2.txt -s examples/snp2.txt
    SNP            W    Slen  GC      interval   d_max   p-value  interval   r_min   p-value
    C9294A-U9296G  200  9605  0.4814  9261-9310  0.1951  0.0749   9268-9317  0.2345  0.1213
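Output like the Mode 1 results above can be read into records for further filtering. The sketch below assumes that the output consists of a whitespace-separated header line followed by one row per SNP, as in the examples here; the real program may print additional lines, so treat the parser (and its name parse_output) as a starting point only. Because the header repeats some column names, repeated names receive a numeric suffix.

```python
def parse_output(text):
    """Parse whitespace-separated RNAsnp output into a list of dicts.

    Repeated column names get a numeric suffix (interval, interval_2, ...).
    Values are kept as strings.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    seen, columns = {}, []
    for name in header:
        seen[name] = seen.get(name, 0) + 1
        columns.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return [dict(zip(columns, row)) for row in rows]


example = """SNP W Slen GC interval d_max p-value interval r_min p-value
U1013C 200 3344 0.5411 975-1025 0.2432 0.0724 998-1052 0.0615 0.0932"""
record = parse_output(example)[0]
print(record["d_max"], record["p-value"], record["r_min"], record["p-value_2"])
# 0.2432 0.0724 0.0615 0.0932
```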
RNAsnp mode 2
1) Test for the effect of single SNP with RNAsnp mode 2
    $ RNAsnp -f examples/seq1.txt -s examples/snp1.txt -m 2
    SNP     w    Slen  GC      max_k  d_max   p-value  interval  d       p-value
    U1013C  200  3344  0.5411  994    4.3961  0.2176   994-1019  0.1265  0.1232

2) Test for the effect of multiple SNPs with RNAsnp mode 2

    $ RNAsnp -f examples/seq2.txt -s examples/snp2.txt -m 2
    SNP            w    Slen  GC      max_k  d_max   p-value  interval   d       p-value
    C9294A-U9296G  200  9605  0.4814  9270   7.0487  0.0624   9270-9298  0.2463  0.0099
RNAsnp mode 3
1) Screen possible structure-disruptive SNPs in a sequence with default p-value thresholds (pvalue1<0.4 and pvalue2<0.1)
    $ RNAsnp -f examples/seq1.txt -m 3
    SNP  w    Slen  GC      interval  d_max   pvalue1  ewin  interval  d_max   pvalue2
    G1A  200  3344  0.5522  1-39      0.0185  0.2024   200   1-50      0.0961  0.0467
    G1C  200  3344  0.5522  1-46      0.0421  0.0755   200   1-50      0.1581  0.0183
    ....
    ....

2) Screen putative structure-disruptive SNPs in a sequence with different p-value thresholds (pvalue1<0.1 and pvalue2<0.1)

    $ RNAsnp -f examples/seq1.txt -m 3 --pvalue1 0.1 --pvalue2 0.1
    SNP  w    Slen  GC      interval  d_max   pvalue1  ewin  interval  d_max   pvalue2
    G1C  200  3344  0.5522  1-46      0.0421  0.0755   200   1-50      0.1581  0.0183
    G7A  200  3344  0.5556  1-43      0.2236  0.0207   200   1-50      0.1570  0.0996
    ....
    ....

Please refer to the README file in the RNAsnp package for more details about the output.
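When many such runs are scripted, the command lines shown in this section can be assembled programmatically. The following is a small Python sketch that uses only the options documented on this page (-f, -s, -m, --pvalue1, --pvalue2). The function build_command is hypothetical, and its rule that Modes 1 and 2 need an SNP file is an assumption drawn from the examples, not a documented constraint.

```python
def build_command(seq_file, snp_file=None, mode=1, pvalue1=None, pvalue2=None):
    """Assemble an RNAsnp argument list from the options documented on this page."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Assumption: the Mode 1 and 2 examples always pass an SNP file.
    if mode in (1, 2) and snp_file is None:
        raise ValueError("modes 1 and 2 need an SNP file")
    command = ["RNAsnp", "-f", seq_file]
    if snp_file is not None:
        command += ["-s", snp_file]
    command += ["-m", str(mode)]
    if mode == 3:
        # The p-value thresholds only apply to the Mode 3 screen.
        if pvalue1 is not None:
            command += ["--pvalue1", str(pvalue1)]
        if pvalue2 is not None:
            command += ["--pvalue2", str(pvalue2)]
    return command


print(build_command("examples/seq1.txt", mode=3, pvalue1=0.1, pvalue2=0.1))
# ['RNAsnp', '-f', 'examples/seq1.txt', '-m', '3', '--pvalue1', '0.1', '--pvalue2', '0.1']
```

The resulting list can be passed to subprocess.run once RNAsnp is installed and RNASNPPATH is set.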
Datasets
The three different SNP datasets used for the RNAsnp analysis can be downloaded from here. It contains the details of the SNPs, mapped sequences and the RNAsnp output for each dataset.
Changelog
RNAsnp software releases and changes:
RNAsnp-1.2
- Feb 24, 2016 - Updated the README file and help page with a detailed description of each parameter and its usage. Also added warnings that are shown when the default values of additional parameters are changed, because this can affect the accuracy of the reported p-value.
- Dec 03, 2012 - first release
- Apr 30, 2013 - Fixed an installation issue in which 'make install' overwrote the library file "libRNA.a" of a previously installed Vienna RNA package.
- Jul 16, 2014 - Updated copyright details
References
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. (1994) Fast Folding and Comparison of RNA Secondary Structures. Monatshefte f. Chemie 125: 167-188
- Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL (2011) ViennaRNA Package 2.0. Alg. Mol. Biol. 6:26.
- Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. (2013) RNAsnp: Efficient detection of local RNA secondary structure changes induced by SNPs. Human Mutation 34: 546-556