RPASuite: a computational pipeline to analyze the processing pattern of RNA
RPASuite (RNA Processing Analysis Suite) is a computational pipeline to identify differentially and coherently processed transcripts using RNA-seq data obtained from multiple tissues or cell lines.
- Differentially Processed Loci (DPL): genomic loci that encode transcripts which undergo different post-transcriptional processing in a subset of tissues or cell lines compared to the rest, for example miRNA arm switching.
- Coherently Processed Loci (CPL): genomic loci that encode transcripts which undergo similar post-transcriptional processing in all tissues or cell lines under study, for example a miRNA processed consistently from the same pre-miRNA strand.
Download
- RPASuite v0.02 release (Jul 10, 2015)
  - analysis can be run in parallel, meaning a simultaneous run for each chromosome (-x parameter)
  - now uses featureCounts in place of htseq-count, leading to reduced computational time
  - bug fix for an issue that sometimes led to inconsistent boundaries of blocks within a block group
- RPASuite v0.01 release (Jan 29, 2015)
Programs and datasets
The directory 'RPASuite_v0.02' contains the following programs
1. rnaProcessingAna: main program to run the pipeline (requires bash environment)
2. getBG: program to define block groups or read profiles using blockbuster
3. bam2bed.pl: program to convert bam into bed format
4. blockbuster.x: program to group closely spaced reads into read profiles
5. flagCluster.pl: program to flag read profiles with their genomic annotation
6. validateBlockbuster.pl: program to reformat the unique id of read profiles
7. compBlockStat.pl: program to compute various properties of read profiles such as entropy
8. indexBed.sh: program to index the mapped reads in bed format for fast access
9. rnaExpAna.pl: program to analyze the read profiles for differential and coherent processing
10. estimateSizeFactor.pl: program to perform RLE normalization of read profile expression
11. organize2bed.pl: program to organize results from multiple files into a single file
12. plotDiffProcessing.pl: program for graphical visualization of results
13. deepBlockAlign.x: program to align two read profiles
Installation
To install RPASuite, download RPASuite.tar.gz and unpack it. A directory, RPASuite, will be created
1. tar -zxvf RPASuite.tar.gz
Now compile and create executables of deepBlockAlign and blockbuster
2. make or make all
Export environment variable 'RPAPATH' containing path to RPASuite installation directory
3. export RPAPATH=<path to RPAsuite installation directory>
Add 'RPAPATH' to your 'PATH' environment variable
4. export PATH=$PATH:$RPAPATH/bin
Add 'RPAPATH' to your 'PERL5LIB' environment variable
5. export PERL5LIB=$PERL5LIB:$RPAPATH/share/perl/
Setup a local python path (obsolete, only required for v0.01)
6. export PYTHONPATH=$HOME/lib/python2.7/site-packages
To permanently add or update the environment variable(s), add the above four export commands to your ~/.bashrc file
Dependency
We assume that the following software is installed and working: perl, python, R, latex, and gcc. At a minimum, you would need to do the following on an Ubuntu system:
sudo apt-get install build-essential python2.7-dev python-numpy \
python-matplotlib libxml2-dev libgd-perl texlive
Install the needed perl modules:
sudo cpan List::Util Tie::IxHash Statistics::Descriptive \
Statistics::RankCorrelation List::Compare GD::Simple
Install the needed R modules by entering R (type R on the command line) and then enter the following three commands (follow the instructions on the screen):
install.packages(c("clValid", "session", "amap", "pvclust", "snow", "optparse", "XML"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("ctc", "DESeq"))
Download samtools, go to the download location and do:
tar xjf samtools-1.1.tar.bz2
cd samtools-1.1
make -j10 prefix=$HOME install
Download bedtools, go to the download location and do:
tar xzf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0/
make -j 10
cp bin/* $HOME/bin
Download featureCounts (subread), go to the download location and do:
tar xzf subread-1.4.6-p3-Linux-x86_64.tar.gz
cd subread-1.4.6-p3-Linux-x86_64
cp bin/featureCounts $HOME/bin
Download bedGraphToBigWig for your operating system, go to the download location and do:
cp bedGraphToBigWig $HOME/bin
chmod 755 $HOME/bin/bedGraphToBigWig
Install pysam using the pip python installer (obsolete, only required for v0.01):
pip install pysam
Download htseq-count, go to the download location and do (obsolete, only required for v0.01):
tar zxf HTSeq-0.6.1.tar.gz
cd HTSeq-0.6.1/
python setup.py install --prefix=$HOME
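Before running the pipeline, it may help to confirm that the tools named in the dependency steps above can be found on PATH. The following is a minimal sketch, not part of RPASuite; the function name missing_tools is hypothetical, and the example tool list assumes the executables are installed under their usual names.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with RPASuite).
# Prints every given command that cannot be found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example call (tool names assumed from the dependency steps):
# missing_tools perl R gcc samtools bedtools featureCounts bedGraphToBigWig
```

An empty output means every listed command was found; any printed name points at a dependency step that still needs attention.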
Genome annotations
The pipeline also requires the genomic coordinates of RNA and loci annotation in the reference genome. This is organized in two separate files containing
1. annotations/rna.<assembly>.bed.format.gz: genomic coordinates of non-coding RNA in the genome assembly.
2. annotations/loci.<assembly>.bed.format.gz: genomic coordinates of loci (exon, intron, UTRs) in the genome assembly.
Currently, annotations for four assemblies (hg19, hg38, mm9 and mm10) are available with the download. The genomic coordinates of the respective annotations (RNA or loci) are based on the Ensembl annotations downloaded from ftp://ftp.ensembl.org/pub/
1. hg19: Ensembl release 75
2. hg38: Ensembl release 78
3. mm9: Ensembl release 67
4. mm10: Ensembl release 78
To build the two annotation files for a new genome assembly, we provide a script, ensembl2annotation, that takes annotation files from Ensembl as input. Example usage to build RNA and loci files:
1. Create a directory within annotations with a unique assembly identifier like mm10
2. Download the Ensembl ncRNA annotation, like ftp://ftp.ensembl.org/pub//release-78/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz (this example URL is the human file; the command in step 3 uses the mouse file Mus_musculus.GRCm38.ncrna.fa.gz)
3. ensembl2annotation -i Mus_musculus.GRCm38.ncrna.fa.gz -r 78 | gzip > rna.mm10.bed.format.gz
4. Download the Ensembl gene annotation, like ftp://ftp.ensembl.org/pub//release-78//gtf/mus_musculus/Mus_musculus.GRCm38.78.gtf.gz
5. ensembl2annotation -i Mus_musculus.GRCm38.78.gtf.gz -r 78 -l | gzip > loci.mm10.bed.format.gz
6. Add the new file information to the rnaProcessingAna script ("initialize genome annotation files" section)
Usage
RPASuite is called with the following parameters
rnaProcessingAna -i <input bam files separated by a comma> [OPTIONS]
Available options are:
Program: rnaProcessingAna (analyze multiple short RNA-seq samples to detect differential and coherent RNA processing)
Author: RTH, University of Copenhagen, Denmark
Version: 0.02
Contact: sachin@rth.dk
Usage: rnaProcessingAna -i <files> [OPTIONS]
 -i <file>    [read alignment files in BAM format separated by a comma (minimum six)]
[OPTIONS]
 -o <dir>     [output directory (default: rpa)]
 -g <string>  [genome assembly (default: hg19)]
              [currently supported: mm9, mm10, hg19 and hg38]
 -r           [biological replicates are present]
 -x           [run in parallel, meaning simultaneous analysis for each chromosome using one processor]
 -h           [help]
[OPTIONS (block group generation)]
 -d <int>     [minimum distance between two block groups (default: 50)]
 -u <int>     [minimum reads in the block group (default: 10)]
 -l <int>     [minimum reads in the block (default: 10)]
 -e <float>   [stddev scale for mapped reads (default: 0.5)]
 -t <string>  [minimum block reads threshold (abs or rel, default: rel)]
 -q <int>     [maximum length of the block groups (default: 500)]
[OPTIONS (rna processing analysis)]
 -p <int>     [fraction of samples in which block group should be observed (default: 1)]
              [1: all samples]
 -c <float>   [cluster score threshold (default: 0.15)]
 -a <float>   [p-value threshold for fisher's exact test (default: 0.05)]
 -n           [create image file for differentially and coherently processed loci in pdf format]
              [this is time consuming and takes considerable hard disk space]
Example
A usage example of RPASuite is shown below. As input, the pipeline requires mapped reads in BAM format. Example dataset files are provided with the download.
rnaProcessingAna -i \
data/test_data/BloodGm12878Chr6.bam,data/test_data/BrainSknshraChr6.bam,\
data/test_data/BreastMcf7Chr6.bam,data/test_data/CervixHelas3Chr6.bam,\
data/test_data/EpitheliumA549Chr6.bam,data/test_data/EscH1hescChr6.bam \
-o data/test_run &>data/run.log
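Typing the comma-separated list for -i by hand is error-prone, so the list can also be assembled from file names. The sketch below is an illustration only; the helper name bam_list is hypothetical, and the check for six files simply mirrors the "minimum six" note in the help text of rnaProcessingAna.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with RPASuite).
# Joins all arguments with commas, producing a value suitable for -i.
# Fails if fewer than six files are given (minimum stated in the help text).
bam_list() {
    if [ "$#" -lt 6 ]; then
        echo "need at least six BAM files, got $#" >&2
        return 1
    fi
    list=""
    for f in "$@"; do
        if [ -z "$list" ]; then
            list="$f"
        else
            list="$list,$f"
        fi
    done
    printf '%s\n' "$list"
}

# Example call, assuming data/test_data holds only the BAM files to analyze:
# rnaProcessingAna -i "$(bam_list data/test_data/*.bam)" -o data/test_run
```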
Input
As input, the pipeline requires mapped reads in BAM format. The name of the input files should be formatted as
Input file name: <unique id>.bam (example: BloodGm12878.bam)
If biological or technical replicates (at most two) of a sample are present, the replicate information should be provided in the file name as
Input file name (replicate 1): <unique id><Rep1>.bam (example: BloodGm12878Rep1.bam)
Input file name (replicate 2): <unique id><Rep2>.bam (example: BloodGm12878Rep2.bam)
Note: The chromosome identifier in the input BAM files should start with chr, for example chrY and not Y.
Note: <unique id> should contain only alphanumeric characters.
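The naming rule above can be checked before starting a run. The following is a rough sketch under the assumption that a valid name is an alphanumeric identifier followed by .bam (replicate names such as BloodGm12878Rep1.bam are alphanumeric as well, so they pass); the function name valid_bam_name is hypothetical and not part of RPASuite.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with RPASuite).
# Returns 0 if the file name (without directory) is <unique id>.bam
# and <unique id> consists only of alphanumeric characters.
valid_bam_name() {
    name=$(basename "$1")
    case "$name" in
        *.bam) id=${name%.bam} ;;
        *) return 1 ;;
    esac
    # Reject an empty identifier such as ".bam".
    [ -n "$id" ] || return 1
    # Reject any character outside A-Z, a-z and 0-9.
    case "$id" in
        *[!A-Za-z0-9]*) return 1 ;;
    esac
    return 0
}

# Example call:
# valid_bam_name data/test_data/BloodGm12878Chr6.bam && echo "name ok"
```

The chr prefix requirement is a property of the BAM header rather than the file name; it can be inspected separately with samtools view -H, which prints the @SQ lines listing the chromosome identifiers.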
Output
The results from the RPASuite are compiled in two text files:
a) RESULTS_DPL.TXT: It contains a list of differentially processed loci (DPL) in tab delimited format. Each line contains information for a differentially processed locus, organized into the following columns
1. chr: chromosome
2. start: start coordinate
3. end: end coordinate
4. id: unique id
5. clusterScore: measures how well separated the read profiles within a cluster are from the rest of the profiles
6. strand: genomic strand
7. fScore: number of read profile clusters that showed a significant p-value computed using Fisher's exact test
8. dbaScore: mean all vs all alignment score of read profiles
9. annotation: genomic annotation
10. loci: genomic loci
11. blocks: number of read blocks per read profile
12. meanEntropy: mean randomness in the arrangement of reads within the read profiles
13. meanExpr: mean normalized expression of read profiles
14. perObs: fraction of samples in which a read profile is observed (1 means all)
15. clusters: number of read profile clusters computed using pvclust at p-value < 0.05
16. clusterInfo: comma separated list of clusters to which each read profile belongs
A locus is defined as differentially processed if its clusterScore ≥ 0.15 and fScore ≥ 1 (default)
b) RESULTS_CPL.TXT: It contains a list of coherently processed loci (CPL) in tab delimited format. Each line contains information for a coherently processed locus, organized in the same 16 columns as for DPL listed above. A locus is defined as coherently processed, if its clusterScore < 0.15 and dbaScore > 0.8 (default)
For easy access, HTML versions of the two result files (RESULTS_DPL.HTML and RESULTS_CPL.HTML) are also provided within the output directory
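Because the result files are tab delimited, the default definitions can be re-applied, or tightened, with awk. The sketch below assumes the column order listed for the result files (clusterScore in column 5, fScore in column 7, dbaScore in column 8); the function names filter_dpl and filter_cpl are hypothetical, and rows whose fifth column is not numeric (for instance a header line, if one is present) are skipped.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with RPASuite).

# Print rows with clusterScore >= $2 (default 0.15) and fScore >= $3 (default 1).
filter_dpl() {
    awk -F '\t' -v c="${2:-0.15}" -v f="${3:-1}" \
        '$5 ~ /^[0-9.eE+-]+$/ && $5 + 0 >= c + 0 && $7 + 0 >= f + 0' "$1"
}

# Print rows with clusterScore < $2 (default 0.15) and dbaScore > $3 (default 0.8).
filter_cpl() {
    awk -F '\t' -v c="${2:-0.15}" -v d="${3:-0.8}" \
        '$5 ~ /^[0-9.eE+-]+$/ && $5 + 0 < c + 0 && $8 + 0 > d + 0' "$1"
}

# Example calls, assuming the output directory data/test_run:
# filter_dpl data/test_run/RESULTS_DPL.TXT 0.3 2 | cut -f1-4
# filter_cpl data/test_run/RESULTS_CPL.TXT 0.1 0.9 | cut -f1-4
```

Passing stricter thresholds as the second and third arguments narrows the list without rerunning the pipeline.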
Contact
For queries, please contact sachin@rth.dk or gorodkin@rth.dk
Citation
Pundhir S and Gorodkin J. (2015) Differential and coherent processing patterns from small RNAs. Sci Rep. 5:12062. [PMID 26166713]
License
RPASuite: a computational pipeline to analyze the processing pattern of RNA
Copyright (C) 2014 Sachin Pundhir (sachin@rth.dk)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.