RPASuite: a computational pipeline to analyze the processing pattern of RNA
RPASuite (RNA Processing Analysis Suite) is a computational pipeline to identify differentially and coherently processed transcripts
using RNA-seq data obtained from multiple tissue or cell lines.
- Differentially Processed Loci (DPL): genomic loci which encode for transcripts that undergoes different post-transcriptional processing for a subset of tissues or cell lines in comparison to the rest, example miRNA arm switching.
- Coherently Processed Loci (CPL): genomic loci which encode for transcripts that undergoes similar post-transcriptonal processing for all tissues or cell lines under study, example miRNA processed consistently from the same pre-miRNA strand.
Download
- RPASuite v0.02 release (Jul 10, 2015)
- analysis can be run in parallel, meaning simulataneous run for each chromosome (-x parameter)
- now uses featureCount in place of htseq-count leading to reduced computational time
- bug fix that was sometimes leading to inconsistent bourndaries of blocks within block group
- RPASuite v0.01 release (Jan 29, 2015)
Programs and datasets
The directory 'RPASuite_v0.02' contains following programs
1. rnaProcessingAna: main program to run the pipeline (requires bash environment) 2. getBG: program to define block groups or read profiles using blockbuster 3. bam2bed.pl: program to convert bam into bed format 4. blockbuster.x: program to group closely spaced reads into read profiles 5. flagCluster.pl: program to flag read profiles with their genomic annotation 6. validateBlockbuster.pl: program to reformat the unique id of read profiles 7. compBlockStat.pl: program to compute variuos properties of read profiles such as entropy 8. indexBed.sh: program to index the mapped reads in bed format for fast access 9. rnaExpAna.pl: program to analyze the reads profiles for differential and coherent processing 10. estimateSizeFactor.pl: program to perform RLE normalization of read profile expression 11. organize2bed.pl: program to organize results from multiple files into a single file 12. plotDiffProcessing.pl: program for graphical visualization of results 13. deepBlockAlign.x: program to align two read profiles
Installation
To install RPASuite, download RPASuite.tar.gz and unpack it. A directory, RPASuite will be created
Now compile and create executables of deepBlockAlign and blockbuster1. tar -zxvf RPASuite.tar.gz
Export environment variable 'RPAPATH' containing path to RPASuite installation directory2. make or make all
Add 'RPAPATH' to your 'PATH' environment variable3. export RPAPATH=<path to RPAsuite installation directory>
Add 'RPAPATH' to your 'PERL5LIB' environment variable4. export PATH=$PATH:$RPAPATH/bin
Setup a local python path (obsolete, only required for v0.01)5. export PERL5LIB=$PERL5LIB:$RPAPATH/share/perl/
To permanently add or update the environment variable(s), add above four export commands in your ~/.bashrc file6. export PYTHONPATH=$HOME/lib/python2.7/site-packages
Dependency
We assume that the following softwares are intalled and working: perl, python, R, latex, and gcc. As a minimum you would need to do the following on a ubuntu system
Install the needed perl modulessudo apt-get install build-essential python2.7-dev python-numpy\ python-matplotlib libxml2-dev libgd-perl texlive
Install the needed R modules by entering R (type R on the command line) and then enter the following three commands (follow the instructions on the screen):sudo cpan List::Util Tie::IxHash Statistics::Descriptive\ Statistics::RankCorrelation List::Compare GD::Simple
Download samtools, go to the download location and doinstall.packages(c("clValid", "session", "amap", "pvclust", "snow", "optparse", "XML")) source("http://bioconductor.org/biocLite.R") biocLite(c("ctc", "DESeq"))
Download bedtools, go to the download location and dotar xjf samtools-1.1.tar.bz2 cd samtools-1.1 make -j10 prefix=$HOME install
Download featureCounts (subread), go to the download location and dotar xzf BEDTools.v2.17.0.tar.gz cd bedtools-2.17.0/ make -j 10 cp bin/* $HOME/bin
Download bedGraphToBigWig for your operating system, go to the download location and dotar xzf subread-1.4.6-p3-Linux-x86_64.tar.gz cd subread-1.4.6-p3-Linux-x86_64 cp bin/featureCounts $HOME/bin
Install pysam the pip python installer: (obsolete, only required for v0.01)cp bedGraphToBigWig $HOME/bin chmod 755 $HOME/bin/bedGraphToBigWig
Download htseq-count, go to the download location and do (obsolete, only required for v0.01)pip install pysam
tar zxf HTSeq-0.6.1.tar.gz cd HTSeq-0.6.1/ python setup.py install --prefix=$HOME
Genome annotations
The pipeline also requires the genomic coordinates of RNA and loci annotation in the reference genome. This is organized in two separate files containing
1. annotations/rna.<assembly>.bed.format.gz
genomic coordinates of non-coding RNA in the genome assembly.
2. annotations/loci.<assembly>.bed.format.gz
genomic coordinates of loci (exon, intron, UTRs) in the genome assembly.
Currently, the annotation for four assemblies: hg19, hg38, mm9 and mm10 is available with the download. The genomic coordinates of the respective annotations (RNA or loci) are based on the Ensembl annotations downloaded from ftp://ftp.ensembl.org/pub/
1. hg19: Ensembl release 75 2. hg38: Ensembl release 78 3. mm9: Ensembl release 67 4. mm10: Ensembl release 78To build the two annotation files for a new genome assembly, we provide a script ensembl2annotation that takes annotation files from Ensembl as input. Example usage to build RNA and loci files:
1. create a directory within annotations with a unique assembly identifier like mm10
2. Download Ensembl ncRNA annotation like
ftp://ftp.ensembl.org/pub//release-78/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
3. ensembl2annotation -i Mus_musculus.GRCm38.ncrna.fa.gz -r 78 | gzip > rna.mm10.bed.format.gz
4. Download Ensembl gene annotation like
ftp://ftp.ensembl.org/pub//release-78//gtf/mus_musculus/Mus_musculus.GRCm38.78.gtf.gz
5. ensembl2annotation -i Mus_musculus.GRCm38.78.gtf.gz -r 78 -l | gzip > loci.mm10.bed.format.gz
6. Add new file information in rnaProcessingAna script ("initialize genome annotation files" section)
Usage
RPASuite is called with the following parameters
rnaProcessingAna -i <input bam files separated by a comma> [OPTIONS]
Available options are:
Program: rnaProcessingAna (analyze multiple short RNA-seq samples to detect
differential and coherent RNA processing)
Author: RTH, University of Copenhagen, Denmark
Version: 0.02
Contact: sachin@rth.dk
Usage: rnaProcessingAna -i <files> [OPTIONS]
-i <file> [read alignment files in BAM format separated by a comma (minimum six)]
[OPTIONS]
-o <dir> [output directory (default: rpa)]
-g <string> [genome assembly (default: hg19)]
[currently supported: mm9, mm10, hg19 and hg38]
-r [biological replicates are present]
-x [run in parallel, meaning simultaneous analysis for each chromosome using one processor]
-h [help]
[OPTIONS (block group generation)]
-d <int> [minimum distance between two block groups (default: 50)]
-u <int> [minimum reads in the block group (default: 10)]
-l <int> [minimum reads in the block (default: 10)]
-e <float> [stddev scale for mapped reads (default: 0.5)]
-t <string> [minimum block reads threshold (abs or rel, default: rel)]
-q <int> [maximum length of the block groups (default: 500)]
[OPTIONS (rna processing analysis)]
-p <int> [fraction of samples in which block group should be observed (default: 1)]
[1: all samples]
-c <float> [cluster score threshold (default: 0.15)]
-a <float> [p-value threshold for fisher's exact test (default: 0.05)]
-n [create image file for differentially and coherently processed loci in pdf format]
[this is time consuming and takes considerable hard disk space]
Example
An usage example of RPASuite is shown below. As input, the pipeline requires mapped reads in BAM format. Example dataset files are provided with the download
rnaProcessingAna -i\ data/test_data/BloodGm12878Chr6.bam,data/test_data/BrainSknshraChr6.bam,\ data/test_data/BreastMcf7Chr6.bam,data/test_data/CervixHelas3Chr6.bam,\ data/test_data/EpitheliumA549Chr6.bam,data/test_data/EscH1hescChr6.bam\ -o data/test_run &>data/run.log
Input
As input, the pipeline requires mapped reads in BAM format. The name of the input files should be formatted as
Note: <unique id> should contain only alphanumeric characters.
Input file name: <unique id>.bam (example: BloodGm12878.bam)If biological or technical replicates (atmost two) of a sample are present, the replicate information should be provided in the file name as
Input file name (replicate 1): <unique id><Rep1>.bam (example: BloodGm12878Rep1.bam) Input file name (replicate 2): <unique id><Rep2>.bam (example: BloodGm12878Rep2.bam)Note: The chromosome identifier in the input BAM files should start with chr, for example as chrY and not like Y.
Note: <unique id> should contain only alphanumeric characters.
Output
The results from the RPASuite are compiled in two text files:
a) RESULTS_DPL.TXT: It contains a list of differentially processed loci (DPL) in tab delimited format. Each line contains information for a differentially processed locus, organized into following columns
b) RESULTS_CPL.TXT: It contains a list of coherently processed loci (CPL) in tab delimited format. Each line contains information for a coherently processed locus, organized in the same 16 columns as for DPL listed above. A locus is defined as coherently processed, if its clusterScore < 0.15 and dbaScore > 0.8 (default)
For easy access, the html version of the two result files (RESULTS_DPL.HTML and RESULTS_CPL.HTML) are also provided within the output directory
a) RESULTS_DPL.TXT: It contains a list of differentially processed loci (DPL) in tab delimited format. Each line contains information for a differentially processed locus, organized into following columns
1. chr: chromosome
2. start: start coordinate
3. end: end coordinate
4. id: unique id
5. clusterScore: measures how well separated read profiles within a cluster
are to rest of the profiles
6. strand: genomic strand
7. fScore: number of read profile clusters that showed significant p-value
computed using Fisher's exact test
8. dbaScore: mean all vs all alignment score of read profiles
9. annotation: genomic annotation
10. loci: genomic loci
11. blocks: number of read blocks per read profile
12. meanEntropy: mean randomness in the arrangement of reads within the read profiles
13. meanExpr: mean normalized expression of reads profiles
14. perObs: fraction of samples in which a read profile is oberved (1 means all)
15. clusters: number of read profile clusters computed using pvclust at p-value < 0.05
16. clusterInfo: comma separated list of clusters to which each read profile belongs
A locus is defined as differentially processed, if its clusterScore ≥ 0.15 and fScore ≥ 1 (default)b) RESULTS_CPL.TXT: It contains a list of coherently processed loci (CPL) in tab delimited format. Each line contains information for a coherently processed locus, organized in the same 16 columns as for DPL listed above. A locus is defined as coherently processed, if its clusterScore < 0.15 and dbaScore > 0.8 (default)
For easy access, the html version of the two result files (RESULTS_DPL.HTML and RESULTS_CPL.HTML) are also provided within the output directory
Contact
For queries, please contact sachin@rth.dk or gorodkin@rth.dk
Citation
Pundhir S and Gorodkin J. (2015) Differential and coherent processing patterns from small RNAs. Sci Rep. 5:12062. [PMID 26166713]
License
RPASuite: a computational pipeline to analyze the processing pattern of RNA Copyright (C) 2014 Sachin Pundhir (sachin@rth.dk) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.