RPASuite: a computational pipeline to analyze the processing pattern of RNA

RPASuite (RNA Processing Analysis Suite) is a computational pipeline to identify differentially and coherently processed transcripts using RNA-seq data obtained from multiple tissue or cell lines.

Download


Programs and datasets


The directory 'RPASuite_v0.02' contains following programs
1.  rnaProcessingAna: main program to run the pipeline (requires bash environment)
2.  getBG:  program to define block groups or read profiles using blockbuster
3.  bam2bed.pl:  program to convert bam into bed format
4.  blockbuster.x:  program to group closely spaced reads into read profiles
5.  flagCluster.pl:  program to flag read profiles with their genomic annotation
6.  validateBlockbuster.pl:  program to reformat the unique id of read profiles
7.  compBlockStat.pl:  program to compute variuos properties of read profiles such as entropy
8.  indexBed.sh:  program to index the mapped reads in bed format for fast access
9.  rnaExpAna.pl:  program to analyze the reads profiles for differential and coherent processing
10. estimateSizeFactor.pl:  program to perform RLE normalization of read profile expression
11. organize2bed.pl:  program to organize results from multiple files into a single file
12. plotDiffProcessing.pl:  program for graphical visualization of results
13. deepBlockAlign.x: program to align two read profiles

Installation


To install RPASuite, download RPASuite.tar.gz and unpack it. A directory, RPASuite will be created
1. tar -zxvf RPASuite.tar.gz
Now compile and create executables of deepBlockAlign and blockbuster
2. make or make all
Export environment variable 'RPAPATH' containing path to RPASuite installation directory
3. export RPAPATH=<path to RPAsuite installation directory>
Add 'RPAPATH' to your 'PATH' environment variable
4. export PATH=$PATH:$RPAPATH/bin
Add 'RPAPATH' to your 'PERL5LIB' environment variable
5. export PERL5LIB=$PERL5LIB:$RPAPATH/share/perl/
Setup a local python path (obsolete, only required for v0.01)
6. export PYTHONPATH=$HOME/lib/python2.7/site-packages
To permanently add or update the environment variable(s), add above four export commands in your ~/.bashrc file

Dependency


We assume that the following softwares are intalled and working: perl, python, R, latex, and gcc. As a minimum you would need to do the following on a ubuntu system
sudo apt-get install build-essential python2.7-dev python-numpy\
python-matplotlib libxml2-dev libgd-perl texlive
Install the needed perl modules
sudo cpan List::Util Tie::IxHash Statistics::Descriptive\
Statistics::RankCorrelation List::Compare GD::Simple
Install the needed R modules by entering R (type R on the command line) and then enter the following three commands (follow the instructions on the screen):
install.packages(c("clValid", "session", "amap", "pvclust", "snow", "optparse", "XML"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("ctc", "DESeq"))
Download samtools, go to the download location and do
tar xjf samtools-1.1.tar.bz2
cd samtools-1.1
make -j10 prefix=$HOME install
Download bedtools, go to the download location and do
tar xzf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0/
make -j 10
cp bin/* $HOME/bin
Download featureCounts (subread), go to the download location and do
tar xzf subread-1.4.6-p3-Linux-x86_64.tar.gz
cd subread-1.4.6-p3-Linux-x86_64
cp bin/featureCounts $HOME/bin
Download bedGraphToBigWig for your operating system, go to the download location and do
cp bedGraphToBigWig $HOME/bin
chmod 755 $HOME/bin/bedGraphToBigWig
Install pysam the pip python installer: (obsolete, only required for v0.01)
pip install pysam
Download htseq-count, go to the download location and do (obsolete, only required for v0.01)
tar zxf HTSeq-0.6.1.tar.gz 
cd HTSeq-0.6.1/
python setup.py install --prefix=$HOME

Genome annotations


The pipeline also requires the genomic coordinates of RNA and loci annotation in the reference genome. This is organized in two separate files containing
1. annotations/rna.<assembly>.bed.format.gz
    genomic coordinates of non-coding RNA in the genome assembly.
2. annotations/loci.<assembly>.bed.format.gz
    genomic coordinates of loci (exon, intron, UTRs) in the genome assembly.
Currently, the annotation for four assemblies: hg19, hg38, mm9 and mm10 is available with the download. The genomic coordinates of the respective annotations (RNA or loci) are based on the Ensembl annotations downloaded from ftp://ftp.ensembl.org/pub/
1. hg19: Ensembl release 75
2. hg38: Ensembl release 78
3. mm9: Ensembl release 67
4. mm10: Ensembl release 78
To build the two annotation files for a new genome assembly, we provide a script ensembl2annotation that takes annotation files from Ensembl as input. Example usage to build RNA and loci files:
1. create a directory within annotations with a unique assembly identifier like mm10
2. Download Ensembl ncRNA annotation like
   ftp://ftp.ensembl.org/pub//release-78/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
3. ensembl2annotation -i Mus_musculus.GRCm38.ncrna.fa.gz -r 78 | gzip > rna.mm10.bed.format.gz
4. Download Ensembl gene annotation like
   ftp://ftp.ensembl.org/pub//release-78//gtf/mus_musculus/Mus_musculus.GRCm38.78.gtf.gz
5. ensembl2annotation -i Mus_musculus.GRCm38.78.gtf.gz -r 78 -l | gzip > loci.mm10.bed.format.gz
6. Add new file information in rnaProcessingAna script ("initialize genome annotation files" section)

Usage


RPASuite is called with the following parameters
rnaProcessingAna -i <input bam files separated by a comma> [OPTIONS]

Available options are:

Program: rnaProcessingAna (analyze multiple short RNA-seq samples to detect
differential and coherent RNA processing)
Author: RTH, University of Copenhagen, Denmark
Version: 0.02
Contact: sachin@rth.dk
Usage: rnaProcessingAna -i <files> [OPTIONS]
 -i <file>   [read alignment files in BAM format separated by a comma (minimum six)]
[OPTIONS]
 -o <dir>    [output directory (default: rpa)]
 -g <string> [genome assembly (default: hg19)]
             [currently supported: mm9, mm10, hg19 and hg38]
 -r          [biological replicates are present]
 -x          [run in parallel, meaning simultaneous analysis for each chromosome using one processor]
 -h          [help]
[OPTIONS (block group generation)]
 -d <int>    [minimum distance between two block groups (default: 50)]
 -u <int>    [minimum reads in the block group (default: 10)]
 -l <int>    [minimum reads in the block (default: 10)]
 -e <float>  [stddev scale for mapped reads (default: 0.5)]
 -t <string> [minimum block reads threshold (abs or rel, default: rel)]
 -q <int>    [maximum length of the block groups (default: 500)]
[OPTIONS (rna processing analysis)]
 -p <int>    [fraction of samples in which block group should be observed (default: 1)]
             [1: all samples]
 -c <float>  [cluster score threshold (default: 0.15)]
 -a <float>  [p-value threshold for fisher's exact test (default: 0.05)]
 -n          [create image file for differentially and coherently processed loci in pdf format]
             [this is time consuming and takes considerable hard disk space]

Example


An usage example of RPASuite is shown below. As input, the pipeline requires mapped reads in BAM format. Example dataset files are provided with the download
rnaProcessingAna -i\
data/test_data/BloodGm12878Chr6.bam,data/test_data/BrainSknshraChr6.bam,\
data/test_data/BreastMcf7Chr6.bam,data/test_data/CervixHelas3Chr6.bam,\
data/test_data/EpitheliumA549Chr6.bam,data/test_data/EscH1hescChr6.bam\
 -o data/test_run &>data/run.log

Input


As input, the pipeline requires mapped reads in BAM format. The name of the input files should be formatted as
Input file name: <unique id>.bam (example: BloodGm12878.bam)
If biological or technical replicates (atmost two) of a sample are present, the replicate information should be provided in the file name as
Input file name (replicate 1): <unique id><Rep1>.bam (example: BloodGm12878Rep1.bam)
Input file name (replicate 2): <unique id><Rep2>.bam (example: BloodGm12878Rep2.bam)
Note: The chromosome identifier in the input BAM files should start with chr, for example as chrY and not like Y.
Note: <unique id> should contain only alphanumeric characters.

Output


The results from the RPASuite are compiled in two text files:
a) RESULTS_DPL.TXT: It contains a list of differentially processed loci (DPL) in tab delimited format. Each line contains information for a differentially processed locus, organized into following columns
 1.  chr: chromosome
 2.  start: start coordinate
 3.  end: end coordinate
 4.  id: unique id
 5.  clusterScore: measures how well separated read profiles within a cluster 
                   are to rest of the profiles
 6.  strand: genomic strand
 7.  fScore: number of read profile clusters that showed significant p-value
             computed using Fisher's exact test
 8.  dbaScore: mean all vs all alignment score of read profiles
 9.  annotation: genomic annotation
 10. loci: genomic loci
 11. blocks: number of read blocks per read profile
 12. meanEntropy: mean randomness in the arrangement of reads within the read profiles
 13. meanExpr: mean normalized expression of reads profiles
 14. perObs: fraction of samples in which a read profile is oberved (1 means all)
 15. clusters: number of read profile clusters computed using pvclust at p-value < 0.05
 16. clusterInfo: comma separated list of clusters to which each read profile belongs 
A locus is defined as differentially processed, if its clusterScore ≥ 0.15 and fScore ≥ 1 (default)

b) RESULTS_CPL.TXT: It contains a list of coherently processed loci (CPL) in tab delimited format. Each line contains information for a coherently processed locus, organized in the same 16 columns as for DPL listed above. A locus is defined as coherently processed, if its clusterScore < 0.15 and dbaScore > 0.8 (default)

For easy access, the html version of the two result files (RESULTS_DPL.HTML and RESULTS_CPL.HTML) are also provided within the output directory

Contact


For queries, please contact sachin@rth.dk or gorodkin@rth.dk

Citation


Pundhir S and Gorodkin J. (2015) Differential and coherent processing patterns from small RNAs. Sci Rep. 5:12062. [PMID 26166713]

License


RPASuite: a computational pipeline to analyze the processing pattern of RNA
Copyright (C) 2014 Sachin Pundhir (sachin@rth.dk)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.