:: RPASuite ::

RPASuite: a computational pipeline to analyze the processing pattern of RNA

RPASuite (RNA Processing Analysis Suite) is a computational pipeline to identify differentially and coherently processed transcripts using RNA-seq data obtained from multiple tissue or cell lines.

Differentially Processed Loci (DPL): genomic loci which encode for transcripts that undergoes different post-transcriptional processing for a subset of tissues or cell lines in comparison to the rest, example miRNA arm switching.
Coherently Processed Loci (CPL): genomic loci which encode for transcripts that undergoes similar post-transcriptonal processing for all tissues or cell lines under study, example miRNA processed consistently from the same pre-miRNA strand.

Download

RPASuite v0.02 release (Jul 10, 2015)
1. analysis can be run in parallel, meaning simulataneous run for each chromosome (-x parameter)
2. now uses featureCount in place of htseq-count leading to reduced computational time
3. bug fix that was sometimes leading to inconsistent bourndaries of blocks within block group
RPASuite v0.01 release (Jan 29, 2015)

Programs and datasets

The directory 'RPASuite_v0.02' contains following programs

1.  rnaProcessingAna: main program to run the pipeline (requires bash environment)
2.  getBG:  program to define block groups or read profiles using blockbuster
3.  bam2bed.pl:  program to convert bam into bed format
4.  blockbuster.x:  program to group closely spaced reads into read profiles
5.  flagCluster.pl:  program to flag read profiles with their genomic annotation
6.  validateBlockbuster.pl:  program to reformat the unique id of read profiles
7.  compBlockStat.pl:  program to compute variuos properties of read profiles such as entropy
8.  indexBed.sh:  program to index the mapped reads in bed format for fast access
9.  rnaExpAna.pl:  program to analyze the reads profiles for differential and coherent processing
10. estimateSizeFactor.pl:  program to perform RLE normalization of read profile expression
11. organize2bed.pl:  program to organize results from multiple files into a single file
12. plotDiffProcessing.pl:  program for graphical visualization of results
13. deepBlockAlign.x: program to align two read profiles

Installation

To install RPASuite, download RPASuite.tar.gz and unpack it. A directory, RPASuite will be created

1. tar -zxvf RPASuite.tar.gz

Now compile and create executables of deepBlockAlign and blockbuster

2. make or make all

Export environment variable 'RPAPATH' containing path to RPASuite installation directory

3. export RPAPATH=<path to RPAsuite installation directory>

Add 'RPAPATH' to your 'PATH' environment variable

4. export PATH=$PATH:$RPAPATH/bin

Add 'RPAPATH' to your 'PERL5LIB' environment variable

5. export PERL5LIB=$PERL5LIB:$RPAPATH/share/perl/

Setup a local python path (obsolete, only required for v0.01)

6. export PYTHONPATH=$HOME/lib/python2.7/site-packages

To permanently add or update the environment variable(s), add above four export commands in your ~/.bashrc file

Dependency

We assume that the following softwares are intalled and working: perl, python, R, latex, and gcc. As a minimum you would need to do the following on a ubuntu system

sudo apt-get install build-essential python2.7-dev python-numpy\
python-matplotlib libxml2-dev libgd-perl texlive

Install the needed perl modules

sudo cpan List::Util Tie::IxHash Statistics::Descriptive\
Statistics::RankCorrelation List::Compare GD::Simple

Install the needed R modules by entering R (type R on the command line) and then enter the following three commands (follow the instructions on the screen):

install.packages(c("clValid", "session", "amap", "pvclust", "snow", "optparse", "XML"))
source("http://bioconductor.org/biocLite.R")
biocLite(c("ctc", "DESeq"))

Download samtools, go to the download location and do

tar xjf samtools-1.1.tar.bz2
cd samtools-1.1
make -j10 prefix=$HOME install

Download bedtools, go to the download location and do

tar xzf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0/
make -j 10
cp bin/* $HOME/bin

Download featureCounts (subread), go to the download location and do

tar xzf subread-1.4.6-p3-Linux-x86_64.tar.gz
cd subread-1.4.6-p3-Linux-x86_64
cp bin/featureCounts $HOME/bin

Download bedGraphToBigWig for your operating system, go to the download location and do

cp bedGraphToBigWig $HOME/bin
chmod 755 $HOME/bin/bedGraphToBigWig

Install pysam the pip python installer: (obsolete, only required for v0.01)

pip install pysam

Download htseq-count, go to the download location and do (obsolete, only required for v0.01)

tar zxf HTSeq-0.6.1.tar.gz 
cd HTSeq-0.6.1/
python setup.py install --prefix=$HOME

Genome annotations

The pipeline also requires the genomic coordinates of RNA and loci annotation in the reference genome. This is organized in two separate files containing

1. annotations/rna.<assembly>.bed.format.gz
    genomic coordinates of non-coding RNA in the genome assembly.
2. annotations/loci.<assembly>.bed.format.gz
    genomic coordinates of loci (exon, intron, UTRs) in the genome assembly.

Currently, the annotation for four assemblies: hg19, hg38, mm9 and mm10 is available with the download. The genomic coordinates of the respective annotations (RNA or loci) are based on the Ensembl annotations downloaded from ftp://ftp.ensembl.org/pub/

1. hg19: Ensembl release 75
2. hg38: Ensembl release 78
3. mm9: Ensembl release 67
4. mm10: Ensembl release 78

To build the two annotation files for a new genome assembly, we provide a script ensembl2annotation that takes annotation files from Ensembl as input. Example usage to build RNA and loci files:

1. create a directory within annotations with a unique assembly identifier like mm10
2. Download Ensembl ncRNA annotation like
   ftp://ftp.ensembl.org/pub//release-78/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
3. ensembl2annotation -i Mus_musculus.GRCm38.ncrna.fa.gz -r 78 | gzip > rna.mm10.bed.format.gz
4. Download Ensembl gene annotation like
   ftp://ftp.ensembl.org/pub//release-78//gtf/mus_musculus/Mus_musculus.GRCm38.78.gtf.gz
5. ensembl2annotation -i Mus_musculus.GRCm38.78.gtf.gz -r 78 -l | gzip > loci.mm10.bed.format.gz
6. Add new file information in rnaProcessingAna script ("initialize genome annotation files" section)

Usage

RPASuite is called with the following parameters

rnaProcessingAna -i <input bam files separated by a comma> [OPTIONS]

Available options are:

Program: rnaProcessingAna (analyze multiple short RNA-seq samples to detect
differential and coherent RNA processing)
Author: RTH, University of Copenhagen, Denmark
Version: 0.02
Contact: sachin@rth.dk
Usage: rnaProcessingAna -i <files> [OPTIONS]
 -i <file>   [read alignment files in BAM format separated by a comma (minimum six)]
[OPTIONS]
 -o <dir>    [output directory (default: rpa)]
 -g <string> [genome assembly (default: hg19)]
             [currently supported: mm9, mm10, hg19 and hg38]
 -r          [biological replicates are present]
 -x          [run in parallel, meaning simultaneous analysis for each chromosome using one processor]
 -h          [help]
[OPTIONS (block group generation)]
 -d <int>    [minimum distance between two block groups (default: 50)]
 -u <int>    [minimum reads in the block group (default: 10)]
 -l <int>    [minimum reads in the block (default: 10)]
 -e <float>  [stddev scale for mapped reads (default: 0.5)]
 -t <string> [minimum block reads threshold (abs or rel, default: rel)]
 -q <int>    [maximum length of the block groups (default: 500)]
[OPTIONS (rna processing analysis)]
 -p <int>    [fraction of samples in which block group should be observed (default: 1)]
             [1: all samples]
 -c <float>  [cluster score threshold (default: 0.15)]
 -a <float>  [p-value threshold for fisher's exact test (default: 0.05)]
 -n          [create image file for differentially and coherently processed loci in pdf format]
             [this is time consuming and takes considerable hard disk space]

Example

An usage example of RPASuite is shown below. As input, the pipeline requires mapped reads in BAM format. Example dataset files are provided with the download

rnaProcessingAna -i\
data/test_data/BloodGm12878Chr6.bam,data/test_data/BrainSknshraChr6.bam,\
data/test_data/BreastMcf7Chr6.bam,data/test_data/CervixHelas3Chr6.bam,\
data/test_data/EpitheliumA549Chr6.bam,data/test_data/EscH1hescChr6.bam\
 -o data/test_run &>data/run.log

Input

As input, the pipeline requires mapped reads in BAM format. The name of the input files should be formatted as

Input file name: <unique id>.bam (example: BloodGm12878.bam)

If biological or technical replicates (atmost two) of a sample are present, the replicate information should be provided in the file name as

Input file name (replicate 1): <unique id><Rep1>.bam (example: BloodGm12878Rep1.bam)
Input file name (replicate 2): <unique id><Rep2>.bam (example: BloodGm12878Rep2.bam)

Note: The chromosome identifier in the input BAM files should start with chr, for example as chrY and not like Y.
Note: <unique id> should contain only alphanumeric characters.

Output

The results from the RPASuite are compiled in two text files:
a) RESULTS_DPL.TXT: It contains a list of differentially processed loci (DPL) in tab delimited format. Each line contains information for a differentially processed locus, organized into following columns

 1.  chr: chromosome
 2.  start: start coordinate
 3.  end: end coordinate
 4.  id: unique id
 5.  clusterScore: measures how well separated read profiles within a cluster 
                   are to rest of the profiles
 6.  strand: genomic strand
 7.  fScore: number of read profile clusters that showed significant p-value
             computed using Fisher's exact test
 8.  dbaScore: mean all vs all alignment score of read profiles
 9.  annotation: genomic annotation
 10. loci: genomic loci
 11. blocks: number of read blocks per read profile
 12. meanEntropy: mean randomness in the arrangement of reads within the read profiles
 13. meanExpr: mean normalized expression of reads profiles
 14. perObs: fraction of samples in which a read profile is oberved (1 means all)
 15. clusters: number of read profile clusters computed using pvclust at p-value < 0.05
 16. clusterInfo: comma separated list of clusters to which each read profile belongs

A locus is defined as differentially processed, if its clusterScore ≥ 0.15 and fScore ≥ 1 (default)

b) RESULTS_CPL.TXT: It contains a list of coherently processed loci (CPL) in tab delimited format. Each line contains information for a coherently processed locus, organized in the same 16 columns as for DPL listed above. A locus is defined as coherently processed, if its clusterScore < 0.15 and dbaScore > 0.8 (default)

For easy access, the html version of the two result files (RESULTS_DPL.HTML and RESULTS_CPL.HTML) are also provided within the output directory

RPASuite: a computational pipeline to analyze the processing pattern of RNA
Copyright (C) 2014 Sachin Pundhir (sachin@rth.dk)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.