Pcluster depends on programs in the Pfold and RNAdbtools packages. The
Pfold and RNAdbtools executables must be in your path for Pcluster to
work.
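
A quick way to confirm the prerequisites is to ask the shell whether
each program can be found. The sketch below is not part of the package:
the function name check_in_path is made up for illustration, and the
example only checks tcsh and awk (which the Pcluster scripts use). Add
the names of the Pfold and RNAdbtools programs installed on your system.

```shell
# Hypothetical helper (not part of Pcluster): report which of the given
# program names can and cannot be found in PATH.
check_in_path() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# tcsh and awk are needed by the Pcluster scripts; add the Pfold and
# RNAdbtools program names used by your installation.
check_in_path tcsh awk || echo "Some programs are not in your PATH"
```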

By default the paths are set to ${SARSE_HOME}, which makes Pcluster work
from SARSE. To use Pcluster outside SARSE, set the ${SARSE_HOME}
variable in .bashrc or .bash_profile with
"export SARSE_HOME=(PATH TO SARSE)".
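
As a sketch, the lines added to .bashrc could look like the following.
The directory /opt/sarse is only a placeholder for wherever SARSE is
installed, and the directory check is an optional addition that is not
required by Pcluster.

```shell
# Placeholder location: replace /opt/sarse with the directory where SARSE
# is installed on your system.
SARSE_HOME=/opt/sarse
export SARSE_HOME

# Optional sanity check, so that a wrong path is noticed early.
if [ -d "$SARSE_HOME" ]; then
    echo "SARSE_HOME is set to $SARSE_HOME"
else
    echo "Warning: $SARSE_HOME does not exist" >&2
fi
```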

Run make in the src directory to build the C++ programs.

This package has the following files:

cluster.tcsh         Main clustering algorithm
makeplots.tcsh       Makes a series of dotplots of the resulting groups
makestruc.tcsh       Makes Pfold structure predictions with
                     parentheses based on each group
findbest             Gives the 'best' clustering (the program is used
                     by cluster.tcsh)
findmoves            Helper program for cluster.tcsh
findstate            Helper program for cluster.tcsh
runscfg.tcsh         Runs Pfold and gives a score
runscfg_pp.tcsh      Runs Pfold to make dotplot
runscfg_struc.tcsh   Runs Pfold to make structures
U1.fasta             Example file from RFAM.

The programs are either in tcsh (with a little awk in between) or
C++.

COMMENTS TO INDIVIDUAL PROGRAMS (examples follow at the bottom):

cluster.tcsh:

  This program clusters sequences into groups with similar secondary
  structure:

    - First argument is the fasta file to analyze

    - Optional second argument is a position range

  The output is a number of files (XXX is a base name formed from the
  first argument):

   - "XXX_groups_???.txt" has all clusterings from individual
     sequences to one large group. Each clustering takes up one file,
     with a group on each line with the score (for each sequence) next
     to it. '???' is the number of groups in the file.

   - "XXX_groups_max.txt" has the grouping with the highest score.

   - "XXX_groups_best.txt" has a grouping with a slightly lower score,
     but fewer groups.

   - "XXX_score.dat" is a data file for plotting of scores as a
     function of number of groups


makeplots.tcsh:

  Makes postscript dotplots from a group file:

    - First argument is the fasta file
    - Second argument is the group file to plot
    - Optional third argument is the position range

  The output is one or more postscript files with pairwise dotplots of
  the largest group against the others.


makestruc.tcsh:

  Makes a fasta structure file from a group file:

    - First argument is the fasta file
    - Second argument is the group file to make structures from
    - Optional third argument is the position range

  The output (to stdout) is the fasta structure file.


findbest:

  Finds the 'best' clustering from an 'XXX_score.dat' file:

    - First argument is the 'XXX_score.dat' file
    - Second argument is an adjustment factor, default is one

  The adjustment factor, a, adjusts the output from the maximum scoring
  clustering (a = 0) to the clustering with all sequences in one group
  (a = infinity).



EXAMPLES

Go to examples_pcluster to do the following analyses.

Assume that we want to analyse the fasta file 'U1.fasta'. A first step
is to cluster it:

  cluster.tcsh U1.fasta

This gives a bunch of files:

  U1_groups_???.txt   These have all the clusterings
  U1_groups_max.txt   This has the maximum scoring clustering
  U1_groups_best.txt  This has the 'best' clustering
  U1_score.dat        The list of scores for the clusterings

We can now make dotplots of the results:

  makeplots.tcsh U1.fasta U1_groups_best.txt

This gives two postscript files, named as follows:

  U1_groups_best_plot_XX_vs_YY.ps

Here XX and YY refer to the group numbers in the file
'U1_groups_best.txt' (the group number refers to the line number in
the 'groups' file).
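
Because group N is simply line N of the groups file, a single group can
be looked up with standard tools. The helper below is a minimal sketch
and not part of the package; the name show_group is made up for
illustration.

```shell
# Hypothetical helper (not part of Pcluster): print group number $2 from
# groups file $1, relying on group N being line N of that file.
show_group() {
    sed -n "${2}p" "$1"
}

# Example: show groups 1 and 2 of the 'best' clustering, if it exists.
if [ -f U1_groups_best.txt ]; then
    show_group U1_groups_best.txt 1
    show_group U1_groups_best.txt 2
fi
```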

We can look at individual structures using:

  makestruc.tcsh U1.fasta U1_groups_best.txt > tmp.fasta

This gives a fasta file (tmp.fasta) with the structures predicted from
each group independently.

We can plot the scores using gnuplot:

  plot 'U1_score.dat'   [to be typed in gnuplot]


We can also look at the clustering with only two groups:

  makeplots.tcsh U1.fasta U1_groups_002.txt



From the dotplots we see that the region from position 155 to 186 is
behaving a little strangely. Let us focus on that region. The original
clusterings were on the whole sequence level, so we are better off
making a new clustering based only on the region in question:

  cluster.tcsh U1.fasta 155-186

This clusters the sequences only by their structures in this
region. The whole sequence is still included when making the tree used
by Pfold (the user does not have to worry about trees).

We can make dotplots of the results:

  makeplots.tcsh U1.fasta U1_155-186_groups_best.txt

This makes dotplots of the whole sequences, while the clustering was
only based on positions 155-186. We can also do dotplots for the
region we are looking at:

  makeplots.tcsh U1.fasta U1_155-186_groups_best.txt 155-186

This makes them a lot easier to look at.

Other regions that could be interesting for U1: "1-30,131-160" and
"46-105".

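When several regions are of interest, the clustering commands can be
generated in one go. The sketch below only prints the commands so they
can be reviewed before anything is run; the function name
print_region_commands is made up for illustration, and it assumes that
a comma-separated range such as 1-30,131-160 is passed as a single
argument in the same way as 155-186.

```shell
# Hypothetical helper (not part of Pcluster): print, rather than run,
# one cluster.tcsh command per region for the given fasta file.
print_region_commands() {
    fasta=$1
    shift
    for region in "$@"; do
        echo "cluster.tcsh $fasta $region"
    done
}

# The regions mentioned above for U1.
print_region_commands U1.fasta 155-186 1-30,131-160 46-105
```

To actually run the clusterings, pipe the output to sh from a directory
where cluster.tcsh is in the path.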