
CMfinder Userguide 
================== 
1.  Getting started 
2.  Look inside CMfinder 
3.  Postprocessing Utility 
4.  Examples
5.  Running time
6.  Example with pscore calculation


1. Getting started
------------------

  For typical use, the following Perl script is sufficient for most
  applications:

    cmfinder.pl [options] seq_file

  The output motifs are named seq_file.motif.*, where the suffix
  indicates the number of stem-loops in the motif and the motif
  order. For example, seq_file.motif.h2.1 is the first double stem-loop
  motif. The corresponding covariance model is named seq_file.cm.h2.1.
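
  The naming scheme above can be handled in a script along these lines.
  This is a minimal sketch, not part of CMfinder: the names MotifName,
  parse_motif_name and cm_name are hypothetical, and only names of the
  form seq_file.motif.h<N>.<M> are recognized (anything else yields None).

```python
import re
from typing import NamedTuple, Optional


class MotifName(NamedTuple):
    seq_file: str    # name of the input sequence file
    stem_loops: int  # number of stem-loops (the N in "hN")
    order: int       # motif order (1 = first motif)


# Assumed pattern: <seq_file>.motif.h<N>.<M>
_MOTIF_RE = re.compile(r"^(?P<seq>.+)\.motif\.h(?P<h>\d+)\.(?P<n>\d+)$")


def parse_motif_name(name: str) -> Optional[MotifName]:
    """Split a motif file name into its parts, or return None if it does not match."""
    m = _MOTIF_RE.match(name)
    if m is None:
        return None
    return MotifName(m.group("seq"), int(m.group("h")), int(m.group("n")))


def cm_name(motif: MotifName) -> str:
    """Name of the covariance model that corresponds to a motif file."""
    return f"{motif.seq_file}.cm.h{motif.stem_loops}.{motif.order}"


motif = parse_motif_name("seq_file.motif.h2.1")
if motif is not None:
    print(motif.stem_loops, motif.order, cm_name(motif))
```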

  The options and arguments are explained below:

  1.1. Input sequence file (seq_file)
  The input sequence file contains a list of genomic sequences in
  FASTA format.


  1.2. Output motif format
  Motifs are stored in Stockholm format
  (http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html),
  with slight changes in the mark-up lines:
  
   #=GS <seqname> DE <start>..<end> <score>
   #=GS <seqname> WT <weight>
  
  to indicate the start/end position, alignment score, and weight of
  the motif.
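
  The two mark-up lines can be read with a short script such as the one
  below. It is a sketch under stated assumptions: fields are taken to be
  whitespace-separated, <start>..<end> is taken to be a single token, and
  the names Instance and parse_gs_lines are hypothetical. The sample
  lines are made-up values, not real CMfinder output.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Instance:
    start: Optional[int] = None
    end: Optional[int] = None
    score: Optional[float] = None
    weight: Optional[float] = None


def parse_gs_lines(lines: Iterable[str]) -> Dict[str, Instance]:
    """Collect the DE (position, score) and WT (weight) mark-up per sequence."""
    instances: Dict[str, Instance] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "#=GS":
            continue  # not a per-sequence mark-up line
        name, tag = fields[1], fields[2]
        inst = instances.setdefault(name, Instance())
        if tag == "DE" and len(fields) >= 5 and ".." in fields[3]:
            start, end = fields[3].split("..", 1)
            inst.start, inst.end = int(start), int(end)
            inst.score = float(fields[4])
        elif tag == "WT":
            inst.weight = float(fields[3])
    return instances


# Made-up sample lines for illustration only.
sample = [
    "# STOCKHOLM 1.0",
    "#=GS seq1 DE 12..58 34.5",
    "#=GS seq1 WT 0.97",
]
print(parse_gs_lines(sample))
```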


  1.3 Running options  

    -s1 <number>     The max number of output single stem-loop motifs. Default 5
    -s2 <number>     The max number of output double stem-loop motifs. Default 5
          For complicated motifs with more than 2 stemloops, we suggest producing simple motifs
          and combining them. The output motifs can overlap with each other, and the postprocessing
          routine can help to choose good, distinct motifs.
        
    -m1 <number>     The minimum length of single stemloop candidates. Default 30
    -M1 <number>     The maximum length of single stemloop candidates. Default 100
    -m2 <number>     The minimum length of double stemloop candidates. Default 40
    -M2 <number>     The maximum length of double stemloop candidates. Default 100
          The lower bound of motif length is 15bp, and the upper bound is 250bp.
          The actual output motifs can be slightly outside the specified range,
          because we do not enforce the range during refinement iteration.

    -c <number>      The maximum number of candidates in each sequence. Default 40. 
          The default value is usually sufficient for sequences with length < 500bp.
          If your sequences are long (> 1K) and nothing interesting turns up,
          try increasing the number of candidates proportionately, but to no more than 100.

    -f <number>      Fraction of sequences containing the motif. Default 0.8
          If only a small fraction of the sequences is expected to actually
          contain the motif, you can set the fraction to a small value (say 0.3).
          Usually, this parameter does not affect the performance if the
          motif is significant, but it does affect the order of the
          output motifs.
   
    Postprocessing Options
    -combine <0|1>   Whether to combine the output motifs or not. Default 1. 
          See 3.1 for details.

    -filter          Filter the low scoring motif instances. Disabled by default
    -filter_w        Filter by motif weights. Only used with -filter        
    -filter_s        Filter by motif scores (CM alignment scores). Only used with -filter
    -filter_th       Filtering threshold. 
          It specifies the lower bound of the motif weights or scores depending on 
          whether -filter_w or -filter_s is used. 
    
    -rank            Rank the output motifs. Disabled by default
          We summarize motifs by a few features (see 3.3 for details), and use
          a heuristic function of these features for ranking.

    -select          Select among the output motifs. Disabled by default
          CMfinder usually outputs a large number of motifs for ncRNAs with
          complicated structures, and many of them overlap and are very similar.
          This procedure selects a subset of distinct motifs with good quality.
          The selected motifs are stored in the "select" subdirectory under the
          current directory.

    -v               Verbose
          If this option is set, CMfinder will save the intermediate results,
          and print running information and commands for the intermediate steps.
	     


2. Look inside CMfinder
-----------------------
  CMfinder has 4 components: candf for searching for candidates, cands for
  comparing candidates, canda for aligning selected candidates, and
  cmfinder for EM refinement.


  2.1 candf
  This program takes a FASTA-format sequence file as input, and outputs a
  list of candidate motif instances for each sequence. A user can specify the
  number of candidates, the minimum and maximum length of the motif, and the
  number of stem-loops in the motif. The energy function computation is based
  on Vienna.


  2.2 cands
  This program performs pairwise candidate comparison using a tree-edit
  algorithm (the Vienna treedist function), then selects at most one candidate
  from each sequence such that the sum of their pairwise distances is
  relatively small. These candidates are then aligned by "canda" to create a
  seed alignment for EM refinement.
  The options for cands are:
  -n:  specifies the multiple sets of candidates to use as seeds for EM.
       It is the same as the "-n" option when using "cmfinder.pl".
  -f:  specifies the expected fraction of sequences that contain the motif
       instances. This option helps to eliminate unlikely candidates from the
       seed alignment. This estimate does not need to be accurate. We recommend
       setting it to a small value (e.g., 0.3) if the number of input sequences
       is large (e.g., > 30).
       It is the same as the "-f" option when using "cmfinder.pl".
  -m:  specifies a "match constraint file" to provide the anchor points for 
       alignment. Only the candidates that are consistent with the anchors are 
       compared. These anchors are computed using BLAST.
 

  2.3 canda
  canda aligns all the candidates in the seed to a consensus candidate to
  create a seed alignment.


  2.4 cmfinder
  cmfinder performs EM refinement of a given alignment. The basic CM operations,
  i.e., CM alignment, model construction, and parameterization, are based on
  Infernal 0.7. The usage of cmfinder is:

      cmfinder [-options] <seqfile input> <cmfile output>

  where cmfile contains the output covariance model. 

  The main options include:
  -a <align file> :  an initial (seed) motif alignment
    The alignment should be in Stockholm format by default (see 1.2 for a
    reference to the format description). In the standard setting, this
    alignment is produced by "canda", but it can be obtained from other
    sources, in which case you can use the --format option to set the
    alignment format. For example, to use Clustal format, pass
    "--format clustal".
    cmfinder does not use the secondary structure annotation of the input
    alignment, only the alignment itself.
  -i <cmfile> : an initial covariance model
    In some cases, you might like to provide a covariance model as a starting 
    point. You must specify either the alignment, or the covariance model as 
    input.
  -o <align file> : the output motif structural alignment in Stockholm format. 
    If no output alignment filename is provided, the alignment will be printed 
    to standard output.
  -c <candidate file>: the candidate file 
    CMfinder was originally designed as a two-phase process: first we only
    iterate among candidates found by "candf", then in the second phase, the 
    CM model is used to scan the whole sequence to identify new candidates. 
    In practice, however, the seed alignment is good enough to scan the 
    sequence directly, so by default, the first phase is skipped, unless the 
    user explicitly provides the candidate file. 

  If cmfinder is interrupted before it finishes, the latest covariance model is
  saved in "lastest.cm" in your working directory. You can use "-i lastest.cm" 
  to set the new starting point to resume cmfinder. 


3. Postprocessing Utility
-------------------------
  cmfinder.pl provides some support for postprocessing.
  You may need to adjust these procedures to fit your own requirements.
	
  3.1 Combining motifs.
  To overcome the drawback of using candidates with simple structure only, we 
  designed heuristics to combine multiple motifs. The usage is the following:

    CombMotif.pl seq_file motif1 motif2 ...

  If all the motif files share a common prefix, you only need to give the
  prefix. CombMotif.pl uses heuristics to determine which motifs need to be
  combined and when. You can also decide manually which motifs to merge,
  and use

    merge_motif.pl seq_file motif1 motif2 merged_file

  followed by cmfinder refinement. This technique can be applied iteratively.


  3.2 Adjusting motif boundaries
  You can adjust motif boundaries by using extend_motif. Usage:
      
    extend_motif [-l num] [-r num] motif_file sequence_file
  
  -l/-r specify the amount of adjustment to the left/right. Positive numbers
  append the flanking regions to the corresponding end, and negative numbers
  remove columns at the corresponding end. If these parameters are
  not set, extend_motif will first try to extend the motif with conserved
  columns in the flanking regions; if that is not successful, it will shrink
  the motif by first removing unstable base pairs at both ends, then removing
  unconserved unpaired columns.

  
  3.3 Summarize motif features
  CMfinder outputs multiple motifs. If every sequence contains a motif 
  instance, the top scoring motifs are usually the good ones as they provide 
  better coverage of the overall RNA structure. However, when the real motif 
  occurs only within a subset of sequences, it is not easy to determine which 
  is the best motif, e.g. some motifs are more conserved, but are only shared 
  by a small number of sequences, while others are less conserved, but more 
  prevalent. While there is no formal quality measure for CMfinder motifs yet, 
  we can collect motif features that are helpful for evaluation. 
  Some useful ones are:
  a. Num = the number of motif instances
     Weight = sum of their weights. 
  b. Len = average length of the motifs.
  c. Score = average alignment score.
  d. BP.Pr = the sum of partition function probabilities for base paired columns. 
  e. BP = the number of base pairs in the consensus secondary structure.
  f. Energy = average folding energy.
  g. Seq_id = the average pairwise similarity
  h. Conserved_pos = the number of columns in all locally conserved blocks. 
     Each block must contain at least 4 columns that are conserved 
     (>70% identical).   	

  The usage of this function is:

    summarize [-w] motif_file

  The output is written to standard output. If the -w option is used,
  the motif instances are weighted when computing the summary statistics.
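
  As an illustration of feature h (Conserved_pos), the sketch below counts
  the columns that fall in conserved blocks of a toy alignment. It is one
  possible reading of the definition, not the code used by summarize: a
  block is assumed to be a run of at least 4 consecutive conserved columns,
  identity is assumed to be the share of all sequences carrying the most
  common residue (so gaps count against it), and the function names are
  hypothetical.

```python
from collections import Counter
from typing import List


def conserved_columns(alignment: List[str], min_identity: float = 0.7) -> List[bool]:
    """Flag each column whose most common residue exceeds min_identity."""
    n_seqs = len(alignment)
    flags = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c not in "-."]
        top = Counter(residues).most_common(1)
        # Identity is measured against all sequences, so gaps lower it.
        flags.append(bool(top) and top[0][1] / n_seqs > min_identity)
    return flags


def conserved_pos(alignment: List[str], min_block: int = 4,
                  min_identity: float = 0.7) -> int:
    """Count columns lying in runs of at least min_block conserved columns."""
    total = run = 0
    # The trailing False closes a run that reaches the last column.
    for flag in conserved_columns(alignment, min_identity) + [False]:
        if flag:
            run += 1
        else:
            if run >= min_block:
                total += run
            run = 0
    return total


# Toy alignment (made-up sequences of equal length).
toy = ["ACGUAGGCA", "ACGUCGGCA", "ACGUGGGCU", "ACGAUGG-C"]
print(conserved_pos(toy))
```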

  3.4 Rank CMfinder motifs. 
  The usage of this function is:

      rank_cmfinder.pl [option] <motif_pattern>  <motif.summary>

  motif_pattern is a Perl regular expression that specifies a set of motifs,
  e.g. "\S+motif\.\S+" (including the quotes) for all files containing "motif.",
  or simply the prefix of all motifs if that is what you want.
  motif.summary is the output file, with columns delimited by commas. The last
  column specifies the ranking score, and the preceding columns the motif
  features described in 3.3.
  This scoring function is not used in the original CMfinder paper, but in the
  pipeline study of Bacteria
  (http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1913097).
  The options are:
   -rank    do ranking
   -weight  weight the motif instances (see the -w option in 3.3)
   -dir     the directory of the motif instances to be ranked
  

  3.5 Select CMfinder motifs  
  The usage of this function is:

      select_motif.pl <motif.summary> [output_directory]

  It selects from the motifs included in motif.summary, which is the output of 3.4.
  Starting from the top-ranking motifs, it only chooses the ones that are considerably
  different from (and not completely contained in) previously selected motifs.
  If no output directory is specified, the current directory is used.
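
  The greedy idea described above can be sketched as follows. This does not
  reproduce select_motif.pl: the similarity measure (fraction of a motif's
  instance positions covered by an already selected motif), the 0.5 cutoff,
  and all names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]   # (start, end), inclusive
Motif = Dict[str, Interval]  # sequence name -> position of the motif instance


def overlap_fraction(a: Motif, b: Motif) -> float:
    """Fraction of a's total instance length that is covered by b's instances."""
    covered = total = 0
    for name, (start, end) in a.items():
        total += end - start + 1
        if name in b:
            lo, hi = max(start, b[name][0]), min(end, b[name][1])
            covered += max(0, hi - lo + 1)
    return covered / total if total else 0.0


def select_motifs(ranked: List[Tuple[str, Motif]],
                  max_overlap: float = 0.5) -> List[str]:
    """Walk the motifs in rank order; keep those not mostly covered by a kept one."""
    selected: List[Tuple[str, Motif]] = []
    for name, motif in ranked:
        if all(overlap_fraction(motif, kept) <= max_overlap for _, kept in selected):
            selected.append((name, motif))
    return [name for name, _ in selected]


# Made-up ranked motifs, best first.
ranked = [
    ("m1", {"s1": (10, 60), "s2": (5, 55)}),
    ("m2", {"s1": (20, 50), "s2": (10, 40)}),  # contained in m1
    ("m3", {"s1": (100, 140)}),                # elsewhere in s1
]
print(select_motifs(ranked))
```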

  3.6 Filter CMfinder motifs
  You can potentially clean the motif alignments by removing low scoring motif instances.
  Usage:
        filter.pl [option] <input_file> <output_file>
  Option:
        -w   Filter based on instance weights.
        -s   Filter based on instance CM scores.
        -t <threshold>   Filtering threshold.
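
  The effect of the threshold can be pictured with the sketch below. It is
  not the filter.pl implementation: the names are hypothetical, the values
  are made up, and treating the threshold as an inclusive lower bound is an
  assumption.

```python
from typing import Dict, NamedTuple


class Inst(NamedTuple):
    weight: float  # instance weight
    score: float   # instance CM score


def filter_instances(instances: Dict[str, Inst], by: str,
                     threshold: float) -> Dict[str, Inst]:
    """Keep instances whose weight (by='w') or score (by='s') is >= threshold."""
    if by not in ("w", "s"):
        raise ValueError("by must be 'w' (weights) or 's' (scores)")
    return {
        name: inst
        for name, inst in instances.items()
        if (inst.weight if by == "w" else inst.score) >= threshold
    }


# Made-up instances for illustration.
instances = {"seq1": Inst(0.97, 34.5), "seq2": Inst(0.12, 41.0)}
print(filter_instances(instances, "w", 0.5))
print(filter_instances(instances, "s", 40.0))
```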

4. Examples
-----------
The following examples demonstrate how to use CMfinder.
To discover short motifs such as glmS in "data/glmS.fasta",
go to the "data" directory, then run
%   cmfinder.pl glmS.fasta
It will produce up to 5 single stemloop motifs and up to 5 double stemloop motifs, e.g.
glmS.fasta.motif.h1.1, glmS.fasta.motif.h2.1.

Use the following command if you want to rank the motifs and select the good ones:
%  cmfinder.pl -rank -select glmS.fasta

It will produce a rank file, glmS.fasta.summary, and place the selected files in the "select" directory.
In this case, the directory contains only one file, glmS.fasta.motif.0. The other motifs are too similar to
this motif or are parts of it.


If you only need to output single stemloop motifs, and do not want to combine motifs, then use
%  cmfinder.pl  -s1 5 -s2 0 -combine 0 glmS.fasta


5. Running time
---------------
CMfinder is practical for searching RNA motifs < 200 bases in < 100 sequences
with length < 1K. In our genome scan experiments, each dataset usually contains
15-30 sequences of length 500 bases. If there are no interesting motifs, it
usually terminates in a few minutes. If there are good complicated motifs, the
combination process can take a considerable amount of time. CMfinder consists of
several steps, each with a different time complexity. The candidate comparison
step is on the order of #sequences^2 * #candidates^2. The EM iteration is on the
order of motif_length^2 * seq_length * #sequences. For inputs such as
"let-7.fasta", it takes a couple of minutes. For "S_box.fasta", however,
it can take more than an hour, depending on the number of output motifs and
the combination process.
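
The growth rates above can be used to compare parameter settings before a
run. The sketch below only evaluates the two stated orders of growth; the
function names are hypothetical, and the results are unitless operation
counts that are meaningful only as ratios, not as time predictions.

```python
def candidate_comparison_cost(n_sequences: int, n_candidates: int) -> int:
    """Order of growth of candidate comparison: #sequences^2 * #candidates^2."""
    return n_sequences ** 2 * n_candidates ** 2


def em_iteration_cost(motif_length: int, seq_length: int, n_sequences: int) -> int:
    """Order of growth of EM iteration: motif_length^2 * seq_length * #sequences."""
    return motif_length ** 2 * seq_length * n_sequences


# Relative effect of raising the number of candidates from 40 to 80
# for 20 sequences: the comparison step grows by this factor.
ratio = candidate_comparison_cost(20, 80) / candidate_comparison_cost(20, 40)
print(ratio)

# Relative effect of doubling the sequence length on EM iteration.
print(em_iteration_cost(100, 1000, 20) / em_iteration_cost(100, 500, 20))
```

Under these formulas, doubling the number of candidates quadruples the
comparison cost, while doubling the sequence length doubles the EM cost.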


6. Example with pscore calculation
----------------------------------
# Go to the "data" directory.
cd data

# Run CMfinder. For each motif, the output is a covariance model and an
# alignment in Stockholm format, plus a summary file.
export CMfinder=<CMFINDER-ROOT-DIR>/bin
export BLAST=<BLAST-DIR>
"$CMfinder"/cmfinder.pl -rank glmS.fasta

# Remove any ".pscore" files in the current directory, calculate the pscore
# for each motif in the current directory, then extract the pscores from the
# output files.
export Models=<CMFINDER-ROOT-DIR>/pscore
rm -f -- *.pscore
for i in *.motif.*; do "$CMfinder"/posterior --partition -t glmS.newick "$i" > "$i.pscore"; done
find . -name "*.pscore" -exec grep -H "Total pair posterior" {} \;
