These are the datasets used in the following paper:

Elfar Torarinsson, Jakob H. Havgaard and Jan Gorodkin
Multiple structural alignment and clustering of RNA sequences



There are 5 different datasets:

  -global_cmfinder - This is a global dataset with 17 of the families
                     from the CMfinder paper

	-The .fa files are the fasta files
	-The .tab files are tab-delimited files containing the id, the
	 sequence and the secondary structure as annotated in Rfam

  -global_comprehensive - A more comprehensive set with 3 families

	-Each family contains a .fa and a .tab file along with several
	 folders. These folders are named X_Y, where X is the number
	 of sequences per file and Y is the upper limit of the
	 sequence similarity range (the lower limit is Y-20). Each of
	 these folders contains 20 files for the given X and Y values
	 (when it was possible to generate that many).

  -global_clustering - A test set for global clustering

	-The folder contains 10 datasets where 5-9 sequences are
	 randomly chosen from each of the 19 families in the CMfinder
	 dataset. In addition there is one set where we shuffled 3 of
	 the families and added them to set number 10.

  -local_pairwise_clustering - Output from several pairwise Foldalign runs

	-This set contains the file set_4.fa, which holds the motifs,
	 and their flanking 30 nts, found locally by pairwise
	 Foldalign when scanning 31 pairs of sequences which were
	 500 nts long. The .tab files contain the manually curated
	 Rfam structures for these 62 (plus more) ncRNAs.

  -local_clustering - 20 sequences from 3 families in their 500 nt
   long genomic context

	-The .fa file is the fasta file for these 20 sequences and the
	 .tab file contains the manually curated Rfam structures for
	 these 20 (plus more) sequences.
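
The file layouts described above can be read with a few lines of
Python. This is only a minimal sketch and not part of the datasets: it
assumes that the columns of a .tab file appear in the order id,
sequence, structure, that there is no header line, and that X and Y
in the X_Y folder names are plain integers. The names TabRecord,
parse_tab_lines and parse_folder_name are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class TabRecord(NamedTuple):
    """One line of a .tab file (column order is an assumption)."""
    seq_id: str
    sequence: str
    structure: str


def parse_tab_lines(lines: Iterable[str]) -> Iterator[TabRecord]:
    """Yield one record per non-empty, tab-delimited line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError("expected at least 3 tab-separated fields: %r" % line)
        # Assumed order: id, sequence, secondary structure.
        yield TabRecord(fields[0], fields[1], fields[2])


def parse_folder_name(name: str) -> Tuple[int, int, int]:
    """Split an X_Y folder name into (sequences per file, lower, upper).

    Y is the upper limit of the similarity range; the lower limit is
    20 below it, as described for global_comprehensive.
    """
    x, y = name.split("_")
    n_seqs = int(x)
    upper = int(y)
    return n_seqs, upper - 20, upper
```

An open file can be passed directly, for example
parse_tab_lines(open(path)) where path points to one of the .tab
files; the function accepts any iterable of lines.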



If you have any questions or comments please send them to elfar7@gmail.com
