Mini Workshop: From mutations in RNA structure to 3D motif detection

2014-06-30: Mini workshop. University of Copenhagen. From 13.00 to 14.30, Grønnegårdsvej 3; 2nd floor library. Speakers: Jerome Waldispuhl and Craig Zirbel.

13.00-13.45: Algorithms for exploring the RNA mutational landscape
Jerome Waldispuhl
School of Computer Science
McGill University, Montreal, Canada

Understanding the relationship between RNA sequences and structures is essential to decipher evolutionary processes, predict deleterious mutations and design synthetic molecules.
In this talk, we present a complete computational framework for exploring RNA sequence-structure maps in polynomial time and space.
Using statistical mechanics and weighted sampling techniques, we explore regions of the mutational landscape that preserve the nucleotide composition and show how the GC-content influences the evolutionarily accessible structural ensemble. Then, we illustrate the versatility of our techniques by applying them to (i) designing RNA sequences that fold into target secondary structures, and (ii) correcting sequencing errors in structured RNA sequences, in linear time and space.

13.45-14.30: Inference of recurrent RNA 3D motifs from sequence
Craig L. Zirbel
Department of Mathematics and Statistics
Bowling Green State University

Correct prediction of RNA 3D structure from sequence is a major challenge in biophysics. An important sub-goal is to infer the presence of recurrent internal and hairpin "loops" from the sequences of structured RNAs, given a correct 2D structure. Different sequences can form the same 3D motif, and the same RNA motif can occur in different contexts, unrelated by homology.  We have established a pipeline to automatically extract all hairpin and internal loops from a non-redundant (NR) set of RNA 3D structures from the PDB and organize them into releases of the RNA 3D Motif Atlas.  Release 1.13 contains 276 internal loop motif groups and 253 hairpin loop motif groups.  We have verified that most of these groups are homogeneous in 3D, and thus matching a sequence to a motif group and providing a nucleotide-level alignment amounts to a 3D prediction.

Matching sequences to motif groups can be done by calculating edit distance to known instances or by scoring with probabilistic models.  For the latter, for each motif group, we construct a probabilistic model for sequence variability based on a hybrid Stochastic Context-Free Grammar/Markov Random Field (SCFG/MRF) method we describe. To parameterize each model, we use all instances of the motif found in the NR dataset and knowledge of RNA nucleotide interactions, especially isosteric basepairs and their substitution patterns. SCFG techniques account for nested pairs and insertions, while MRF ideas handle local non-nested interactions, including base triples. We generate and score random sequences to calculate percentile rankings of the alignment scores for each model.
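As a rough illustration of the simpler of the two matching strategies, the following minimal Python sketch assigns a loop sequence to the motif group containing its nearest known instance by edit distance, and then ranks that match against random sequences of the same length. It is not the JAR3D implementation: the group names and sequences are made up for the example, only single-strand (hairpin-style) sequences are handled, and edit distance stands in for the SCFG/MRF alignment score when computing the percentile.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def min_distance(query: str, instances: list[str]) -> int:
    """Smallest edit distance from the query to any known instance."""
    return min(edit_distance(query, s) for s in instances)


def best_group(query: str, groups: dict[str, list[str]]) -> tuple[str, int]:
    """Return (group name, distance) for the group with the closest instance."""
    scored = ((name, min_distance(query, seqs)) for name, seqs in groups.items())
    return min(scored, key=lambda pair: pair[1])


def percentile_vs_random(query: str, instances: list[str],
                         n_random: int = 1000, seed: int = 0) -> float:
    """Percentage of random same-length sequences that match worse than the query.

    Assumption: uniform random sequences over A, C, G, U; the abstract does not
    say how its random sequences are generated.
    """
    rng = random.Random(seed)
    d_query = min_distance(query, instances)
    worse = 0
    for _ in range(n_random):
        r = "".join(rng.choice("ACGU") for _ in range(len(query)))
        if min_distance(r, instances) > d_query:
            worse += 1
    return 100.0 * worse / n_random


# Hypothetical motif groups (illustrative names, not Motif Atlas identifiers).
GROUPS = {
    "GNRA_like": ["CGAAAG", "CGCAAG", "GGAGAC"],
    "UNCG_like": ["CUUCGG", "CUACGG"],
}

if __name__ == "__main__":
    query = "CGAGAG"
    name, dist = best_group(query, GROUPS)
    pct = percentile_vs_random(query, GROUPS[name])
    print(name, dist, pct)
```

A high percentile here means that few random sequences lie as close to the group as the query does, which is the same idea the percentile rankings of alignment scores express for the probabilistic models.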

Given the sequence of an internal or hairpin loop from a secondary structure as input, we can match to known motifs using alignment score, edit distance to known sequence variants, or a combination of these factors.  Internal diagnostics demonstrate that the SCFG/MRF models are sufficiently distinct in sequence space that they assign individual sequences from 3D motif instances to the correct model most of the time.  A study of the match rate on test sets of randomly generated sequences indicates that we can control the false positive rate.  Validation on sequence variants from multiple sequence alignments shows that we can match novel sequence variants to the presumed 3D geometry at a healthy rate.  Inputting multiple sequence variants of the same motif improves the accuracy of the identification.

While the current NR dataset already includes a wide range of recurrent motifs, new motifs are steadily appearing as new structures are solved. We have therefore created a pipeline to automatically identify new motifs in new structures as they are released by the PDB and to update JAR3D and its data files.