Center for non-coding RNA in Technology and Health

Protein Sequence Logos using Relative Entropy

The sequence part of the RNA structure logos is applied here to protein sequence logos. The standard sequence logo of Schneider and Stephens has been extended to handle any prior amino acid distribution and to allow for gaps in the (multiple) alignments of protein sequences. The total height of the sequence information part is computed as the relative entropy between the observed fraction of each symbol and its a priori probability, with the constraint that the a priori ``probability'' of the gap is always one. The a priori probabilities for the amino acids sum to one. Note that this can lead to negative ``information'' if sufficiently many gaps are present at a given position. The height of each symbol can be displayed in two ways: a ``type 1 logo'', where the height is proportional to the symbol's observed frequency, or a ``type 2 logo'', where the height is proportional to the ratio of the observed frequency to the expected (a priori) frequency. In both cases, a symbol that appears less often than expected is displayed upside down. You can get the script here or you can ``click in'' your alignment below. If you use this tool, please cite:

J. Gorodkin, L. J. Heyer, S. Brunak and G. D. Stormo. Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci., Vol. 13, No. 6, pp. 583-586, 1997.

T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, Vol. 18, No. 20, pp. 6097-6100, 1990. (Also check out Tom Schneider's page.)
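The height calculation described above can be illustrated with a minimal sketch. It is not the server's script: the function names, the gap character `-`, the use of base-2 logarithms (bits), and the choice to leave gaps out of the displayed symbol stack are all assumptions made for illustration. It also assumes that every observed amino acid has a strictly positive prior.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap character; the page does not name one


def column_information(column, prior):
    """Relative entropy (in bits, base 2 assumed) of one alignment column.

    `column` is a string of symbols, one per sequence. `prior` maps each
    amino acid to its a priori probability. The gap's a priori
    "probability" is fixed at one, as the text describes.
    """
    n = len(column)
    info = 0.0
    for symbol, count in Counter(column).items():
        q = count / n  # observed fraction of this symbol
        p = 1.0 if symbol == GAP else prior[symbol]
        info += q * math.log2(q / p)
    return info


def symbol_shares(column, prior, logo_type=1):
    """Share of the stack given to each amino acid, plus an inversion flag.

    Type 1 weights a symbol by its observed fraction q; type 2 weights it
    by q / p. A symbol is flagged as upside down when q < p.
    Gaps are skipped here (an assumption, not stated in the text).
    """
    n = len(column)
    weights, inverted = {}, {}
    for symbol, count in Counter(column).items():
        if symbol == GAP:
            continue
        q = count / n
        p = prior[symbol]
        weights[symbol] = q if logo_type == 1 else q / p
        inverted[symbol] = q < p
    total = sum(weights.values())
    return {s: (w / total, inverted[s]) for s, w in weights.items()}
```

For example, with a toy two-letter prior `{"A": 0.5, "C": 0.5}`, the column `"AC--"` gives `column_information` equal to -1 bit, which shows how a gap-rich position can produce negative ``information''.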

You can also ``click in'' your multiple protein alignment below. The final logo can then be downloaded in PostScript format. You can see an example of the data format here. You are welcome to send your comments or bug reports to webmaster@rth.dk. The a priori probabilities for the amino acids must be greater than or equal to zero. A single line of probabilities results in the same background distribution being used throughout the alignment. Alternatively, enter as many lines as there are positions in the alignment, corresponding to a position-wise background distribution of amino acids.
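The rule for background lines can be sketched as follows. This is a hypothetical helper, not part of the server: it assumes the lines have already been parsed into lists of numbers, and it makes no assumption about the order of amino acids within a line or about the actual input file format.

```python
def expand_background(rows, n_positions):
    """Return one background distribution per alignment position.

    `rows` is a list of already-parsed probability lists (hypothetical
    input shape). One row is reused for every position; otherwise there
    must be exactly one row per position.
    """
    if len(rows) == 1:
        rows = rows * n_positions  # same distribution at every position
    elif len(rows) != n_positions:
        raise ValueError(
            f"expected 1 or {n_positions} lines of probabilities, got {len(rows)}"
        )
    for i, row in enumerate(rows):
        if any(p < 0 for p in row):
            raise ValueError(f"negative probability in line for position {i + 1}")
    # Copy each row so later edits to one position do not affect the others.
    return [list(row) for row in rows]
```

Calling `expand_background([[0.5, 0.5]], 3)` repeats the single line for all three positions, while passing two lines for a three-position alignment raises a `ValueError`.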