Colformat

Motivation

Column format files are text-based files for biological sequences. The idea is that they should be easy to work with rather than compact. Another important feature is that many different kinds of information can be contained in the same file. Converting this format into anything else should also be simple.

General description

The data is organized in columns. Each entry (sequence) is arranged in columns together with its various assignments. An entry begins with a header that describes the column organization and gives miscellaneous information about the sequence. The file as a whole (the database) has a main header that can be used to describe overall features. In column format files, everything except the sequence positions is on lines that begin with a semicolon.

The entry information has some compulsory fields. The first line of the entry header must state what type of molecule the entry contains (the "TYPE" field). The following lines state what information is held in each column of the entry. Next comes the entry name, which may be followed by additional information.

A column format file starts with a header containing information about the file. This header is followed by a line with a semicolon and at least 10 equals signs (=). A header could look like this:

; This file contains a database of globin genes
;
; It is located at http://www.xyz.xyz
;
; ========================================================================

The sequence entry headers begin with a "TYPE" field indicating whether the entry is an RNA sequence or a specific entry describing the base pairings of the alignment. That is, the first line in an entry should be of the form

; TYPE <type>
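where <type> is one of "RNA", "DNA", "DNA_blast", or "PROTEIN" (the database example on this page also uses the type "pairingmask" for the entry describing the base pairings). There is no distinction between upper and lower case letters. Lines of the form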
; COL <number> <word>
indicate that the information named <word> is stored in column <number>. Each entry has an "ENTRY" field of the form
; ENTRY <one_word>
Other lines in the header describe miscellaneous features and have the form
; <ONE_TAG> <string>
Header and columns are separated by a line of the type
; ---------
with at least 10 dashes. The column lines have the form
<word(COL 1)> <word(COL 2)> . . . <word(COL N)>
for N columns. Entries are ended by a line of the type
; **********
with at least 10 *'s.
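
The rules above can be turned into a small reader. The following is a minimal sketch in Python, not one of the RNAdbtools programs: the names `Entry` and `parse_col` are hypothetical, tags are folded to upper case for lookup, and a missing final line of asterisks is tolerated so that truncated files can still be read. The sample text is a shortened version of the ct2col output shown on this page, with a closing line of asterisks added.

```python
import re
from dataclasses import dataclass, field

# Separator lines: a semicolon followed by at least 10 identical characters.
FILE_HEADER_END = re.compile(r"^;\s*={10,}\s*$")
ENTRY_HEADER_END = re.compile(r"^;\s*-{10,}\s*$")
ENTRY_END = re.compile(r"^;\s*\*{10,}\s*$")


@dataclass
class Entry:
    """One entry of a column format file (hypothetical container)."""
    entry_type: str = ""                          # value of the TYPE field
    name: str = ""                                # value of the ENTRY field
    columns: dict = field(default_factory=dict)   # column number -> column name
    info: dict = field(default_factory=dict)      # other "; TAG string" lines
    rows: list = field(default_factory=list)      # one list of words per position


def parse_col(text):
    """Return (file_header_lines, entries) for column format text."""
    file_header, entries = [], []
    entry = None
    in_file_header = True
    in_rows = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_file_header:
            # Everything up to the "; ==========" line is the main header.
            if FILE_HEADER_END.match(line):
                in_file_header = False
            else:
                file_header.append(line.lstrip(";").strip())
            continue
        if line.startswith(";"):
            if ENTRY_END.match(line):
                if entry is not None:
                    entries.append(entry)
                entry, in_rows = None, False
                continue
            if ENTRY_HEADER_END.match(line):
                in_rows = True
                continue
            words = line[1:].split(None, 1)
            if not words:
                continue                      # a bare ";" line
            if entry is None:
                entry = Entry()
            tag = words[0].upper()
            value = words[1].strip() if len(words) > 1 else ""
            if tag == "TYPE":
                entry.entry_type = value
            elif tag == "COL":
                number, name = value.split(None, 1)
                entry.columns[int(number)] = name
            elif tag == "ENTRY":
                entry.name = value
            else:
                entry.info[tag] = value
        elif in_rows and entry is not None:
            entry.rows.append(line.split())
    if entry is not None:
        entries.append(entry)   # tolerate a missing final "; **********" line
    return file_header, entries


SAMPLE = """\
; Generated by ct2col
; ========================================================================
; TYPE RNA
; COL 1 label
; COL 2 residue
; COL 3 seqpos
; COL 4 align_bp
; ENTRY mtu
; LENGTH 168
; ----------
N C 1 .
N U 2 .
N U 3 17
; **********
"""

header, entries = parse_col(SAMPLE)
print(header)
print(entries[0].name, entries[0].columns, entries[0].rows)
```

Keeping each row as a plain list of words leaves the interpretation of the columns to the caller, who can look up the meaning of each position in `columns`.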

A description of the column types can be found on the colusage page, along with a listing of which programs from the RNAdbtools page use them. Examples of how to use those programs can be found here.

Database example

An example from the tmRNA database alignment of RNA sequences.
; The tmRNA Database version 043 (January 2001):
; ----------------------------------------------
;
;
; Availability:
; -------------
;
; . . . .
; . . . .
;
; ========================================================================
; TYPE pairingmask
; COL 1 label
; COL 2 residue
; COL 3 alignpos
; ENTRY pairingmask
; ----------
M a 1
M a 2
M a 3
M a 4
M a 5
. . .
. . .
M - 675
M - 676
; **********
; TYPE RNA
; COL 1 label
; COL 2 residue
; COL 3 seqpos
; COL 4 alignpos
; COL 5 align_bp
; ENTRY AQU.AEO.
; ORGANISM Aquifex aeolicus
; ACCESSION AE000657 + AE000749
; WWW-ACCESS http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=6626248&dopt=GenBank + http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=02983975&dopt=GenBank
; LINEAGE (NCBI) CELLULAR ORGANISMS; BACTERIA; AQUIFICALES; AQUIFICACEAE; AQUIFEX
; ----------
N G 1 1 672
N G 2 2 671
N G 3 3 670
N G 4 4 669
N G 5 5 668
N C 6 6 667
N G 7 7 666
N g 8 8 .
N a 9 9 .
G - . 10 .
N a 10 11 .
N a 11 12 .
N g 12 13 .
N g 13 14 .
. . . . .
. . . . .

The first entry describes the pairing mask (stem helix mapping) of the structural alignment (and is obtained using txt2col with the "-m" option). In the second entry, the first column indicates whether the entry (sequence) contains a gap or a nucleotide at a particular alignment position. When TYPE is RNA or DNA, label takes the values "N" (nucleotide) and "G" (gap). "residue" refers to the individual nucleotides.
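
Because every position is one line, converting an entry to another format mostly means picking columns. The sketch below rebuilds the ungapped sequence of the AQU.AEO. entry from the rows shown above and writes it in FASTA style. The function names are hypothetical, and the column numbers are passed in by hand here; in practice they would be read from the entry's COL lines.

```python
def ungapped_sequence(rows, label_col=1, residue_col=2):
    """Join the residues of all rows labelled "N", skipping gap rows ("G")."""
    residues = []
    for row in rows:
        words = row.split()
        if words[label_col - 1].upper() == "N":
            residues.append(words[residue_col - 1])
    return "".join(residues)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with lines of at most `width` letters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# The first rows of the AQU.AEO. entry from the database example.
ROWS = [
    "N G 1 1 672",
    "N G 2 2 671",
    "N G 3 3 670",
    "N G 4 4 669",
    "N G 5 5 668",
    "N C 6 6 667",
    "N G 7 7 666",
    "N g 8 8 .",
    "N a 9 9 .",
    "G - . 10 .",
    "N a 10 11 .",
    "N a 11 12 .",
    "N g 12 13 .",
    "N g 13 14 .",
]

sequence = ungapped_sequence(ROWS)
fasta = to_fasta("AQU.AEO.", sequence)
print(fasta, end="")
```

The gap row at alignment position 10 is dropped, so the 14 rows yield 13 residues.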

Examples of program output in col format

Shown here are examples of column file output from different programs. Only the beginning of each file is shown.

Output from txt2col:

; Generated by txt2col
; ========================================================================
; TYPE RNA
; COL 1 label
; COL 2 residue
; COL 3 seqpos
; COL 4 alignpos
; COL 5 align_bp
; ENTRY Aqu.aeo.
; ----------
N G 1 1 665
N G 2 2 664
N G 3 3 663
N G 4 4 662
N G 5 5 661
N C 6 6 660
N G 7 7 659
N g 8 8 .
N a 9 9 .
G - . 10 .
N a 10 11 .

Output from ct2col:

; Generated by ct2col
; ========================================================================
; TYPE RNA
; COL 1 label
; COL 2 residue
; COL 3 seqpos
; COL 4 align_bp
; ENTRY mtu
; LENGTH 168
; ----------
N C 1 .
N U 2 .
N U 3 17
N C 4 16
N G 5 15
N C 6 14
N A 7 .
N U 8 .
N C 9 .
N A 10 .
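
As an illustration of reading the ct2col output, the sketch below collects base pairs from the rows shown above. It rests on an assumption inferred from the examples rather than stated on this page: that align_bp holds the position of the pairing partner and that "." marks an unpaired position. The function name `base_pairs` is hypothetical.

```python
def base_pairs(rows, pos_col=3, bp_col=4):
    """Return sorted (i, j) pairs with i < j, read from the position and align_bp columns.

    Assumes align_bp gives the partner's position and "." means unpaired.
    """
    pairs = set()
    for row in rows:
        words = row.split()
        partner = words[bp_col - 1]
        if partner != ".":
            i, j = int(words[pos_col - 1]), int(partner)
            # Store each pair once, whichever side of the helix the row is on.
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# The rows of the ct2col example (only the beginning of the file).
CT_ROWS = [
    "N C 1 .",
    "N U 2 .",
    "N U 3 17",
    "N C 4 16",
    "N G 5 15",
    "N C 6 14",
    "N A 7 .",
    "N U 8 .",
    "N C 9 .",
    "N A 10 .",
]

pairs = base_pairs(CT_ROWS)
print(pairs)
```

Using a set means a pair listed from both of its positions, as would happen in a complete file, is reported only once.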

Output from gb2col:

; File generated by gb2col
; ========================================================================
; TYPE DNA
; COL 1 label
; COL 2 residue
; COL 3 seqpos
; ENTRY MTU88049
; LENGTH 2805
; ACCESSION U88049
; ----------
N t 1
N t 2
N g 3
N g 4
N g 5
N c 6
N c 7
N g 8
N c 9
N c 10

Output from blast2col:

; Generated by blast2col
; ========================================================================
; TYPE DNA_blast
; COL 1 label
; COL 2 query_residue
; COL 3 match
; COL 4 subject_residue
; COL 5 query_seqpos
; COL 6 subject_seqpos
; ENTRY MTU88049_vs_U88049
; BLAST_VERSION BLASTN 2.0.11 [Jan-20-2000]
; QUERY MTU88049
; QUERY_LENGTH 200
; SUBJECT U88049
; SUBJECT_COMMENT MTU88049 2805 bp DNA BCT U88049 .
; SUBJECT_STRAND Plus
; SUBJECT_LENGTH 2805
; ALIGNMENT_LENGTH 200
; SCORE 396
; EXPECT 1e-109
; IDENTITIES 200
; ----------
N T - T 1 1
N T - T 2 2
N G - G 3 3
N G - G 4 4
N G - G 5 5
N C - C 6 6
N C - C 7 7
N G - G 8 8
N C - C 9 9
N C - C 10 10
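
A blast2col entry can be processed the same way as the others. The sketch below counts positions where the query and subject residues agree, using the ten rows shown above. The function name `count_identities` is hypothetical, and treating this count as comparable to the IDENTITIES header field is an assumption, not something this page states.

```python
def count_identities(rows, query_col=2, subject_col=4):
    """Count rows whose query and subject residues match, ignoring case."""
    identical = 0
    for row in rows:
        words = row.split()
        if words[query_col - 1].upper() == words[subject_col - 1].upper():
            identical += 1
    return identical


# The rows of the blast2col example (only the beginning of the file).
BLAST_ROWS = [
    "N T - T 1 1",
    "N T - T 2 2",
    "N G - G 3 3",
    "N G - G 4 4",
    "N G - G 5 5",
    "N C - C 6 6",
    "N C - C 7 7",
    "N G - G 8 8",
    "N C - C 9 9",
    "N C - C 10 10",
]

identities = count_identities(BLAST_ROWS)
print(identities, "of", len(BLAST_ROWS), "positions identical")
```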

Comments, questions, etc., email gorodkin@rth.dk.

Last updated March 26th, 2007 by Jan Gorodkin