oformat

Next Generation Sequencing-Related Formats

RNA Structure Format

Detailed definitions can be found at http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html. Commonly used formats are:

Connect format (CT)
Dot-bracket format
Stockholm Format

Connect format (CT)

Columns 1, 3, 4, and 6 redundantly give sequence indices (the index i, i-1, i+1, and i again). The informative columns are 2 and 5: column 2 gives the nucleotide at position 'i', and column 5 gives 'j' if (i,j) is a base-pair, otherwise zero. One could envisage encoding multiple (aligned) sequences and structures in this format by alternating sequence and structure columns in the one file.

Example:

   73 ENERGY =     -17.50    S.cerevisiae_tRNA-PHE
    1 G       0    2   72    1
    2 C       1    3   71    2
    3 G       2    4   70    3
    4 G       3    5   69    4
    5 A       4    6   68    5
    6 U       5    7   67    6
    7 U       6    8   66    7
    8 U       7    9    0    8
                 .
                 .
                 .
   66 A      65   67    7   66
   67 A      66   68    6   67
   68 U      67   69    5   68
   69 U      68   70    4   69
   70 C      69   71    3   70
   71 G      70   72    2   71
   72 C      71   73    1   72
   73 A      72   74    0   73
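
As a minimal sketch of how these columns can be read, the following Python function parses one CT record into its sequence and a list of pairing partners. The function name `parse_ct_record` and the shortened 6-base record are illustrative assumptions, not part of any package or real data set, and the header layout is assumed to match the example above.

```python
def parse_ct_record(text):
    """Parse a single CT record into (length, energy, title, sequence, partners).

    partners[i-1] holds column 5 for base i: the 1-based index of the
    pairing partner, or 0 if base i is unpaired.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    length = int(header[0])
    # Header layout assumed here: "<length> ENERGY = <value> <title>"
    energy = float(header[3])
    title = " ".join(header[4:])
    sequence = []
    partners = []
    for ln in lines[1:1 + length]:
        fields = ln.split()
        sequence.append(fields[1])       # column 2: nucleotide
        partners.append(int(fields[4]))  # column 5: pairing partner
    return length, energy, title, "".join(sequence), partners


# Hypothetical toy record (not real data): a 6-base hairpin.
toy_ct = """\
    6 ENERGY =     -1.00    toy_hairpin
    1 G       0    2    6    1
    2 C       1    3    5    2
    3 A       2    4    0    3
    4 A       3    5    0    4
    5 G       4    6    2    5
    6 C       5    7    1    6
"""

length, energy, title, seq, partners = parse_ct_record(toy_ct)
print(seq, partners)
```

Only columns 2 and 5 are kept, since the other four columns can be regenerated from the row index.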

Dot-bracket format

Matching parentheses in positions 'i' and 'j' indicate a base-pair; unpaired positions are marked with '.'. A common complaint is that this format cannot represent pseudoknots unambiguously. However, by using additional bracket types '[', ']', '{', '}', '<', '>', 'A', 'a', 'B', 'b', 'C', 'c', ... one can represent knots of very high order unambiguously. Alternatively, as Sean Eddy discusses in the Infernal documentation, these extra symbols can be used to 'mark up' the structure to discriminate between different loop types.

>S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
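
To illustrate why extra bracket types remove the ambiguity, here is a small sketch that keeps one stack per bracket family and returns the base pairs. The function name `dotbracket_pairs` is hypothetical, and the sketch assumes that an upper-case letter opens a pair and the matching lower-case letter closes it.

```python
def dotbracket_pairs(structure):
    """Return sorted 1-based (i, j) base pairs from a dot-bracket string.

    Each bracket family gets its own stack, so crossing (pseudoknotted)
    pairs written with different bracket types stay unambiguous.
    """
    # Map every closing symbol to its opening symbol. Letters are assumed
    # to open in upper case ('A') and close in lower case ('a').
    openers = {")": "(", "]": "[", "}": "{", ">": "<"}
    for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        openers[ch.lower()] = ch
    stacks = {op: [] for op in openers.values()}
    pairs = []
    for j, ch in enumerate(structure, start=1):
        if ch in stacks:
            stacks[ch].append(j)
        elif ch in openers:
            stack = stacks[openers[ch]]
            if not stack:
                raise ValueError("unbalanced %r at position %d" % (ch, j))
            pairs.append((stack.pop(), j))
        # any other character (normally '.') is treated as unpaired
    leftover = [i for st in stacks.values() for i in st]
    if leftover:
        raise ValueError("unclosed brackets at positions %s" % sorted(leftover))
    return sorted(pairs)


# Structure line of the tRNA example above, without the energy.
trna = "(((((((..((((........)))).((((.........)))).....(((((.......))))))))))))."
print(len(dotbracket_pairs(trna)))

# A made-up string with a crossing '[' ']' helix.
print(dotbracket_pairs("..((..[[..))..]].."))
```

With a single bracket type the crossing helix in the second string could not be written down, because the closing symbols would be matched to the wrong opening ones.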

Conversion between CT format and DOT format

(1) From CT format to DOT format

The conversion is straightforward. For each row, compare column 1 (the index) with column 5 (the pairing partner):

if column 5 is 0, print '.'
if column 5 is not 0 and larger than column 1, print '('
if column 5 is not 0 and smaller than column 1, print ')'

Source code:

    import sys

    def ct2dot(ctname):
        '''
        CT format to DOT format.
        Input can be a CT file name or a string holding one or more CT records.
        Returns (structures, scores).
        '''
        structures = []
        scores = []
        try: # file name
            with open(ctname) as fh:
                cts = fh.read().split('\n')
        except (IOError, OSError): # string of CTs
            cts = ctname.split('\n')
        if 'ENERGY' not in cts[0]: # No structure
            sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
            return (structures, scores)
        lstr = None # structure currently being built
        score = 0
        for line in cts:
            if 'ENERGY' in line: # Parse header
                if lstr is not None: # store the previous structure
                    structures.append(''.join(lstr))
                    scores.append(score)
                fields = line.split()
                cnt = int(fields[0]) # sequence length
                score = float(fields[3]) # energy
                lstr = ['.'] * cnt # structure, unpaired by default
            elif line.strip(): # Read structure information
                fields = line.split()
                i = int(fields[0]) # column 1: 1-based index of this base
                s = int(fields[4]) # column 5: 1-based index of its partner, 0 if unpaired
                if s != 0:
                    lstr[i-1] = '(' if i < s else ')'
        if lstr is not None: # store the last structure
            structures.append(''.join(lstr))
            scores.append(score)
        return (structures, scores)
    ct2dot=staticmethod(ct2dot) # for use inside a class body

(2) From DOT format to CT format

We need a stack to achieve this. Iterate over every character in the DOT string:

if '.', pass
if '(', push the index into stack
if ')', pop one index, named i, from stack. The current index is named j. Keep record of the pair <i,j> and <j,i>

Finally we can generate the CT file according to the pairs we have got.
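
A minimal, self-contained sketch of these steps on plain strings is shown here. The function name `dot_to_ct`, its arguments, and the toy input are assumptions made for illustration; the column widths are only approximate and are not taken from any CT specification.

```python
def dot_to_ct(name, seq, structure, energy):
    """Build CT-format text from a sequence and a '()'-only dot-bracket string."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure differ in length")
    partner = {}   # 1-based index -> 1-based partner index
    stack = []     # positions of '(' still waiting for their ')'
    for j, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            i = stack.pop()   # most recent unmatched '('
            partner[i] = j    # record <i,j>
            partner[j] = i    # and <j,i>
    lines = ["%5d ENERGY = %.2f  %s" % (len(seq), energy, name)]
    for i, base in enumerate(seq, start=1):
        # columns: index, base, index-1, index+1, partner (0 if unpaired), index
        lines.append("%5d %s %7d %4d %4d %4d"
                     % (i, base, i - 1, i + 1, partner.get(i, 0), i))
    return "\n".join(lines)


# Hypothetical toy input, not real data.
ct_text = dot_to_ct("toy_hairpin", "GCAAGC", "((..))", -1.0)
print(ct_text)
```

Because a ')' always pairs with the most recently opened '(', a single stack is enough as long as the structure contains no crossing pairs.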

Source code:

    def dot2ct(tfasts):
        '''
        Convert DOT format to CT format.
        tfasts must provide .name, .seq, .structures and .scores, and len(tfasts)
        must return the sequence length.
        '''
        ctstring = []
        for st, sc in zip(tfasts.structures, tfasts.scores):
            # header line: length, energy, name
            ctstring.append("%5d  ENERGY = %-3.2f  %s" % (len(tfasts), sc, tfasts.name))
            pairs = {} # 1-based index -> 1-based partner index
            stack = [] # indices of '(' not yet closed
            for i, c in enumerate(st):
                if c == '(':
                    stack.append(i+1)
                elif c == ')':
                    pairs[i+1] = stack.pop()
                    pairs[pairs[i+1]] = i+1
            # one row per base: index, base, index-1, index+1, partner (0 if unpaired), index
            for i in range(1, len(tfasts)+1):
                ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i,0), i))
        return '\n'.join(ctstring)
    dot2ct=staticmethod(dot2ct) # for use inside a class body

Conversion between CT format and DOT format (pseudoknots supported)

(1) From CT format to DOT format

The conversion again compares column 1 and column 5, with one extra rule for crossing pairs:

The first pairs seen are treated as normal pairs and written with '(' and ')'.
Later pairs that conflict with (cross) them are treated as pseudoknots and written with '[' and ']'.

For example:

The following two strings represent the same structure, but we prefer the first representation.

(1) "...(((...[[[..)))...]]]..."
(2) "...[[[...(((..]]]...)))..."
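
As a quick check that the two strings really encode the same base pairs, the following sketch extracts the pair set from each with one stack per bracket type. The helper name `pairs_two_stacks` is hypothetical.

```python
def pairs_two_stacks(structure):
    """Return the set of 1-based (i, j) pairs from a '()'/'[]' dot-bracket string."""
    stacks = {"(": [], "[": []}      # one stack per bracket type
    closing = {")": "(", "]": "["}   # closing symbol -> opening symbol
    pairs = set()
    for j, ch in enumerate(structure, start=1):
        if ch in stacks:
            stacks[ch].append(j)
        elif ch in closing:
            pairs.add((stacks[closing[ch]].pop(), j))
    return pairs


first = "...(((...[[[..)))...]]]..."
second = "...[[[...(((..]]]...)))..."
print(pairs_two_stacks(first) == pairs_two_stacks(second))
```

Both strings yield the same six pairs; only the choice of which helix gets the '[' ']' symbols differs, which is why a fixed rule (first seen pairs get '(' ')') is useful.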

Source code:

    import sys

    def ct2dot(ctname):
        '''
        CT format to DOT format.
        Input can be a CT file name or a string holding one or more CT records.
        For pseudoknots like this:
            '(((..[[[..)))...]]]'
            The first seen pairs are labelled with '()', and the following conflicting ones are labelled with '[]'.
        Returns (structures, scores).
        '''
        structures = []
        scores = []
        try: # file name
            with open(ctname) as fh:
                cts = fh.read().split('\n')
        except (IOError, OSError): # string of CTs
            cts = ctname.split('\n')
        if 'energy' not in cts[0].lower(): # No structure
            sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
            return (structures, scores)
        idx = 0
        while idx < len(cts):
            if 'energy' not in cts[idx].lower(): # skip anything that is not a header
                idx += 1
                continue
            fields = cts[idx].split()
            cnt = int(fields[0]) # sequence length
            score = float(fields[3]) # energy
            cons = ['.'] * cnt # structure, unpaired by default
            idx += 1
            # bs holds the closing positions of the '()' pairs that are still open;
            # cnt+1 is a sentinel so that min(bs) always exists
            bs = [cnt+1]
            for i in range(cnt):
                a = i+1 # current base (1-based)
                b = int(cts[idx].split()[4]) # its partner, 0 if unpaired
                idx += 1
                if a > b: # unpaired base or closing side of a pair
                    if a in bs:
                        bs.remove(a) # this '()' pair is now closed
                    continue
                if b > min(bs): # closes after an enclosing pair does: pseudoknot
                    cons[a-1] = '['
                    cons[b-1] = ']'
                else:
                    cons[a-1] = '('
                    cons[b-1] = ')'
                    bs.append(b)
            structures.append(''.join(cons))
            scores.append(score)
        return (structures, scores)
    ct2dot=staticmethod(ct2dot) # for use inside a class body

(2) From DOT format to CT format

We need two stacks, one per bracket type. Iterate over every character in the DOT string:

if '.', pass
if '(', push the index into stack 1
if '[', push the index into stack 2
if ')', pop one index, named i, from stack 1. The current index is named j. Keep record of the pair <i,j> and <j,i>
if ']', pop one index, named i, from stack 2. The current index is named j. Keep record of the pair <i,j> and <j,i>

Finally we can generate the CT file according to the pairs we have got.

    def dot2ct(tfasts):
        '''
        Convert DOT format to CT format.
        tfasts must provide .name, .seq, .structures and .scores, and len(tfasts)
        must return the sequence length.
        '''
        ctstring = []
        for st, sc in zip(tfasts.structures, tfasts.scores):
            # header line: length, energy, name
            ctstring.append("%5d  ENERGY = %-3.2f  %s" % (len(tfasts), sc, tfasts.name))
            stack1 = [] # indices of '(' not yet closed
            stack2 = [] # indices of '[' not yet closed
            pairs = {} # 1-based index -> 1-based partner index
            for i, c in enumerate(st):
                if c == '(':
                    stack1.append(i+1)
                elif c == '[':
                    stack2.append(i+1)
                elif c == ')':
                    pairs[i+1] = stack1.pop()
                    pairs[pairs[i+1]] = i+1
                elif c == ']':
                    pairs[i+1] = stack2.pop()
                    pairs[pairs[i+1]] = i+1
            # one row per base: index, base, index-1, index+1, partner (0 if unpaired), index
            for i in range(1, len(tfasts)+1):
                ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i,0), i))
        return '\n'.join(ctstring)
    dot2ct=staticmethod(dot2ct) # for use inside a class body
