oformat

Wang Yunfei edited this page Jun 6, 2017 · 2 revisions

Next Generation Sequencing-Related Formats

UCSC formats

RNA Structure Format

Detailed definitions can be found here: http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html. The commonly used formats are:

  • Connect format (CT)
  • Dot-bracket format
  • Stockholm Format

Connect format (CT)

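The header line gives the sequence length, the energy, and the sequence name. In the rows that follow, columns 1, 3, 4, and 6 redundantly give sequence indices (the position itself, the previous position, the next position, and the position again). The informative columns are 2 and 5: column 2 gives the base, and column 5 gives 'j' at position 'i' if (i,j) is a base pair and zero otherwise. One could envisage encoding multiple (aligned) sequences and structures in this format by alternating sequence and structure columns in the one file.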

Example:

   73 ENERGY =     -17.50    S.cerevisiae_tRNA-PHE
    1 G       0    2   72    1
    2 C       1    3   71    2
    3 G       2    4   70    3
    4 G       3    5   69    4
    5 A       4    6   68    5
    6 U       5    7   67    6
    7 U       6    8   66    7
    8 U       7    9    0    8
                 .
                 .
                 .
   66 A      65   67    7   66
   67 A      66   68    6   67
   68 U      67   69    5   68
   69 U      68   70    4   69
   70 C      69   71    3   70
   71 G      70   72    2   71
   72 C      71   73    1   72
   73 A      72   74    0   73
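
The column layout can be illustrated with a minimal parsing sketch. The function name `parse_ct` and the six-base hairpin record are hypothetical, chosen only to keep the example short; the sketch assumes a single record with one header line and well-formed rows.

```python
def parse_ct(text):
    """Parse one CT record into (length, energy, name, sequence, partners)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    length = int(header[0])        # header: "<length> ENERGY = <value> <name>"
    energy = float(header[3])
    name = header[4] if len(header) > 4 else ""
    sequence = []
    partners = []
    for row in lines[1:1 + length]:
        cols = row.split()
        sequence.append(cols[1])       # column 2: base
        partners.append(int(cols[4]))  # column 5: partner index, 0 if unpaired
    return length, energy, name, "".join(sequence), partners


# Hypothetical toy record: a 6-base hairpin in which 1 pairs with 6 and 2 with 5.
toy_ct = """\
    6 ENERGY =      -1.00    toy_hairpin
    1 G       0    2    6    1
    2 C       1    3    5    2
    3 A       2    4    0    3
    4 A       3    5    0    4
    5 G       4    6    2    5
    6 C       5    7    1    6
"""

length, energy, name, sequence, partners = parse_ct(toy_ct)
print(sequence, partners)
```

Only columns 2 and 5 are read; the remaining index columns can be regenerated from the row number.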

Dot-bracket format

Matching parentheses in positions 'i' and 'j' indicate a base pair; unpaired positions are marked with '.'. A common complaint is that this format cannot represent pseudoknots unambiguously. However, with additional bracket types such as '[', ']', '{', '}', '<', '>', 'A', 'a', 'B', 'b', 'C', 'c', ... one can represent knots of very high order unambiguously. Alternatively, as Sean Eddy discusses in the Infernal documentation, the extra bracket types can be used to 'mark up' the structure to discriminate between different loop types.

>S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
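
A record like the one above can be read with a few lines of code. This sketch assumes the three-line layout shown (a '>' header, the sequence, then the structure followed by the energy in parentheses); the function name `parse_dot_record` is hypothetical.

```python
def parse_dot_record(text):
    """Split a three-line dot-bracket record into its parts."""
    header, sequence, last = [ln.strip() for ln in text.strip().splitlines()]
    name = header[1:] if header.startswith(">") else header
    # The structure and the energy are separated by a space: "((..)) (-17.50)"
    structure, _, energy_field = last.partition(" ")
    energy_field = energy_field.strip()
    energy = float(energy_field.strip("()")) if energy_field else None
    return name, sequence, structure, energy


record = """\
>S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
"""

name, sequence, structure, energy = parse_dot_record(record)
print(name, energy)
```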

Conversion between CT format and DOT format

(1) From CT format to DOT format

The conversion is straightforward. For each row, compare column 1 (the position) with column 5 (the pairing partner):

  • if column 5 is 0, print '.'
  • if column 5 is not 0 and is larger than column 1, print '('
  • if column 5 is not 0 and is smaller than column 1, print ')'

Source code:

    import sys

    def ct2dot(ctname):
        '''
        CT format to DOT format.
        Input can be a CT file name or a string holding one or more CT records.
        Returns a tuple (structures, scores).
        '''
        structures = []
        scores = []
        try: # file name
            with open(ctname) as fh:
                cts = fh.readlines()
        except (IOError, OSError, ValueError): # not a readable file: treat as a CT string
            cts = ctname.split('\n')
        if not cts or 'ENERGY' not in cts[0]: # No structure
            sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
            return (structures, scores)
        lstr = None
        for line in cts:
            fields = line.split()
            if 'ENERGY' in line: # Parse header
                if lstr is not None: # save the previous record
                    structures.append(''.join(lstr))
                    scores.append(score)
                cnt = int(fields[0]) # sequence length
                score = float(fields[3]) # energy
                lstr = ['.'] * cnt # structure, all positions unpaired to start
                i = 0
            elif len(fields) >= 5: # Read structure information (blank lines are skipped)
                s = int(fields[4]) # column 5: pairing partner, 0 if unpaired
                i += 1 # 1-based position of this row
                if s != 0:
                    lstr[i-1] = '(' if i < s else ')'
        if lstr is not None:
            structures.append(''.join(lstr))
            scores.append(score)
        return (structures, scores)
    ct2dot=staticmethod(ct2dot) # only needed when the function is defined inside a class body

(2) From DOT format to CT format

A stack is needed for this direction. Iterate over every character of the DOT string:

  • if '.', do nothing
  • if '(', push the current index onto the stack
  • if ')', pop one index, i, from the stack. With the current index as j, record the pairs <i,j> and <j,i>

Finally, generate the CT file from the recorded pairs.

Source code:

    def dot2ct(tfasts):
        '''
        Convert DOT format to CT format.
        tfasts must provide name, seq, structures, and scores, and support len().
        '''
        ctstring = []
        for st, sc in zip(tfasts.structures, tfasts.scores):
            # print header
            ctstring.append("%5d  ENERGY = %-3.2f  %s" % (len(tfasts), sc, tfasts.name))
            pairs = {}
            stack = []
            for i, c in enumerate(st):
                if c == '(':
                    stack.append(i+1) # remember the 1-based opening position
                elif c == ')':
                    pairs[i+1] = stack.pop()
                    pairs[pairs[i+1]] = i+1
            # one row per base: index, base, previous, next, partner (0 if unpaired), index
            for i in range(1, len(tfasts)+1):
                ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i,0), i))
        return '\n'.join(ctstring)
    dot2ct=staticmethod(dot2ct) # only needed when the function is defined inside a class body

Conversion between CT format and DOT format (pseudoknots supported)

(1) From CT format to DOT format

The conversion again compares column 1 with column 5, with one extra rule for pairs that cross:

  • Pairs seen first are treated as normal pairs and written with '(' and ')'.
  • Later pairs that conflict with them are treated as pseudoknots and written with '[' and ']'.

For example, the following two strings represent the same structure, but the first representation is preferred:

(1) "...(((...[[[..)))...]]]..."
(2) "...[[[...(((..]]]...)))..."
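
That the two strings encode the same base pairs can be checked with a short sketch. It keeps one stack per bracket type and compares the resulting sets of pairs; the helper name `pairs_from_dot` is hypothetical, and only '()' and '[]' are handled, matching the example.

```python
def pairs_from_dot(structure):
    """Return the set of 1-based (i, j) base pairs in a dot-bracket string."""
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}      # one stack per bracket type
    pairs = set()
    for j, char in enumerate(structure, start=1):
        if char in stacks:
            stacks[char].append(j)
        elif char in closers:
            i = stacks[closers[char]].pop()
            pairs.add((i, j))
    if any(stacks.values()):
        raise ValueError("unbalanced brackets")
    return pairs


first = "...(((...[[[..)))...]]]..."
second = "...[[[...(((..]]]...)))..."
print(pairs_from_dot(first) == pairs_from_dot(second))  # prints True
```

Because the bracket type carries no meaning beyond keeping crossing pairs apart, swapping '()' and '[]' leaves the pair set unchanged.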

Source code:

    import sys

    def ct2dot(ctname):
        '''
        CT format to DOT format.
        Input can be a CT file name or a string holding one or more CT records.
        For pseudoknots like this:
            '(((..[[[..)))...]]]'
            The first seen pairs are labelled with '()', and the following conflict ones are labelled as '[]'
        '''
        structures = []
        scores = []
        try: # file name
            with open(ctname) as fh:
                cts = fh.readlines()
        except (IOError, OSError, ValueError): # not a readable file: treat as a CT string
            cts = ctname.split('\n')
        if not cts or cts[0].lower().find('energy') == -1: # No structure
            sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
            return (structures, scores)
        idx = 0
        while idx < len(cts):
            if cts[idx].lower().find('energy') != -1:
                fields = cts[idx].split()
                cnt = int(fields[0]) # sequence length
                score = float(fields[3]) # energy
                cons = ['.'] * cnt # structure, all positions unpaired to start
                idx += 1
                # read structure information
                bs = [cnt+1] # closing positions of the '()' pairs that are still open
                for i in range(cnt):
                    a = i+1
                    b = int(cts[idx].split()[4])
                    idx += 1
                    if a > b: # unpaired (b == 0) or the closing side of a pair
                        if a in bs:
                            bs.remove(a)
                        continue
                    minr = min(bs)
                    if b > minr: # crosses an open '()' pair: possible pseudoknot
                        cons[a-1] = '['
                        cons[b-1] = ']'
                    else:
                        cons[a-1] = '('
                        cons[b-1] = ')'
                        bs.append(b)
                structures.append(''.join(cons))
                scores.append(score)
            else:
                idx += 1
        return (structures, scores)
    ct2dot=staticmethod(ct2dot) # only needed when the function is defined inside a class body

(2) From DOT format to CT format

Two stacks are needed for this direction, one per bracket type. Iterate over every character of the DOT string:

  • if '.', do nothing
  • if '(', push the current index onto stack 1
  • if '[', push the current index onto stack 2
  • if ')', pop one index, i, from stack 1. With the current index as j, record the pairs <i,j> and <j,i>
  • if ']', pop one index, i, from stack 2. With the current index as j, record the pairs <i,j> and <j,i>

Finally, generate the CT file from the recorded pairs.

Source code:

    def dot2ct(tfasts):
        '''
        Convert DOT format to CT format.
        tfasts must provide name, seq, structures, and scores, and support len().
        '''
        ctstring = []
        for st, sc in zip(tfasts.structures, tfasts.scores):
            # print header
            ctstring.append("%5d  ENERGY = %-3.2f  %s" % (len(tfasts), sc, tfasts.name))
            stack1 = [] # open '(' positions
            stack2 = [] # open '[' positions
            pairs = {}
            for i, c in enumerate(st):
                if c == '(':
                    stack1.append(i+1)
                elif c == '[':
                    stack2.append(i+1)
                elif c == ')':
                    pairs[i+1] = stack1.pop()
                    pairs[pairs[i+1]] = i+1
                elif c == ']':
                    pairs[i+1] = stack2.pop()
                    pairs[pairs[i+1]] = i+1
            # one row per base: index, base, previous, next, partner (0 if unpaired), index
            for i in range(1, len(tfasts)+1):
                ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i,0), i))
        return '\n'.join(ctstring)
    dot2ct=staticmethod(dot2ct) # only needed when the function is defined inside a class body