oformat
Detailed definitions can be found at http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html. Commonly used formats are:
- Connect format (CT)
- Dot-bracket format
- Stockholm Format
In CT format, columns 1, 3, 4, and 6 redundantly give sequence indices. The informative columns are 2 and 5: column 2 gives the base, and column 5 gives 'j' in row 'i' if (i,j) is a base pair, and zero otherwise. One could envisage encoding multiple (aligned) sequences and structures in this format by alternating sequence and structure columns in a single file.
Example:
73 ENERGY = -17.50 S.cerevisiae_tRNA-PHE
1 G 0 2 72 1
2 C 1 3 71 2
3 G 2 4 70 3
4 G 3 5 69 4
5 A 4 6 68 5
6 U 5 7 67 6
7 U 6 8 66 7
8 U 7 9 0 8
. . .
66 A 65 67 7 66
67 A 66 68 6 67
68 U 67 69 5 68
69 U 68 70 4 69
70 C 69 71 3 70
71 G 70 72 2 71
72 C 71 73 1 72
73 A 72 74 0 73
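As a minimal sketch of how these columns can be read, the following Python collects the sequence from column 2 and the pairing partners from column 5. The function name `parse_ct` and the toy hairpin are made up for illustration, and the input is assumed to hold a single structure: one header line followed by six-column rows.

```python
def parse_ct(text):
    """Return (sequence, pairs) from a single-structure CT string.

    pairs maps each paired 1-based index to its partner;
    unpaired bases are left out.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    length = int(lines[0].split()[0])       # header: first field is the length
    seq = []
    pairs = {}
    for row in lines[1:length + 1]:
        cols = row.split()
        i, base, j = int(cols[0]), cols[1], int(cols[4])
        seq.append(base)
        if j != 0:                          # column 5 is 0 for unpaired bases
            pairs[i] = j
    return "".join(seq), pairs


# A made-up 8-nt hairpin, used only for illustration (hypothetical data).
ct_text = """\
8 ENERGY = -1.00 toy_hairpin
1 G 0 2 8 1
2 C 1 3 7 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 G 6 8 2 7
8 C 7 9 1 8
"""
seq, pairs = parse_ct(ct_text)
print(seq)    # GCAAAAGC
print(pairs)  # {1: 8, 2: 7, 7: 2, 8: 1}
```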
In dot-bracket format, matching parentheses in positions 'i' and 'j' indicate a base pair; unpaired positions are marked with '.'. A common complaint is that this format cannot represent pseudoknots unambiguously. However, with additional bracket types '[', ']', '{', '}', '<', '>', 'A', 'a', 'B', 'b', 'C', 'c', ... one can represent knots of very high order unambiguously. Alternatively, as Sean Eddy discusses in his Infernal documentation, the extra bracket types can be used to 'mark up' the structure to discriminate between different loop types.
>S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
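To illustrate how several bracket types can coexist in one string, here is a hedged sketch that keeps one stack per bracket type. The function name `dot_to_pairs` is hypothetical, and the convention that an uppercase letter opens a pair while the matching lowercase letter closes it is an assumption made for this example.

```python
def dot_to_pairs(structure,
                 brackets=("()", "[]", "{}", "<>", "Aa", "Bb", "Cc")):
    """Return a dict mapping each paired 1-based index to its partner."""
    opener_of = {pair[1]: pair[0] for pair in brackets}   # closer -> opener
    stacks = {pair[0]: [] for pair in brackets}           # one stack per type
    pairs = {}
    for pos, ch in enumerate(structure, start=1):
        if ch in stacks:                  # opening symbol: remember position
            stacks[ch].append(pos)
        elif ch in opener_of:             # closing symbol: match same type
            i = stacks[opener_of[ch]].pop()
            pairs[i] = pos
            pairs[pos] = i
        # any other character (for example '.') is treated as unpaired
    return pairs


pairs = dot_to_pairs("((..[[..))..]]")
print(sorted(pairs.items()))
# [(1, 10), (2, 9), (5, 14), (6, 13), (9, 2), (10, 1), (13, 6), (14, 5)]
```

Because each bracket type has its own stack, crossing pairs of different types never interfere with each other.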
(1) From CT format to DOT format
The conversion is straightforward. For each row, compare column 1 (the index) with column 5 (the pairing partner):
- if column 5 is 0, print '.'
- if column 5 is not 0 and larger than column 1, print '('
- if column 5 is not 0 and smaller than column 1, print ')'
Source code:
import sys

def ct2dot(ctname):
    '''
    CT format to DOT format.
    Input can be a CT file name or a CT file string.
    Returns (structures, scores).
    '''
    structures = []
    scores = []
    try:  # file name
        fh = open(ctname)
        cts = fh.readlines()
        fh.close()
    except (IOError, OSError):  # string of CTs
        cts = ctname.split('\n')
    if 'ENERGY' not in cts[0]:  # No structure
        sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
        return (structures, scores)
    lstr = None
    for line in cts:
        if 'ENERGY' in line:  # Parse header
            if lstr is not None:  # store the previous structure
                structures.append(''.join(lstr))
                scores.append(score)
            fields = line.split()
            cnt = int(fields[0])      # sequence length
            score = float(fields[3])  # energy
            lstr = ['.'] * cnt        # structure, all unpaired to start with
            i = 0
        elif line.strip():  # Read structure information, skipping blank lines
            s = int(line.split()[4])  # pairing partner (column 5)
            i += 1
            if s != 0:
                lstr[i-1] = '(' if i < s else ')'  # i is 1-based, lstr is 0-based
    if lstr is not None:
        structures.append(''.join(lstr))
        scores.append(score)
    return (structures, scores)
ct2dot = staticmethod(ct2dot)
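Stripped of file handling, the three column rules can be sketched in a few lines. This is only an illustration: the helper name `ct_rows_to_dot` is hypothetical, and the rows are assumed to be already reduced to (column 1, column 5) pairs.

```python
def ct_rows_to_dot(rows):
    """rows: iterable of (index, partner) taken from CT columns 1 and 5."""
    symbols = []
    for index, partner in rows:
        if partner == 0:            # unpaired base
            symbols.append(".")
        elif partner > index:       # partner lies downstream: opening side
            symbols.append("(")
        else:                       # partner lies upstream: closing side
            symbols.append(")")
    return "".join(symbols)


# Hypothetical 8-nt hairpin: bases 1-8 and 2-7 are paired.
rows = [(1, 8), (2, 7), (3, 0), (4, 0), (5, 0), (6, 0), (7, 2), (8, 1)]
dot = ct_rows_to_dot(rows)
print(dot)  # ((....))
```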
(2) From DOT format to CT format
We need a stack to achieve this. Iterate over every character in the DOT string:
- if '.', pass
- if '(', push the index onto the stack
- if ')', pop one index, named i, from the stack. The current index is named j. Record the pairs <i,j> and <j,i>
Finally, generate the CT file from the recorded pairs.
Source code:
def dot2ct(tfasts):
    ''' Convert DOT format to CT format. '''
    ctstring = []
    for st, sc in zip(tfasts.structures, tfasts.scores):
        # print header
        ctstring.append("%5d ENERGY = %-3.2f %s" % (len(tfasts), sc, tfasts.name))
        pairs = {}  # maps each paired 1-based index to its partner
        stack = []  # positions of '(' that are not closed yet
        for i, c in enumerate(st):
            if c == '(':
                stack.append(i+1)
            elif c == ')':
                pairs[i+1] = stack.pop()
                pairs[pairs[i+1]] = i+1
        # one row per base: index, base, index-1, index+1, partner (0 if unpaired), index
        for i in range(1, len(tfasts)+1):
            ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i, 0), i))
    return '\n'.join(ctstring)
dot2ct = staticmethod(dot2ct)
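The same stack idea can be tried on plain strings, without the `tfasts` object. The sketch below is an assumption-laden illustration rather than the library's interface: the function name `dot_to_ct`, its arguments, and the column widths are all chosen for this example.

```python
def dot_to_ct(name, seq, structure, energy):
    """Build CT text from a sequence and a '('/')'/'.' structure string."""
    pairs = {}
    stack = []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)           # remember where the pair opens
        elif ch == ")":
            i = stack.pop()             # most recent unmatched '('
            pairs[i] = pos
            pairs[pos] = i
    lines = ["%5d ENERGY = %.2f %s" % (len(seq), energy, name)]
    for i, base in enumerate(seq, start=1):
        # columns: index, base, index-1, index+1, partner (0 if unpaired), index
        lines.append("%5d %s %5d %5d %5d %5d"
                     % (i, base, i - 1, i + 1, pairs.get(i, 0), i))
    return "\n".join(lines)


# Hypothetical toy input, used only for illustration.
ct = dot_to_ct("toy_hairpin", "GCAAAAGC", "((....))", -1.0)
print(ct)
```

An unbalanced structure string makes `stack.pop()` raise `IndexError`, which is a reasonable way to surface malformed input in a sketch like this.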
(1) From CT format to DOT format
With pseudoknots, checking column 1 against column 5 is no longer enough, because crossing pairs need a second bracket type:
- The pairs seen first are treated as normal pairs and written with '(' and ')'.
- Later pairs that conflict with (cross) them are treated as pseudoknots and written with '[' and ']'.
For example, the following two strings may represent the same structure, but we prefer the first representation:
(1) "...(((...[[[..)))...]]]..."
(2) "...[[[...(((..]]]...)))..."
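One way to check that two such strings describe the same structure is to compare the sets of base pairs they encode. The sketch below does this with one stack per bracket type; the helper name `pair_set` is hypothetical.

```python
def pair_set(structure):
    """Return the set of (i, j) base pairs, 1-based, with i < j."""
    stack_round, stack_square = [], []
    pairs = set()
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack_round.append(pos)
        elif ch == "[":
            stack_square.append(pos)
        elif ch == ")":
            pairs.add((stack_round.pop(), pos))
        elif ch == "]":
            pairs.add((stack_square.pop(), pos))
    return pairs


first = "...(((...[[[..)))...]]]..."
second = "...[[[...(((..]]]...)))..."
same = pair_set(first) == pair_set(second)
print(same)  # True
```

Both strings yield the pairs (4,17), (5,16), (6,15), (10,23), (11,22) and (12,21); only the choice of bracket type differs.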
Source code:
import sys

def ct2dot(ctname):
    '''
    CT format to DOT format.
    Input can be a CT file name or a CT file string.
    For pseudoknots like this:
        '(((..[[[..)))...]]]'
    the first seen pairs are labelled with '()', and the following
    conflicting ones are labelled with '[]'.
    '''
    structures = []
    scores = []
    try:  # file name
        fh = open(ctname)
        cts = fh.readlines()
        fh.close()
    except (IOError, OSError):  # string of CTs
        cts = ctname.split('\n')
    if 'energy' not in cts[0].lower():  # No structure
        sys.stderr.write("No structure predicted by Fold (RNAStructure package).\n")
        return (structures, scores)
    idx = 0
    while idx < len(cts):
        if 'energy' in cts[idx].lower():  # header line
            fields = cts[idx].split()
            cnt = int(fields[0])      # sequence length
            score = float(fields[3])  # energy
            cons = ['.'] * cnt        # structure, all unpaired to start with
            idx += 1
            # read structure information
            bs = [cnt+1]  # closing positions of open '()' pairs, plus a sentinel
            for i in range(cnt):
                a = i+1
                b = int(cts[idx].split()[4])  # pairing partner (column 5)
                idx += 1
                if a > b:  # unpaired (b == 0) or the closing side of a pair
                    if a in bs:
                        bs.remove(a)
                    continue
                minr = min(bs)
                if b > minr:  # possible pseudoknots
                    cons[a-1] = '['
                    cons[b-1] = ']'
                else:
                    cons[a-1] = '('
                    cons[b-1] = ')'
                    bs.append(b)  # only '()' pairs are tracked as open
            structures.append(''.join(cons))
            scores.append(score)
        else:
            idx += 1
    return (structures, scores)
ct2dot = staticmethod(ct2dot)
(2) From DOT format to CT format
We need two stacks to achieve this, one per bracket type. Iterate over every character in the DOT string:
- if '.', pass
- if '(', push the index onto stack 1
- if '[', push the index onto stack 2
- if ')', pop one index, named i, from stack 1. The current index is named j. Record the pairs <i,j> and <j,i>
- if ']', pop one index, named i, from stack 2. The current index is named j. Record the pairs <i,j> and <j,i>
Finally, generate the CT file from the recorded pairs.
def dot2ct(tfasts):
    ''' Convert DOT format to CT format. '''
    ctstring = []
    for st, sc in zip(tfasts.structures, tfasts.scores):
        # print header
        ctstring.append("%5d ENERGY = %-3.2f %s" % (len(tfasts), sc, tfasts.name))
        stack1 = []  # positions of '(' that are not closed yet
        stack2 = []  # positions of '[' that are not closed yet
        pairs = {}   # maps each paired 1-based index to its partner
        for i, c in enumerate(st):
            if c == '(':
                stack1.append(i+1)
            elif c == '[':
                stack2.append(i+1)
            elif c == ')':
                pairs[i+1] = stack1.pop()
                pairs[pairs[i+1]] = i+1
            elif c == ']':
                pairs[i+1] = stack2.pop()
                pairs[pairs[i+1]] = i+1
        # one row per base: index, base, index-1, index+1, partner (0 if unpaired), index
        for i in range(1, len(tfasts)+1):
            ctstring.append(" %4d %s %7d%4d %4d %4d" % (i, tfasts.seq[i-1], i-1, i+1, pairs.get(i, 0), i))
    return '\n'.join(ctstring)
dot2ct = staticmethod(dot2ct)
© 2017, Yunfei Wang (yfwang0405ATgmail.com), The University of Texas at Dallas