Description
Hello,
I'm trying to define single-copy orthogroups from the Nx.tsv files. I'm getting results that I find confusing, so I wrote a couple of lines to check whether a given Nx.tsv file contains only genes belonging to its descendant species, which is what I expect. Take N11.tsv as an example: in the species tree, this node has two descendant species:
['MGYG-HGUT-04532',
'DGYMR06203__metabat2_low_PE']
Then I loop over all the Nx.tsv files and check the MGYG-HGUT-04532 column in each one. I expect to find genes only in N11.tsv and in the files of its ancestors:
[Tree node 'N7' (0x7f514471e49),
Tree node 'N3' (0x7f5147961be),
Tree node 'N1' (0x7f51478373a),
Tree node 'N0' (0x7f514471e46)]
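To make the expectation explicit, here is a minimal sketch of how that lineage can be derived. The `PARENT` map is a hand-written stand-in for the species tree: it contains only the N11 lineage, and it assumes the ancestors listed above are ordered from nearest to root. The `lineage` helper is a hypothetical name, not part of OrthoFinder.

```python
# Hand-written child -> parent map covering only the N11 lineage.
# Assumption: the ancestor list in the post is ordered nearest-first.
PARENT = {
    "MGYG-HGUT-04532": "N11",
    "DGYMR06203__metabat2_low_PE": "N11",
    "N11": "N7",
    "N7": "N3",
    "N3": "N1",
    "N1": "N0",
}


def lineage(leaf, parent=PARENT):
    """Return the internal nodes from the leaf's parent up to the root."""
    path = []
    node = leaf
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


# Nodes whose Nx.tsv file I expect to contain genes for this species.
print(lineage("MGYG-HGUT-04532"))  # ['N11', 'N7', 'N3', 'N1', 'N0']
```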
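import pandas as pd
# root is a pathlib.Path pointing to the directory that holds the Nx.tsv files
nodes = [f'N{n}.tsv' for n in range(194)]
for n in nodes:
    # na_filter=False keeps empty cells as '' instead of NaN
    n_df = pd.read_csv(root.joinpath(n), sep='\t', na_filter=False)
    print(n, n_df.loc[:, 'MGYG-HGUT-04532'].unique(), sep='\n')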
Which produces:
N0.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
'GFNMCGMP_00381, GFNMCGMP_00380']
N1.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
'GFNMCGMP_00381, GFNMCGMP_00380']
N2.tsv
['']
N3.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
'GFNMCGMP_00381, GFNMCGMP_00380']
N4.tsv
['']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_00320'
'GFNMCGMP_00321' 'GFNMCGMP_00381, GFNMCGMP_00380']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00750, GFNMCGMP_00293' 'GFNMCGMP_00570'
'GFNMCGMP_01197, GFNMCGMP_00667' 'GFNMCGMP_00341'
...]
N12.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
'GFNMCGMP_01174' 'GFNMCGMP_00331']
N13.tsv
['']
N14.tsv
['']
... empty lists
['']
N20.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
'GFNMCGMP_01174' 'GFNMCGMP_00331']
N29.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
'GFNMCGMP_01174' 'GFNMCGMP_00331']
...
followed by empty lists
So N12.tsv, N20.tsv and N29.tsv show genes for MGYG-HGUT-04532, although none of these nodes is a descendant or an ancestor of N11. I tried other species and nodes, but the result is always the same. Maybe I'm misunderstanding how this works, so I'd appreciate any help. I'm attaching the tree file and a couple of Nx.tsv files.
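In case it helps to reproduce this, the check can be condensed into a short sketch that lists the nodes reporting genes for a species without being on its lineage. It uses only the standard library; `nodes_with_genes` and `unexpected_nodes` are hypothetical helper names, and it assumes each Nx.tsv file is tab-separated with one column per species, as in my files.

```python
import csv
import re
from pathlib import Path


def nodes_with_genes(directory, species):
    """Return the node names (e.g. 'N11') whose file lists genes for `species`."""
    hits = set()
    for path in Path(directory).glob("N*.tsv"):
        # Skip anything that is not named N<number>.tsv.
        if not re.fullmatch(r"N\d+", path.stem):
            continue
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # A node counts as a hit if any row has a non-empty cell.
            if any((row.get(species) or "").strip() for row in reader):
                hits.add(path.stem)
    return hits


def unexpected_nodes(hits, expected):
    """Nodes that list genes for the species but are not on its lineage."""
    return sorted(hits - set(expected), key=lambda name: int(name[1:]))
```

Calling `unexpected_nodes(nodes_with_genes(root, 'MGYG-HGUT-04532'), ['N11', 'N7', 'N3', 'N1', 'N0'])` should return an empty list if my expectation is right; any node it returns is one of the files I cannot explain.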
I'm running OrthoFinder 2.5.2.
Jose Luis