Genes from proteome/species not descendant of Nx.tsv are present in Nx.tsv.  #602

Closed
@matrs

Description

Hello,
I'm trying to define single-copy orthogroups from the Nx.tsv files. I'm getting results that I find confusing, so I wrote a couple of lines to check whether a given Nx.tsv contains only genes from the species that descend from that node, which is what I expect. Take N11.tsv, for example: in the species tree, this node has two descendant species:

['MGYG-HGUT-04532',
 'DGYMR06203__metabat2_low_PE']
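
For reference, here is a minimal sketch of how the descendant species and the ancestors of a node can be read from a Newick tree with labelled internal nodes, such as the attached SpeciesTree_rooted_node_labels.txt. The toy tree, the species names and the helper functions `parse_newick`, `ancestors` and `descendant_leaves` are all made up for illustration. This is not the tree from this issue and not necessarily how the lists here were produced.

```python
# Toy species tree with internal node labels. The topology and names are
# hypothetical and are NOT the tree attached to this issue.
NEWICK = "((spA,spB)N1,(spC,(spD,spE)N3)N2)N0;"


def parse_newick(text):
    """Return {node_name: parent_name}; the root maps to None.

    Minimal parser: handles nested parentheses, node labels and optional
    ':length' suffixes. Quoted labels and comments are not supported.
    """
    text = text.strip().rstrip(";")
    parent = {}
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip '('
            children.append(read_node())
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(read_node())
            pos += 1  # skip ')'
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name = text[start:pos].split(":")[0]  # drop optional branch length
        for child in children:
            parent[child] = name
        return name

    root = read_node()
    parent[root] = None
    return parent


def ancestors(parent, name):
    """Names on the path from `name` (exclusive) up to the root."""
    path = []
    node = parent[name]
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def descendant_leaves(parent, name):
    """Sorted leaf names below `name`; a leaf is a node that is nobody's parent."""
    internal = set(parent.values())
    leaves = [n for n in parent if n not in internal]
    return sorted(leaf for leaf in leaves if name in ancestors(parent, leaf))


parent = parse_newick(NEWICK)
print(descendant_leaves(parent, "N3"))  # ['spD', 'spE']
print(ancestors(parent, "spD"))         # ['N3', 'N2', 'N0']
```

Under the expectation described in this issue, the genes of a species would appear only in the Nx.tsv files of the nodes returned by `ancestors` for that species.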

Then I loop over all the Nx.tsv files and check the MGYG-HGUT-04532 column in each one. I expect to find genes only in N11.tsv and in the files for its ancestors:

[Tree node 'N7' (0x7f514471e49),
 Tree node 'N3' (0x7f5147961be),
 Tree node 'N1' (0x7f51478373a),
 Tree node 'N0' (0x7f514471e46)]

import pandas as pd

# `root` is assumed to be a pathlib.Path to the directory holding the Nx.tsv files
nodes = [f'N{n}.tsv' for n in range(194)]
for n in nodes:
    n_df = pd.read_csv(root.joinpath(n), sep='\t', na_filter=False)
    print(n, n_df.loc[:, 'MGYG-HGUT-04532'].unique(), sep='\n')

Which produces:

N0.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N1.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N2.tsv
['']
N3.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N4.tsv
['']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_00320'
 'GFNMCGMP_00321' 'GFNMCGMP_00381, GFNMCGMP_00380']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00750, GFNMCGMP_00293' 'GFNMCGMP_00570'
 'GFNMCGMP_01197, GFNMCGMP_00667' 'GFNMCGMP_00341'
...]
N12.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N13.tsv
['']
N14.tsv
['']
... empty lists
['']
N20.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N29.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
...
followed by empty lists

So N12.tsv, N20.tsv and N29.tsv show genes for MGYG-HGUT-04532, although none of these nodes is a descendant or an ancestor of N11. I tried other species and nodes, and the result is always the same. Maybe I'm misunderstanding how this works, so I'd appreciate any help. I'm attaching the tree file and a couple of the Nx.tsv files.
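
To make the check reproducible, here is a rough sketch that flags every species column holding genes in a node's table although the species is not under that node. The tables, gene names and the `DESCENDANTS` mapping are hypothetical toy data, and `unexpected_species` is an illustrative helper. I'm also assuming that the non-species columns are named HOG, OG and Gene Tree Parent Clade.

```python
import csv
import io

# Hypothetical miniature tables keyed by node name. Real Nx.tsv files have
# more columns and rows; only the species columns matter for this check.
TABLES = {
    "N3": "HOG\tspC\tspD\tspE\nN3.HOG0000000\t\tgD1\tgE1, gE2\n",
    "N1": "HOG\tspA\tspB\tspD\nN1.HOG0000000\tgA1\tgB1\tgD7\n",
}

# Species expected under each node, e.g. derived from the species tree.
# Toy values for illustration only.
DESCENDANTS = {"N3": {"spD", "spE"}, "N1": {"spA", "spB"}}

# Assumed names of the columns that do not hold species genes.
NON_SPECIES_COLUMNS = {"HOG", "OG", "Gene Tree Parent Clade"}


def unexpected_species(node, tsv_text, descendants):
    """Return species columns that hold genes although the species is not under `node`."""
    found = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for column, cell in row.items():
            # Skip metadata columns and empty cells.
            if column in NON_SPECIES_COLUMNS or not (cell or "").strip():
                continue
            if column not in descendants[node]:
                found.add(column)
    return sorted(found)


for node, text in TABLES.items():
    print(node, unexpected_species(node, text, DESCENDANTS))
```

In this toy data, N3 comes back clean and N1 is flagged for spD, which is the same kind of mismatch I see for N12, N20 and N29.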

I'm running OrthoFinder 2.5.2.

Jose Luis

SpeciesTree_rooted_node_labels.txt

Ns.zip
