Non-exact equivalences #159

Open
bgyori opened this issue Dec 9, 2020 · 3 comments

@bgyori
Member

bgyori commented Dec 9, 2020

I found a large number of equivalences in equivalences.csv that are not exact matches, for instance among the InterPro mappings. As an example, take FPLX:Hedgehog, which is mapped to 6 different InterPro entries.
One that looks exact is https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001657/ (Hedgehog protein), but the others include, e.g., https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001767/ (Hedgehog protein, Hint domain) and https://www.ebi.ac.uk/interpro/entry/InterPro/IPR003586/ (Hint domain C-terminal), which I don't think should be considered equivalences. I suspect that these might have been added with the goal of getting as many IP->FPLX mappings as possible from sources that produce various groundings in InterPro. Still, they are misleading if interpreted in the opposite direction.

@johnbachman
Member

Interesting. In the case of InterPro mappings, most (all?) of them were automatically generated using the script import/interpro_mappings.py, which looked at the gene-level members and created a mapping only if the member sets were an exact match between FamPlex and InterPro. In the code:

if jacc >= jaccard_cutoff:

(when the mappings were added we used the default Jaccard index threshold of 1).
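To make the member-based criterion concrete, here is a minimal sketch of a Jaccard check with a cutoff of 1. It is not the code from import/interpro_mappings.py; the function names and the example gene sets are assumptions chosen for illustration.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treat as no overlap rather than dividing by zero.
        return 0.0
    return len(a & b) / len(union)


def is_mapping_candidate(famplex_genes, interpro_genes, jaccard_cutoff=1.0):
    """Hypothetical helper: accept a mapping only at or above the cutoff."""
    return jaccard_index(famplex_genes, interpro_genes) >= jaccard_cutoff


# Illustrative gene sets (assumed, not read from FamPlex or InterPro).
hedgehog_family = {"SHH", "IHH", "DHH"}
hint_domain_members = {"SHH", "IHH", "DHH"}

# Identical member sets pass a cutoff of 1 even though one entry is a
# family and the other is a domain.
print(is_mapping_candidate(hedgehog_family, hint_domain_members))
```

With a cutoff of 1, the check only tests whether the two member sets are identical, so it cannot tell a family entry from a domain entry that happens to occur in exactly the same proteins.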

@bgyori
Member Author

bgyori commented Dec 9, 2020

I see, that makes sense: the same set of Hedgehog proteins could all have a "Hint domain C-terminal". Still, semantically that probably shouldn't be curated as an equivalence. What if we differentiated family and domain entries in InterPro and only added family equivalences?

@johnbachman
Copy link
Member

That would definitely make sense if we could get that information systematically.
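One way the family-versus-domain filter could be sketched is shown below. It assumes a tab-separated table with ENTRY_AC, ENTRY_TYPE and ENTRY_NAME columns is available for InterPro entries, and that equivalences are rows of (namespace, ID, FamPlex ID). The column names, the row layout, the entry types in the sample data, and both function names are assumptions for illustration, not something confirmed in this thread.

```python
import csv
import io

# Illustrative stand-in for an InterPro entry table (layout and types assumed).
SAMPLE_ENTRY_LIST = (
    "ENTRY_AC\tENTRY_TYPE\tENTRY_NAME\n"
    "IPR001657\tFamily\tHedgehog protein\n"
    "IPR001767\tDomain\tHedgehog protein, Hint domain\n"
    "IPR003586\tDomain\tHint domain C-terminal\n"
)

# Illustrative equivalence rows: (namespace, external ID, FamPlex ID).
SAMPLE_EQUIVALENCES = [
    ("IP", "IPR001657", "Hedgehog"),
    ("IP", "IPR001767", "Hedgehog"),
    ("IP", "IPR003586", "Hedgehog"),
]


def load_entry_types(handle):
    """Map each InterPro accession to its entry type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ENTRY_AC"]: row["ENTRY_TYPE"] for row in reader}


def filter_family_equivalences(equivalences, entry_types,
                               namespace="IP", allowed_types=("Family",)):
    """Drop rows in `namespace` whose entry type is not allowed.

    Rows from other namespaces are kept unchanged; InterPro rows with an
    unknown accession are dropped, since their type cannot be verified.
    """
    kept = []
    for ns, ext_id, fplx_id in equivalences:
        if ns == namespace and entry_types.get(ext_id) not in allowed_types:
            continue
        kept.append((ns, ext_id, fplx_id))
    return kept


entry_types = load_entry_types(io.StringIO(SAMPLE_ENTRY_LIST))
for row in filter_family_equivalences(SAMPLE_EQUIVALENCES, entry_types):
    print(row)
```

Under these assumptions, only the IPR001657 row would remain for FPLX:Hedgehog, while the two Hint domain rows would be filtered out.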
