Node equivalence under annotation: examples
Suppose we have two similar taxonomic or phylogenetic trees (perhaps successive versions of OTT), each with its own set of nodes. Nodes are linked to other nodes by parent/child relations, and are provided with name, rank, various identifiers, and so on. From this it follows that no single node can belong to two trees; the containing tree is a property of the node.
We wouldn't make trees if not for the fact that they can be interpreted as saying something about biology. It is useful to relate trees to one another, because by doing so, we may be able to make biological inferences that are not made by either tree alone.
Although there are various ways to relate trees, a particularly important one is to say that a node in one is "equivalent" to a node in the other, i.e. they have the same or similar biological meaning. For example, a node labeled "Pan troglodytes" in one tree might be seen as equivalent to one with the same label in another tree. Labeling is not enough due to homonyms, but may be combined with other evidence to assess the possibility of equivalence.
Equivalence judgments are critical. A "bad" judgment of equivalence can lead to inappropriate annotation transfer (e.g. judging something extinct that is extant). But passing over an equivalence that would have been 'correct' can lead to valuable node annotations being lost, requiring rediscovery later.
An identifier system is a mechanism for assessing equivalence. Equivalent nodes get the same id (e.g. OTT id), and inequivalent ones get different ids.
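A minimal sketch of this idea, assuming a hypothetical `Node` record and made-up id values (not real OTT ids), might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; the field names are assumptions."""
    tree: str                        # the containing tree is a property of the node
    node_id: str                     # shared identifier (placeholder values below)
    name: str
    parent_id: Optional[str] = None  # parent/child relation, by id


def equivalent(a: Node, b: Node) -> bool:
    """Nodes from different trees are judged equivalent iff their ids match."""
    return a.tree != b.tree and a.node_id == b.node_id


chimp_old = Node(tree="T", node_id="id-1", name="Pan troglodytes")
chimp_new = Node(tree="T'", node_id="id-1", name="Pan troglodytes")
bonobo_new = Node(tree="T'", node_id="id-2", name="Pan paniscus")
```

Here all of the hard work is hidden in whatever process assigned `node_id` in the first place; the comparison itself is trivial.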
Here are some equivalence puzzles. The challenge is that we want to be able to use some kind of software inference engine to assess equivalence.
GBIF puts Bullacta ecarata in Decapoda, but it properly belongs in Mollusca (as you can tell by various lines of evidence). This is a mistake in the way GBIF was assembled, not a true homonym. When we correct this, the membership of both groups changes. Since this is a mistake that surely nobody depends on, it seems highly desirable that the Decapoda nodes in the two trees be considered equivalent, and similarly the Mollusca nodes.
Homonyms also create potential for errors that ought to be correctable without disturbing equivalence relations involving other nodes. For example, a node N in T that uniquely has name A in T could be considered equivalent to N' in T' that uniquely has name A in T', but consideration of relationships (names of other nodes), perhaps in conjunction with literature or the application of common sense, may determine later that N and N' actually shouldn't be equivalent (i.e. annotations shouldn't transfer). Such a correction ideally shouldn't lead to "too much" annotation loss.
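The "uniquely has name A" rule can be sketched as follows. This is only an illustration: the function name, the id strings, and the dictionary representation of a tree (node id mapped to name) are assumptions, and the result is a list of candidates that other evidence may later overturn.

```python
from collections import Counter


def unique_name_candidates(names_t, names_t2):
    """Pair nodes whose name occurs exactly once in each tree.

    Both arguments map node id -> name. Names that occur more than once
    in either tree (homonyms) are left unpaired for other evidence to settle.
    """
    count_t = Counter(names_t.values())
    count_t2 = Counter(names_t2.values())
    # Index T' by name, keeping only names that are unique within T'.
    by_name_t2 = {name: nid for nid, name in names_t2.items()
                  if count_t2[name] == 1}
    pairs = {}
    for nid, name in names_t.items():
        if count_t[name] == 1 and name in by_name_t2:
            pairs[nid] = by_name_t2[name]
    return pairs


# Aotus is a homonym in T (a monkey genus and a plant genus share the name).
tree_t = {"n1": "Pan troglodytes", "n2": "Aotus", "n3": "Aotus"}
tree_t2 = {"m1": "Pan troglodytes", "m2": "Aotus"}
candidates = unique_name_candidates(tree_t, tree_t2)
```

Retracting a single candidate later (say, removing one entry from `candidates`) leaves every other pairing untouched, which is the property asked for above.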
Pheidole is a huge genus, and species are added to it all the time. Presumably the name is associated (in the literature, or in biologists' minds) with an apomorphy; however, the apomorphy and the character data needed to make use of it are not available to the engine. Suppose we have a tree T with a node labeled Pheidole but none for the new species Pheidole x. We make a new tree T' similar to T, but with a node for Pheidole x descended from its Pheidole node. It is desirable to judge the Pheidole node in T equivalent to the one in T', even though the two Pheidole nodes have different descendant sets. This is because any biological annotation of the one node will apply equally (i.e. be interpreted the same way) to the other node. Important: a new species can "become" a member of Pheidole even if it is the sister to all other Pheidole; this is because it's highly desirable for 'Pheidole' (as a node label) to be defined by apomorphy, not by appeal to the details of T or T'.
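To make the contrast concrete, here is a rough sketch that assumes (hypothetically) the engine sees each node only as the set of species below it. A strict comparison of descendant sets rejects the equivalence, while a rule that tolerates pure growth accepts it. Both function names are invented for illustration, and the subset test is only a stand-in for the apomorphy the engine cannot see.

```python
def strict_equivalent(desc_old, desc_new):
    """Equivalent only if the descendant sets are identical."""
    return desc_old == desc_new


def growth_tolerant_equivalent(desc_old, desc_new):
    """Equivalent if nothing was lost, i.e. the old set is a subset of the new."""
    return desc_old <= desc_new


# Descendants of the Pheidole node in T and in T' (tiny illustrative subsets).
pheidole_t = frozenset({"Pheidole megacephala", "Pheidole pallidula"})
pheidole_t2 = pheidole_t | {"Pheidole x"}
```

Note that the growth-tolerant rule is not symmetric: it accepts T to T' but not the reverse, which is why deletion and lumping need to be considered separately.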
"Splitting" an existing species is similar.
A node is often deleted because it corresponds to an invalid name or because the associated description is so poor that the name cannot be used for alpha taxonomy. Merging or "lumping" for biological or clerical reasons is similar. These processes are the reverse of the monotonic growth case.
Another common case is renaming. This can appear to the engine like a deletion followed by an addition and, again, shouldn't be enough reason to invalidate existing annotations.
Cicindelidae and Carabidae: Earlier taxonomies have these as siblings, while more modern taxonomies rename Cicindelidae to Cicindelinae and put it as a subfamily of Carabidae. Again, is Carabidae without Cicindelidae "different" from Carabidae with Cicindelidae from the point of view of annotations? This case is different from the Isoptera case (described below), because Cicindelidae becomes a child of Carabidae, as opposed to being deeply embedded in it. That means that the topologies are consistent in the sense that you could have a taxonomy with sisters Cic. and Carabidae-without-Cic. and parent Carabidae-with-Cic. - i.e. both versions of Carabidae could be monophyletic according to a single tree. Yet it's still possible that most or all annotations attached to a 'Carabidae' node might be true of either Carabidae.
Certainly annotations on the parent of Carabidae should be unaffected by what is mere rearrangement of its descendants.
Semionotiformes and Lepisosteiformes - similar to Carabidae. There seem to be two Semionotiformes, one that contains Lepisosteiformes and one that doesn't. In this case we have an annotation that's true of one, but because the two versions were mistakenly considered the same, the annotation incorrectly got applied to the other (issue).
Rozella, Microsporidia, Fungi - classifications differ as to whether Rozella and Microsporidia fall under Fungi. But is this because of arguments over whether the organisms satisfy some apomorphy associated with the name Fungi, or is it because different authors want to use 'Fungi' in different ways? There is no practical way for the engine to know this. Either decision - changing the claimed membership of Fungi without changing its identity, or splitting the 'identity' of 'Fungi' so that there are 'Fungi with Microsporidia' and 'Fungi without Microsporidia' - could have widespread ramifications with respect to annotations, but on the other hand, there are many annotations that are insensitive to these distinctions.
Isoptera and Blattodea: Classical taxonomies usually put these as sibling orders, but phylogenetic progress shows that Isoptera is actually buried deep in Blattodea. That is, T1 has its Isoptera node not under the Blattodea node, while T2 has its Isoptera node under its Blattodea node. The trees seem to make contradictory biological claims. If T1 is "current" and I make an annotation about Blattodea, which gets attached to Blattodea in T1, should that annotation transfer to Blattodea in T2? Common practice is that we say we were talking about cockroaches (which have the 'cockroach apomorphy') in both T1 and T2, so annotations transfer, but that we were "wrong" about whether termites were cockroaches (whether they possess that same 'cockroach apomorphy'), and we're just correcting our mistake. If the annotation depended on that mistake, well tough luck, since it was wrong anyhow. ... But this really depends on what our expectations were on associating the annotation with T1's Blattodea node, and what commitments we thought we were getting from our annotation engine.
Eukaryota and Archaea: similar to the Isoptera case. Is Archaea without Eukaryota different from Archaea with Eukaryota? Or is there no such thing as Archaea without Eukaryota, because we were mistaken about the assignment of members to that group, and/or because it would be paraphyletic?
Another example is Aves and Saurischia.
Sometimes a species or group could appear to "move" from inside a child of some node R to inside another child of R. Sometimes we would like to discard annotations on the two children, but perhaps the children of R have "enough identity" - think apomorphy, or phylocode-like definitions - to "survive" the departure and arrival of the group.
- Hyacinthoides hispanica in Liliopsida moving from Liliales to Asparagales (single species moving order to order)
- Helminthora stricta in Nemaliales moving from Helminthocladiaceae to Liagoraceae (from one family to adjacent family)
The risk is that all annotations on higher groupings would be lost if the loss (or gain) of the 'moving' species caused inequivalence of those higher groupings.
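One way an engine might notice such moves, rather than silently declaring the affected groups inequivalent, is sketched below. The representation (a child-to-parent dictionary keyed by name) and the flattened example data are assumptions made for illustration; real trees have intermediate ranks between a species and its order.

```python
def moved(parent_old, parent_new):
    """Report nodes present in both trees whose parent differs.

    Both arguments map a node name to its parent's name.
    Returns {node: (old_parent, new_parent)}.
    """
    return {
        node: (parent_old[node], parent_new[node])
        for node in parent_old
        if node in parent_new and parent_old[node] != parent_new[node]
    }


# Simplified: species are attached directly to orders here.
parents_t = {
    "Liliales": "Liliopsida",
    "Asparagales": "Liliopsida",
    "Lilium candidum": "Liliales",
    "Asparagus officinalis": "Asparagales",
    "Hyacinthoides hispanica": "Liliales",
}
parents_t2 = dict(parents_t, **{"Hyacinthoides hispanica": "Asparagales"})

moves = moved(parents_t, parents_t2)
```

A report like `moves` could be used to flag Liliales and Asparagales for review while leaving annotations on Liliopsida and above alone, since only the arrangement below them changed.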
For more examples, we have twenty or so instructions from Open Tree curators saying "X belongs in Y" (as opposed to where it was when they looked at the tree), e.g. Quiscalus belongs in Icteridae, not Fringillidae.
Equivalence may not be all or nothing; it depends on what kind of annotation we're talking about. We could use partial equivalence, also known as versioning, to distinguish different degrees of node similarity. Perhaps some annotations could be automatically transferred from 14.6 to 14.7, while others couldn't, or would require manual review before being transferred.
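As a sketch of what "some annotations transfer, others need review" could mean in practice, suppose each kind of annotation declares which kinds of node change it tolerates. The change labels, the annotation kinds, and the policy table below are all hypothetical; nothing here reflects an actual Open Tree policy.

```python
from enum import Enum


class Decision(Enum):
    TRANSFER = "transfer"   # carry the annotation over automatically
    REVIEW = "review"       # hold it for manual review


# Hypothetical policy: changes each annotation kind is assumed to tolerate.
TOLERATES = {
    "otu_mapping": {"renamed", "member_added"},
    "extinct": {"renamed"},
}


def decide(annotation_kind, changes):
    """Transfer only if every observed change is tolerated by this kind."""
    tolerated = TOLERATES.get(annotation_kind, set())
    if set(changes) <= tolerated:
        return Decision.TRANSFER
    return Decision.REVIEW
```

Under this made-up table, `decide("otu_mapping", {"member_added"})` transfers, while `decide("extinct", {"member_added"})` is held for review, because a newly added member might be extant.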
I think the deeper question is: When an annotator (e.g. someone providing an OTU mapping, or extinctness determination) is providing information, what are they offered by way of explanation of how the annotation will be transferred in the future? If they are given a rule, such as "your annotation will be applied to any node with the following properties: ..." or "your annotation will be applied according to the following algorithm", then they can judge for themselves whether the annotation they are about to provide is one they are going to stand by. That is - rather than looking at when annotations should transfer or not, decide ahead of time on the rule for annotation transfer, and allow that to control the application of annotations in the first place. The details of the rule then matter less, and the rule may be chosen as the right compromise between human comprehension and automation.
A curator might be presented with a menu of 'identifiers', each with its own set of annotation transfer conditions. They could look to see which is the minimal set that enables application of their annotation and choose that for annotation deposit purposes. If none is appropriate, they could make their own, and possibly register it for use by others.
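The menu idea could be sketched like this. The identifier names and condition labels are invented for illustration, and "minimal" is read here as "fewest conditions that still cover what the curator requires"; other readings are possible.

```python
# Hypothetical menu: each identifier kind promises that an annotation is
# applied only to nodes satisfying all of the listed conditions.
MENU = {
    "by-name": frozenset({"same_name"}),
    "by-name-and-parent": frozenset({"same_name", "same_parent"}),
    "by-membership": frozenset({"same_name", "same_parent", "same_members"}),
}


def choose_identifier(required, menu=MENU):
    """Pick the identifier with the fewest conditions covering `required`.

    Returns None when no entry is suitable, in which case the curator
    would define (and perhaps register) a new condition set.
    """
    suitable = [(len(conds), key) for key, conds in menu.items()
                if required <= conds]
    if not suitable:
        return None
    # Smallest condition set wins; ties are broken alphabetically by key.
    return min(suitable)[1]
```

For example, a curator whose annotation only makes sense while the node keeps its parent would call `choose_identifier({"same_parent"})` and be offered `"by-name-and-parent"`.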