Indicate a missing word due to typo #801
Comments
This gets us pretty close to the idea of "target hypotheses" in learner corpora... I wanted to point out that while this case might be straightforward, the more non-standard the data gets, the less agreement we will have on what a 'correct' version of the sentence should look like. There is a lot of literature on this in L2 studies, and I believe it turns out to be pretty complicated!
Yeah, learner corpora (like https://github.com/UniversalDependencies/UD_English-ESL) will be a special case. I am not proposing to require all treebanks to correct for missing words, I am just wondering if there is a way for it to be indicated where a grammatically necessary word is clearly missing. We already have ways to indicate extra words due to speech repair, and ways to indicate misspelled words, so it seems natural to support some sort of encoding for missing words even if they would be leaves in the tree.
One of the old slogans of UD was “Don't annotate things that are not there”, and even the enhanced UD guidelines, if followed strictly (including the final sentence that other enhancements should not appear in the released UD treebanks), don't provide a means of capturing this. But I agree that some symmetry between being able to mark extra words and being able to show (clearly grammatically required) missing words would be nice. A MISC attribute at the nearest available ancestor in the tree would certainly not violate any guideline; it just can be a bit surprising for someone trying to interpret the
A final note: I am thinking here about how the missing words could be encoded, but I am not sure how to delimit the extent to which such an encoding should be used. That is, where exactly does the border of “clearly grammatically required” lie?
An empty node would be most natural (because we could assign a lemma, features, etc.) as long as that doesn't interfere somehow with other expectations for the enhanced graph.
I think this should be left to the judgment of treebank creators, as there will inevitably be a gray area. My own preference is just to insert words that look like simple accidental omissions: not to edit intentionally terse phrasing (e.g. headlinese), and not to try to fully error-correct nonnative language.
Supposing we add an empty node for the omitted word "will", what should its form be? Blank, since it doesn't appear at all (even as a copy) in the sentence? In that case we could specify
I think I would also like more discretion on empty nodes, as this could be a nice way to encode target hypotheses for non-native data, which often differ in just a few words. But I agree it would be nice to be able to tell that the words are such 'corrections' and not gapping ellipses etc. So maybe enhanced nodes plus a special feature indicating that?
A MISC attribute to distinguish various types of empty/abstract nodes.
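To make the empty-node idea concrete, here is a minimal sketch of what such an encoding could look like in CoNLL-U and how a consumer might tell omission nodes apart from other empty nodes. The attribute name `EmptyType` and its value `Omission` are made up for illustration; nothing in the guidelines defines them. Leaving FORM as `_` while filling in LEMMA is also just one possible answer to the question about the form, not a settled convention.

```python
# Illustrative CoNLL-U for "Soon I have time to finish it" with an empty
# node 2.1 standing in for the omitted "will".
# ASSUMPTIONS: the MISC key "EmptyType" and the value "Omission" are
# hypothetical; FORM is left as "_" and only LEMMA is given.
SENT = """\
# text = Soon I have time to finish it
1\tSoon\tsoon\tADV\t_\t_\t3\tadvmod\t3:advmod\t_
2\tI\tI\tPRON\t_\t_\t3\tnsubj\t3:nsubj\t_
2.1\t_\twill\tAUX\t_\t_\t_\t_\t3:aux\tEmptyType=Omission
3\thave\thave\tVERB\t_\t_\t0\troot\t0:root\t_
4\ttime\ttime\tNOUN\t_\t_\t3\tobj\t3:obj\t_
5\tto\tto\tPART\t_\t_\t6\tmark\t6:mark\t_
6\tfinish\tfinish\tVERB\t_\t_\t4\tacl\t4:acl\t_
7\tit\tit\tPRON\t_\t_\t6\tobj\t6:obj\t_
"""


def parse_misc_field(misc):
    """Turn a MISC column such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(pair.split("=", 1) for pair in misc.split("|") if "=" in pair)


def empty_nodes(conllu):
    """Return (id, lemma, hypothetical EmptyType) for every empty node."""
    found = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if "." in cols[0]:  # empty nodes have decimal IDs such as 2.1
            node_type = parse_misc_field(cols[9]).get("EmptyType")
            found.append((cols[0], cols[2], node_type))
    return found


print(empty_nodes(SENT))
```

With a marker like this, a tool that only cares about gapping could skip nodes typed as omissions, and a tool building a corrected reading could pick out exactly those nodes.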
Is there a way to encode a missing word that is not a predicate, so that the normalized version of the sentence could be constructed?
For a sentence like "Soon I have time to finish it", we infer that there should be a "will" before "have".
The missing word policy refers to treating them like ellipsis, and the ellipsis policy says that nothing is done for words with no dependents. So does this mean an empty node should not be created for "will" in the enhanced graph, since it is a function word with no dependents? If it can't be incorporated into the graph, could it at least be included as a feature, maybe CorrectForm=will have for "have"?
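As a rough sketch of how the feature-based alternative could be consumed, the snippet below rebuilds the normalized sentence from a MISC attribute on "have". `CorrectForm` is the attribute name floated in this issue, not an established UD attribute, and the annotation shown is illustrative rather than taken from any released treebank.

```python
# Illustrative CoNLL-U where the omission is recorded only in MISC.
# ASSUMPTION: "CorrectForm" is the attribute proposed in this issue,
# not an officially documented UD MISC attribute.
CONLLU = """\
# text = Soon I have time to finish it
1\tSoon\tsoon\tADV\t_\t_\t3\tadvmod\t_\t_
2\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_
3\thave\thave\tVERB\t_\t_\t0\troot\t_\tCorrectForm=will have
4\ttime\ttime\tNOUN\t_\t_\t3\tobj\t_\t_
5\tto\tto\tPART\t_\t_\t6\tmark\t_\t_
6\tfinish\tfinish\tVERB\t_\t_\t4\tacl\t_\t_
7\tit\tit\tPRON\t_\t_\t6\tobj\t_\t_
"""


def parse_misc(misc):
    """Turn a MISC column such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def normalized_text(conllu):
    """Join word forms, preferring the hypothetical CorrectForm when present."""
    words = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword token ranges and empty nodes
        words.append(parse_misc(misc).get("CorrectForm", form))
    return " ".join(words)


print(normalized_text(CONLLU))  # Soon I will have time to finish it
```

This keeps the basic tree untouched, at the cost that the missing word gets no lemma, tag, or relation of its own. The sketch also joins everything with single spaces and ignores SpaceAfter=No.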