Indicate a missing word due to typo #801

Open
nschneid opened this issue Jul 29, 2021 · 7 comments

Comments

@nschneid
Contributor

Is there a way to encode a missing word that is not a predicate, so that the normalized version of the sentence could be constructed?

For a sentence like "Soon I have time to finish it", we infer that there should be a "will" before "have".

The missing word policy refers to treating them like ellipsis, and the ellipsis policy says that nothing is done for words with no dependents. So does this mean an empty node should not be created for "will" in the enhanced graph, since it is a function word with no dependents? If it can't be incorporated into the graph, could it at least be included as a feature—maybe CorrectForm=will have for "have"?
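To make the feature-based fallback concrete, here is a minimal Python sketch. It assumes `CorrectForm=will have` is stored in the MISC column of "have", which is only the encoding floated above, not an established convention. The rows are simplified to (ID, FORM, MISC), and the helper names `parse_misc` and `normalized_sentence` are hypothetical, not part of any UD tool.

```python
# Simplified rows: (ID, FORM, MISC). A real CoNLL-U line has 10 tab-separated columns.
ROWS = [
    ("1", "Soon", "_"),
    ("2", "I", "_"),
    ("3", "have", "CorrectForm=will have"),  # proposed encoding, an assumption
    ("4", "time", "_"),
    ("5", "to", "_"),
    ("6", "finish", "_"),
    ("7", "it", "_"),
]


def parse_misc(misc):
    """Turn a MISC string such as 'A=1|B=2' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def normalized_sentence(rows):
    """Use CorrectForm where present, otherwise the surface form."""
    words = [parse_misc(misc).get("CorrectForm", form) for _, form, misc in rows]
    return " ".join(words)


print(normalized_sentence(ROWS))  # Soon I will have time to finish it
```

Under this encoding the normalized string can be rebuilt, but the value holds two words, so a consumer cannot recover "will" as a separate token with its own lemma and features without re-tokenizing.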

@amir-zeldes
Contributor

This gets us pretty close to the idea of "target hypotheses" in learner corpora... I wanted to point out that while this case might be straightforward, the more non-standard the data gets, the less agreement we will have on what a 'correct' version of the sentence should look like. There is a lot of literature on this in L2 studies, and I believe it turns out to be pretty complicated!

@nschneid
Contributor Author

Yeah, learner corpora (like https://github.com/UniversalDependencies/UD_English-ESL) will be a special case. I am not proposing to require all treebanks to correct for missing words; I am just wondering if there is a way to indicate where a grammatically necessary word is clearly missing.

We already have ways to indicate extra words due to speech repair, and ways to indicate misspelled words, so it seems natural to support some sort of encoding for missing words even if they would be leaves in the tree.

@dan-zeman
Member

One of the old slogans of UD was “Don't annotate things that are not there”, and even the enhanced UD guidelines, if followed strictly (including the final sentence that other enhancements should not appear in the released UD treebanks), don't provide means for capturing this.

But I agree that some symmetry between being able to mark extra words and being able to show (clearly grammatically required) missing words would be nice. A MISC attribute at the nearest available ancestor in the tree would certainly not violate any guideline; it just may be a bit surprising for someone interpreting the CorrectForm attributes to find that "will have" is suddenly the correct form of "have", while in the correct sentence "will" would be a separate token. An empty node in the enhanced graph seems appealing, too. It is not supported in the guidelines, but I think that some treebanks already have occasional non-predicate empty nodes (or even leaf empty nodes). As a matter of fact, the validator currently does not report them as an error. (It could report them if there is demand for it.)

A final note: I am thinking here about the means of possibly encoding the missing words, but I am not sure I know how to delimit the acceptable extent to which the means should be used. That is, where exactly lies the border of “clearly grammatically required”?
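As an illustration of what reporting leaf empty nodes could involve, here is a minimal Python sketch. It is not the UD validator's code; the function name `leaf_empty_nodes` and the sample annotation are assumptions made for this example. It relies only on two CoNLL-U conventions: empty nodes have decimal IDs such as `2.1`, and the DEPS column lists enhanced dependencies as `head:deprel` pairs separated by `|`.

```python
# Illustrative annotation of "Soon I [will] have time to finish it";
# the empty node 2.1 is the hypothetical inserted auxiliary.
SAMPLE = "\n".join([
    "1\tSoon\tsoon\tADV\t_\t_\t3\tadvmod\t3:advmod\t_",
    "2\tI\tI\tPRON\t_\t_\t3\tnsubj\t3:nsubj\t_",
    "2.1\twill\twill\tAUX\t_\t_\t_\t_\t3:aux\t_",
    "3\thave\thave\tVERB\t_\t_\t0\troot\t0:root\t_",
    "4\ttime\ttime\tNOUN\t_\t_\t3\tobj\t3:obj\t_",
    "5\tto\tto\tPART\t_\t_\t6\tmark\t6:mark\t_",
    "6\tfinish\tfinish\tVERB\t_\t_\t4\tacl\t4:acl\t_",
    "7\tit\tit\tPRON\t_\t_\t6\tobj\t6:obj\t_",
])


def leaf_empty_nodes(conllu):
    """Return IDs of empty nodes that head no enhanced dependency."""
    empty_ids, heads = [], set()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if "." in cols[0]:
            empty_ids.append(cols[0])
        if cols[8] != "_":
            # Relations may contain colons (e.g. nmod:poss), so split only once.
            heads.update(dep.split(":", 1)[0] for dep in cols[8].split("|"))
    return [node_id for node_id in empty_ids if node_id not in heads]


print(leaf_empty_nodes(SAMPLE))  # ['2.1']
```

Whether such nodes should be flagged as errors or merely listed is the policy question raised above; the sketch only finds them.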

@nschneid
Contributor Author

nschneid commented Aug 6, 2021

An empty node would be most natural (because we could assign a lemma, features, etc.) as long as that doesn't interfere somehow with other expectations for the enhanced graph.

> I am not sure I know how to delimit the acceptable extent to which the means should be used. That is, where exactly lies the border of “clearly grammatically required”?

I think this should be left to the judgment of treebank creators as there will inevitably be gray area. My own preference is just to insert words that look like simple accidental omissions—not to edit intentionally terse phrasing (e.g. headlinese), and not to try to fully error-correct nonnative language.

@nschneid
Contributor Author

nschneid commented Aug 6, 2021

Supposing we add an empty node for the omitted word "will", what should its form be? Blank, since it doesn't appear at all (even as a copy) in the sentence? In that case we could specify Typo=Yes and CorrectForm=will to signal that it's a typo-correction insertion.
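A minimal Python sketch of that option, for illustration only: the empty node `2.1` has a blank FORM (`_`), `Typo=Yes` in FEATS, and `CorrectForm=will` in MISC. This combination on an empty node is the proposal under discussion, not something the guidelines currently define, and the helper names `parse_kv` and `normalized_sentence` are hypothetical.

```python
SENTENCE = "\n".join([
    "1\tSoon\tsoon\tADV\t_\t_\t3\tadvmod\t3:advmod\t_",
    "2\tI\tI\tPRON\t_\t_\t3\tnsubj\t3:nsubj\t_",
    # Proposed encoding (an assumption): blank form, Typo=Yes, CorrectForm in MISC.
    "2.1\t_\twill\tAUX\t_\tTypo=Yes\t_\t_\t3:aux\tCorrectForm=will",
    "3\thave\thave\tVERB\t_\t_\t0\troot\t0:root\t_",
    "4\ttime\ttime\tNOUN\t_\t_\t3\tobj\t3:obj\t_",
    "5\tto\tto\tPART\t_\t_\t6\tmark\t6:mark\t_",
    "6\tfinish\tfinish\tVERB\t_\t_\t4\tacl\t4:acl\t_",
    "7\tit\tit\tPRON\t_\t_\t6\tobj\t6:obj\t_",
])


def parse_kv(column):
    """Parse a FEATS or MISC column ('A=1|B=2') into a dict; '_' means empty."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def normalized_sentence(conllu):
    """Rebuild the sentence, inserting only empty nodes marked as typo corrections."""
    words = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        node_id, form = cols[0], cols[1]
        feats, misc = parse_kv(cols[5]), parse_kv(cols[9])
        if "-" in node_id:
            continue  # multiword token range; its syntactic words follow
        if "." in node_id:
            # Other empty nodes (e.g. gapping) are left out of the surface string.
            if feats.get("Typo") == "Yes" and "CorrectForm" in misc:
                words.append(misc["CorrectForm"])
            continue
        words.append(form)
    return " ".join(words)


print(normalized_sentence(SENTENCE))  # Soon I will have time to finish it
```

Because file order already places `2.1` between tokens 2 and 3, no extra position information is needed to place the inserted word.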

@amir-zeldes
Contributor

I think I would also like more discretion on empty nodes, as this could be a nice way to encode target hypotheses for non-native data, which often differ in just a few words. But I agree it would be nice to be able to tell that the words are such 'corrections' and not gapping ellipses etc. So maybe enhanced nodes plus a special feature indicating that?

@dan-zeman
Member

> plus a special feature indicating that

A MISC attribute to distinguish various types of empty/abstract nodes.
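A rough Python sketch of how a consumer might use such an attribute. The key name `EmptyType` and its values are invented for this example; the thread does not settle on a name, and the node IDs are made up.

```python
from collections import defaultdict

# (ID, MISC) pairs for empty nodes; the "EmptyType" key is hypothetical.
EMPTY_NODES = [
    ("2.1", "EmptyType=Correction|CorrectForm=will"),
    ("5.1", "EmptyType=Ellipsis"),
    ("8.1", "_"),
]


def group_by_type(nodes, key="EmptyType", default="Unspecified"):
    """Group empty-node IDs by the value of a MISC attribute."""
    groups = defaultdict(list)
    for node_id, misc in nodes:
        attrs = {}
        if misc != "_":
            attrs = dict(p.split("=", 1) for p in misc.split("|") if "=" in p)
        groups[attrs.get(key, default)].append(node_id)
    return dict(groups)


print(group_by_type(EMPTY_NODES))
```

With a distinction like this, a tool could insert only the correction nodes when rebuilding a normalized sentence and leave ellipsis nodes alone.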

@dan-zeman dan-zeman modified the milestones: v2.11, v2.13 May 29, 2023
@dan-zeman dan-zeman modified the milestones: v2.13, v2.14 Nov 15, 2023
@dan-zeman dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
@dan-zeman dan-zeman modified the milestones: v2.15, v2.16 Nov 16, 2024
3 participants