Inconsistent lemmatization of English punctuation #997
Comments
Agreed, thanks - I would collapse em dash to hyphen and can implement that in the GU datasets to match en dash. The ellipsis behavior in GUM seems correct to me. For double and single quotes, GU corpora intentionally use a straight single quote for any type of single quote, and two straight single quotes for any type of double quote. This is maybe sort of LaTeX-inspired. Some practical reasons for this early on were old TreeTagger models that did that and the use of some XML tools that delimit annotations as double-quoted attributes. Never using double straight quotes allows us to use these tools without worrying about escaping, and to get only one kind of quote in JSON files for convenience. If no one objects strongly, I would like to keep this practice of using only the plain straight quote in lemmas.
I'd prefer double quotes (straight or curly) to have the |
I am old enough to remember TreeTagger and the PennTB decision to lemmatize quotes using the ``LaTeX escape codes''. I also remember I hated this decision even then - if there is a tool that cannot accept quotes on the input, we should write a wrapper/API for that tool that will do the de/escaping transparently, but we should not mess up the data. |
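The wrapper idea can be sketched roughly as follows. This is only an illustration, assuming a tool that stores lemmas in double-quoted XML-style attributes; `to_attr`, `from_attr`, and `fake_tool` are hypothetical names, not part of any pipeline discussed here. The data keeps its real quote characters, and escaping happens only at the tool boundary.

```python
from xml.sax.saxutils import escape, unescape


def to_attr(lemma):
    """Escape a lemma so it can sit inside a double-quoted attribute."""
    # escape() handles &, < and > by default; the extra entity covers ".
    return escape(lemma, {'"': "&quot;"})


def from_attr(value):
    """Undo to_attr() on a value read back from the tool."""
    return unescape(value, {"&quot;": '"'})


def fake_tool(attr_value):
    """Hypothetical stand-in for a tool that emits lemma="..." attributes."""
    return f'<tok lemma="{attr_value}"/>'


for lemma in ['"', "''", "&", "<"]:
    stored = to_attr(lemma)
    print(fake_tool(stored), "->", repr(from_attr(stored)))
```

With this kind of wrapper the lemma column itself never has to carry an escaped spelling.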
No, this was definitely not a call to conform with TreeTagger, just a historical explanation. As of right now, we do have tools in the pipeline that don't do proper XML escaping in annotations, so I hesitate to change the double quote lemma, though we can consider it as a longer-term to-do. Keep in mind though that if we lemmatize smart quotes to |
I'm not sure how this is related to XML escaping (CoNLL-U is not XML, XML also requires escaping four other characters, which are not escaped in GUM lemmas, the XML way of escaping quotes is
This is another question (relevant for this GitHub issue) and I am not so sure here. I would be OK with form=lemma for all punctuation. I can imagine disambiguating the opening and closing quotation mark in the lemma, i.e. using the curly/typographic quotes (also called "smart quotes" because of the feature in word processors) as the lemma for straight quotes. That said, I can also imagine the opposite approach (straight quotes as lemma), motivated by users who search for any quotes (and don't know how to write a regex matching any quotes).
From a purely UD standpoint it seems arbitrary to lemmatize quotes with an "escaped" spelling that wouldn't appear in most surface text as escaping is not an issue in the .conllu format. If there was a very strong tradition of doing this across the English NLP ecosystem (i.e. if all the modern lemmatizers mapped I'm not intimately familiar with the GUM pipeline, but would a short-term solution be to postprocess the .conllu file just when publishing it to the official UD repo? |
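A late postprocessing step of the kind suggested here could look something like the sketch below. It assumes the only change needed is rewriting the LEMMA column of PUNCT tokens from two straight single quotes to one straight double quote; the function name `fix_quote_lemmas` and the sample sentence are made up for illustration and are not taken from the GUM build.

```python
def fix_quote_lemmas(conllu_text, old="''", new='"'):
    """Rewrite the LEMMA column of PUNCT tokens whose lemma equals `old`."""
    fixed = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns; skip comments.
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[3] == "PUNCT" and cols[2] == old:
            cols[2] = new  # only LEMMA changes; FORM and XPOS stay as they are
            line = "\t".join(cols)
        fixed.append(line)
    return "\n".join(fixed)


sample = "\n".join([
    "# text = \u201CHi\u201D",
    "1\t\u201C\t''\tPUNCT\t``\t_\t2\tpunct\t_\t_",
    "2\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_",
    "3\t\u201D\t''\tPUNCT\t''\t_\t2\tpunct\t_\t_",
])
print(fix_quote_lemmas(sample))
```

Because the rewrite touches only one column, the rest of each token line is passed through unchanged.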
I agree, if I were doing this from scratch I would also choose double straight quotes as the lemma for all types of double quotes. And as I said above, I am also willing to change this in corpora we maintain at some point in the future, I just can't do it immediately due to the tools we work with right now. We could certainly convert things at a very late step before pushing to UD, but for the moment I would like to avoid that, since it would create a discrepancy between the general GUM repo and the UD version.
Yes, but some of the tools in our pipeline which process the lemmas have an interchange format of the type
I would rather fix the tool chain before implementing half solutions - the GUM build bot is pretty complex at this point, so it's not just about generating the UD conllu. TL;DR - I'll try to get this done for the next UD release, but can't simply push a fix right now. Moving this specific sub-issue to amir-zeldes/gum#176 |
Note that
Yes, if you can fix the tool itself, it is even better than writing a wrapper for it (I thought these were 3rd-party tools).
My memory is bad (but I have an excuse - it has been almost 20 years since I stopped using TreeTagger): I was perhaps not very happy with the TreeTagger decision to use the LaTeX escapes in quotation mark lemmas, but what I hated was the decision of some taggers to accept only LRB and RRB as forms of left/right round bracket even if they didn't use the PennTB format anymore. Lemma could be considered an arbitrary id of a lexeme, after all.
Well, essentially it's just a form of escaping, and it's necessary if you're using brackets to express a PTB tree in the native bracketing format. But I too prefer conllu ;)
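To make the point about bracket escaping concrete, here is a small sketch using the conventional PTB escape tokens (`-LRB-`, `-RRB-`, and so on). The helper names and the flat `(X ...)` tree are invented for the example; the point is only that literal brackets would otherwise collide with the brackets that encode tree structure.

```python
PTB_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}
PTB_UNESCAPES = {escaped: raw for raw, escaped in PTB_ESCAPES.items()}


def ptb_escape(token):
    """Replace a literal bracket token with its PTB escape code."""
    return PTB_ESCAPES.get(token, token)


def ptb_unescape(token):
    """Map a PTB escape code back to the literal bracket."""
    return PTB_UNESCAPES.get(token, token)


tokens = ["He", "(", "probably", ")", "left", "."]
escaped = [ptb_escape(t) for t in tokens]
# A flat toy tree: structural brackets stay balanced because the
# literal brackets in the text were escaped first.
tree = "(S " + " ".join(f"(X {t})" for t in escaped) + ")"
print(tree)
print([ptb_unescape(t) for t in escaped] == tokens)
```

In a tabular format such as CoNLL-U there are no structural brackets, so the same tokens can be stored unescaped.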
True, though oddly ">" is allowed. Ampersand is needed due to entity replacement text, so that is sort of clear. But the format I'm talking about isn't actually XML - it's the CWB vertical format, which is a type of SGML with very specific restrictions (SGML is needed for various kinds of GUM markup due to nesting conflicts, where a proper XML alternative would be much messier) |
Looking at the lemmatization across the English treebanks, I've found some inconsistencies in the lemmatization of punctuation tokens between those treebanks:
- `...` (`\u002E\u002E\u002E`) is lemmatized as is in EWT and PUD, but as `…` (`\u2026`) in GUM and GENTLE.
- `…` (`\u2026`) is lemmatized as is in GUM, GENTLE, and PUD. In EWT it is lemmatized as `.` even in mid-sentence punctuation, which is an error.
- [...] `.` whereas the other treebanks keep them as is.
- `"`, `“`, and `”` are lemmatized as `"` (`\u0022`) in EWT and PUD, but as `''` (`\u0027\u0027`) in GENTLE, GUM, and GUMReddit. This looks like a lemmatization error in the GUM/GENTLE treebanks.
- `-` (`\u002D`) is lemmatized consistently as `-` (`\u002D`) across the treebanks.
- `–` (`\u2013`) EN DASH is lemmatized consistently as `-` (`\u002D`) across the treebanks.
- `—` (`\u2014`) EM DASH is lemmatized as `-` (`\u002D`) in EWT, but as is in GENTLE, GUM, and PUD.
- `--` (`\u002D\u002D`) and `---` (`\u002D\u002D\u002D`) are lemmatized as is in EWT, but as `-` (`\u002D`) in GENTLE, GUM, and GUMReddit.
- [...] `-` (`\u002D`) where it keeps the lemma as is. Some of these are mid-sentence punctuation (so would be candidates for a single `-` (`\u002D`)), whereas others are the sole punctuation in sentences, which would indicate their use as a section break rather than hyphenation to separate clauses.

It would be good to have a unified, consistent lemmatization across the treebanks for these.
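A survey like this one can be reproduced by tallying (form, lemma) pairs for PUNCT tokens in each treebank's CoNLL-U files. The sketch below is only an illustration: `punct_lemma_counts` is a hypothetical helper, and the sample sentence is invented rather than taken from any of the treebanks.

```python
from collections import Counter


def punct_lemma_counts(conllu_text):
    """Count (form, lemma) pairs for tokens whose UPOS is PUNCT."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token range or empty node
        if upos == "PUNCT":
            counts[(form, lemma)] += 1
    return counts


sample = "\n".join([
    "# text = Wait \u2014 no \u2026",
    "1\tWait\twait\tVERB\t_\t_\t0\troot\t_\t_",
    "2\t\u2014\t-\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\tno\tno\tINTJ\t_\t_\t1\tdiscourse\t_\t_",
    "4\t\u2026\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])
for (form, lemma), n in sorted(punct_lemma_counts(sample).items()):
    print(f"{form!r} -> {lemma!r}: {n}")
```

Running the same count over each treebank and comparing the resulting tables side by side makes mismatches between treebanks easy to spot.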