English PTB to UD 2.0 #717
You can use CoreNLP to convert PTB brackets for English to UD v1 (more or less, I think it represents a particular moment in time before 2.0 was released, but fairly close to v1 still), like this:
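The command itself was lost from this comment; a sketch of the usual invocation (jar name and version are illustrative, adjust to your CoreNLP install):

```shell
# Illustrative: convert PTB-style bracketed trees to CoNLL-U dependencies
# with CoreNLP's converter class.
java -cp "stanford-corenlp.jar:*" \
  edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
  -treeFile wsj_0001.mrg > wsj_0001.conllu
```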
If you have a good conversion to Stanford Dependencies, you can also use DepEdit to convert the data to the current UD standard, more or less accurately depending on whether you have some additional annotations (e.g. entity information to resolve flat/compound better). This process is described and evaluated in this paper: https://www.aclweb.org/anthology/W18-4918/

Finally, you can also use a quick and dirty UD1>UD2 DepEdit script to transform the CoreNLP output from the command above to the current guidelines, but there are certain to be errors if you don't have the additional annotations from the paper. This basically just renames the labels that were changed in v2, rewires cc+conj, etc.:
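As a rough illustration of the label-renaming part only (this is my own sketch, not the DepEdit script from the paper; it covers plain renames and does not rewire cc/conj or anything structural):

```shell
# Sketch only: rename some UD v1 relation labels to their v2 names in the
# DEPREL column (column 8) of a CoNLL-U stream on stdin.
# Structural v2 changes (cc/conj reattachment, etc.) are NOT handled here.
awk 'BEGIN { FS = OFS = "\t" }
     NF >= 8 {
       if ($8 == "dobj")      $8 = "obj"
       if ($8 == "nsubjpass") $8 = "nsubj:pass"
       if ($8 == "csubjpass") $8 = "csubj:pass"
       if ($8 == "auxpass")   $8 = "aux:pass"
       if ($8 == "mwe")       $8 = "fixed"
       if ($8 == "name")      $8 = "flat"
     }
     { print }'
```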
If you want the code from the paper, let me know, but it is probably not 100% runnable out of the box (hardwired paths etc.)
Since CoreNLP v4.0.0, the converter actually outputs UDv2! You can run it, as suggested by Amir, using the command:
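The referenced command did not survive in this copy of the thread; it is the same converter entry point as above (sketch, paths illustrative), just run against a 4.x jar:

```shell
# With CoreNLP >= 4.0.0 the same converter emits UD v2.
java -cp "stanford-corenlp-4.0.0.jar:*" \
  edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
  -treeFile treebank.mrg > treebank.conllu
```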
Just to let people know... I got some errors when I ran the UD validation script on the output data produced by CoreNLP 4.0 over the https://catalog.ldc.upenn.edu/LDC2013T19 dataset. The top 15 most frequent errors are:
Any update on CoreNLP's PTB->UD conversion producing invalid UD? @sebschu @manning @AngledLuffa
that looks like a project! i will find time this year to start chipping away at that, but there's some work i simply can't put off any longer as i promised it for an upcoming industry event
actually, one way to speed this up would be to suggest a few command lines for doing the validation
I think this should run validation for EWT:
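Something along these lines, assuming a checkout of the UniversalDependencies/tools repo next to the EWT CoNLL-U files (file names illustrative):

```shell
# Illustrative: run the official UD validator over the three EWT splits.
# --max-err 0 asks validate.py to report all errors rather than stopping early.
for f in en_ewt-ud-train.conllu en_ewt-ud-dev.conllu en_ewt-ud-test.conllu; do
  python tools/validate.py --lang en --max-err 0 "$f"
done
```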
Drilling down a bit into the most common error, that of a
Our POS tag converter code has a comment: https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalPOSMapper.java
and that part of the conversion being commented out results in the tag
whereas the UD version of that sentence is
First there's a somewhat unfortunate DRY violation here, in that the same rules are repeated in the tsurgeon file and in the constituency -> dependency converter rules: So I'll need to figure out how extensive that problem is and how best to resolve it. There have been a few dependency converter fixes over the years which I assume are not reflected in any way in the POS converter. I also need to figure out how or why this particular rule about
The other errors probably have similar origins when it comes to UPOS tags being flagged by the validator. They'll each require some individual attention regarding what kind of tree is causing the error and how to fix it.
…y trees to dependencies. UniversalDependencies/docs#717
for my own reference, i've been doing this to check a single tree:
or this for an entire slice of PTB:
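The actual commands were stripped from this copy of the thread; my reconstruction of the kind of thing meant here (paths and jar names are illustrative):

```shell
# Check a single tree saved in its own file:
java -cp "stanford-corenlp.jar:*" \
  edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
  -treeFile one-tree.mrg

# Convert an entire PTB section and validate the result:
for f in treebank_3/parsed/mrg/wsj/22/*.mrg; do
  java -cp "stanford-corenlp.jar:*" \
    edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -treeFile "$f"
done > wsj22.conllu
python tools/validate.py --lang en --max-err 0 wsj22.conllu
```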
So here's the next phrase in the dev set which isn't a
Our converter turns this into
The error given is
however, I can find this sentence in EWT which has a similar structure
so that's, quoting the French treebanks this time, kind of BS although I do notice one difference, that "kind of ___" is fixed, as opposed to our converter, which turned "sort of ___" into
Editing the dependencies to make that a
Continuing to dig into this, the converter has another component which breaks out
hey, as it turns out, there's already a thing which does
So this fix is actually rather simple, aside from all the spelunking needed. Just need to turn that
Here are some similar examples in EWT:
Also looking through GUM a bit, it looks like this should be case? But I'm not 100% convinced that's correct. Any suggestions on what to do would be welcome.
This is because the converter gets a
If I look around for possibly similar usages of
However, I'm not sure this is 100% indicative, as those usages of
I like those examples more, and they seem to suggest
Digging deeper and looking at
Effectively, once again, I have no idea what the ultimate resolution of this structure should be. Hopefully this is somewhat illustrative as to why there is very little movement over time for this issue: there are probably zero people in the world in the center of the Venn diagram of "understands the converter", "feels comfortable making authoritative decisions about dependencies", and "has the time to make these changes"
I am happy to weigh in to clarify the UD annotation policies. :) It is not surprising that this will be a nontrivial change as in the last couple of years there have been some notable general guidelines changes, some major revisions of English-specific policies (like relative clauses, pronouns, and passives), and hundreds of smaller corrections and policy changes. Some will be reflected in the main UD validator, and others are checked in English-specific validation scripts. You are quite right that

I've responded to your question about "this time around" in UniversalDependencies/UD_English-GUM#81. My gut feeling for "double the price" is advmod. nummod should be limited to actual numbers.* Is it possible to change the QP rule to check for a number (tagged NUM)?

(* An exception: Currently ordinal dates e.g. "February 28th" have NOUN/nummod to attach the date to the month but this needs to be changed.)
See my response on "around" in UniversalDependencies/UD_English-GUM#81. I think in "received double the price", "double" is
Interrogative test:
That's probably true, but there are perhaps more grad students with ML skills who might be persuaded to work on postediting the converter output based on trying to match the final UD product in a corpus like EWT... I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.
I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.
IN vs RB vs RP in PTB is also giving me headaches for various short phrases. For example,
This leads to an error
in the phrase
Is
I think so, but I don't think the converter is the right place to editorialize PTB tags. Perhaps there's some room to apply some heuristics such as a singleton ADVP is treated as a particle in the "go down", "take down", "drive down" senses... I do wonder how easy it will be to distinguish servers and coal miners going down, though, or the sentence "If you're not busy, why not drive down this weekend?"
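For concreteness, a hypothetical starting point for that singleton-ADVP heuristic, as a tregex search over the trees (sketch only; the word list and paths are mine, and it would overgenerate exactly in the "servers going down" cases mentioned above):

```shell
# Hypothetical: find VPs containing an ADVP whose only child is an RB like
# "down"/"up"/"off"/"out", i.e. candidates for particle (RP) treatment.
# Requires the CoreNLP jar and a file of bracketed trees.
java -cp "stanford-corenlp.jar:*" edu.stanford.nlp.trees.tregex.TregexPattern \
  'VP < (ADVP <: (RB < /^(down|up|off|out)$/))' treebank.mrg
```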
Yeah this is why I don't like the idiomaticity criterion. Probably best to trust the Penn tree and live with the occasional stray validator error caused by a Penn error.
In terms of
Note the inconsistent tagging. I'd like to throw the PTB into space... but I do like fixing trivial errors in large projects
"en masse" is a good one. Not
Whatever heuristics I have developed to understand these things, they are failing me in this interpretation of "en masse" as not being a grammatical expression. Would you clarify that a little bit? Also, to be clear,
This seems like a reasonable argument. It's just a situation with another unfixable validator error after using the converter on PTB. I don't mind, since there are a lot of unfixable errors at this point anyway
I would expect the tag changes to lists won't require more linguistic knowledge than I have to implement - what's the draft proposal look like?
Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up. |
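A toy illustration of that splitting policy (not the actual EWT tokenizer; the word list is just the contractions that come up in this thread):

```shell
# Split a few colloquial contractions Penn/EWT-style: gonna -> gon na, etc.
sed -E 's/\b([Gg]on)na\b/\1 na/g; s/\b([Gg]ot)ta\b/\1 ta/g; s/\b([Ww]an)na\b/\1 na/g'
```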
Agreed on that, but if a user gives the converter a tree with one of these contractions as a single word, I think it would be incorrect for the converter to split it for them. Similar to my current belief that it should return the same XPOS the user gives (via the tree) and the UPOS should correspond to the XPOS, even if that means the dependencies created violate the validator's rules about UPOS. So basically there's a whole bunch of errors the validator will flag on the output of the converter when given PTB unless we start editing the input trees in ways which users would find surprising
Yeah, that all sounds right. In terms of silencing the validator about the results of that, I wouldn't lose too much sleep over it if you want to do it, but I sort of find it right if the validator throws a warning, since the output indeed does not correspond to the recommended UD English standard, so users should be warned.
Seems like we have different expectations of what "a converter to UD" is. FWIW, in my converter of Czech PDT to UD, I want the output to be as good/valid UD as possible given the input. I am not trying to output something that will be as close as possible to the original PDT, just with some UD labeling in places where it does not hurt the feelings of users who live outside UD.
Well, we do have a PTB correcting script which fixes up a bunch of known errors (mostly tags, but could include retokenizing
ps. not that I was offended by your wording, but "hurt feelings" or unmatched expectations tend to express themselves in the form of github issues
How well-established is this convention in Penn data? Is it just a one-off where they neglected to tokenize "mighta" or is it a repeated thing? I couldn't find other tokens in OntoNotes but not sure if I was searching correctly. If it only comes up once or twice in the Penn trees then I don't think UD should necessarily enact a policy just to accommodate that. But if it's a clear policy of PTB then we should be prepared to either convert or accommodate that tokenization in UD.
Segmenting "gonna" into "gon" + "na" has to be justified. We have already discussed this in #1006. If we look at all the realisations of the lexeme TO (lemma=to) in UD_English-GUM (and if we exclude orthographic variations), we have 4 realisations (gon-na, ought-a, got-ta). It would probably be more justified to consider that TO has only two allomorphs, to and a. But the real problem is that we don't have any criterion to decide how to segment these kinds of words in UD and how to choose the form of their parts.
Maybe - it wouldn't have been hard to do "gonn a" if we were doing this from scratch, but since the Penn corpora already went with "gon na", I don't mind having a third form too much. At least we're consistent with other English corpora this way.
Plus, as a practical matter, less risk of POS taggers mislabeling the "a" as a determiner!
could always split it as
splitting it as
… in the PTB conversion to dependencies by about 250. Weirdly this is by removing 280 syntax errors and adding 40 morpho errors for aux verbs. Presumably those should be fixable. Of course, there is always more that can be done - there are now 2622 errors left when using the converter. UniversalDependencies/docs#717
i spent longer than necessary fixing up
One sentence that still goes wonky from PTB is the following:
In this case, I believe I can make a new release of CoreNLP which greatly reduces the number of errors in converted PTB once I wrap up this tiny change, but completely eliminating them with a deterministic converter is optimistic
... ultimately I don't see a difference in the verb usage in the following sentences, but I'm happy to be told how to count the angels dancing on this pin:
Compare to the following, which is a condition or something the NP had done to it rather than something the NP did:
These should all be VBD. PTB has a lot of tagger errors that annotators missed. |
Good, glad to know it wasn't me misunderstanding. Thanks for checking |
Are we happy with the conversion of
There are quite a few trees in PTB which have the
Yes that's correct, it was specifically implemented in an earlier EWT release: UniversalDependencies/UD_English-EWT#168
In terms of
and for these I should probably get stuff like
Basically it seems to be
That looks pretty similar ... but the
These are good questions. I am not an expert on how PTB uses QPs and have been frustrated at the lack of documentation on the UD treatment of these kinds of constructions. Basically, it seems to me some simple principles are:
You are right that semantically, "half the students" and "half of the students" are very similar, but the second involves a PP so syntactically speaking, they have different heads. |
Related question: what to do about
My exploration of EWT has found something kind of similar...
Another similar example in PTB:
whereas the PTB revision changes it to
But of course in typical PTB style this gets annotated differently elsewhere, such as
So, just in general, we want this pattern?
(sorry for the repeated small messages) also, this one is a bit different, with an
I believe
Yes agree with all these suggestions
Digging into one of the many tiny cases left, there's a tree which sounds a bit like Yoda:
In this case, I expect the correct dependencies would be
I have a change which fixes that one tree (and no others in PTB)
Yup! This is a subject-dependent inversion.
There was only one of them in PTB, interestingly. Maybe it was the only one with that particular parse. I came across an oddity in our converter when fixing that one... apparently the results can be different depending on the object identity of the dependency objects, which changed when I created new objects to resolve that dependency. Long story short, in the following sentence, where should
the two candidates which our converter produces are either it should attach to |
Also,
(edit: sometimes |
Does anyone know what is the best approach to convert a treebank in PTB format to UD 2.0? I found the page https://nlp.stanford.edu/software/stanford-dependencies.html, but it is not clear if the code supports UD 2.0. Suggestions are welcome.