MWT and sentence text have incorrect final letters not found in tokens #27
Comments
If the error is in the source text I assume it should be marked |
No, I don't believe so. It looks like a conversion error, because it is consistent wherever the pattern PREPm- occurs, where PREP is one of the single-letter prepositions. It happens at least 6 times in this document about M.G.M., and I can't imagine it was spelled (only after prepositions) as ם-ג-מ, with a final 'm' only in the first position. |
Do you have access to the original HTB? I know UD likes to follow the principle that each sentence's text string should exactly match the source text, with paragraph boundaries noted if possible. |
Possibly related: the acronym for "member of parliament" appears correctly in tokens, but incorrectly with final letter in MWT throughout the corpus, as in token 15 below.
|
Yes, I think I might have it lying around, though I doubt the tokenization still matches 1:1, so much has happened to this corpus since then... And it's a big job to volunteer for, to fix everything based on the original! |
Here's another example; I think there is no way this is in the original text:
# sent_id = 2687
# text = למרות שהאיצטדיון ישן, כר הדשא מצויין והמארגנים מצפים לך-7,000 צופים.
1 למרות למרות ADP ADP _ 5 mark _ _
2-4 שהאיצטדיון _ _ _ _ _ _ _ _
2 ש ש SCONJ SCONJ _ 1 fixed _ _
3 ה ה DET DET Definite=Def|PronType=Art 4 det _ _
4 איצטדיון אצטדיון NOUN NOUN Gender=Masc|Number=Sing 5 nsubj _ _
5 ישן ישן ADJ ADJ Gender=Masc|Number=Sing 10 advcl _ SpaceAfter=No
6 , , PUNCT PUNCT _ 5 punct _ _
7 כר כר NOUN NOUN Definite=Cons|Gender=Masc|Number=Sing 10 nsubj _ _
8-9 הדשא _ _ _ _ _ _ _ _
8 ה ה DET DET Definite=Def|PronType=Art 9 det _ _
9 דשא דשא NOUN NOUN Gender=Masc|Number=Sing 7 compound:smixut _ _
10 מצויין צוין VERB VERB Gender=Masc|HebBinyan=PUAL|Number=Sing|Person=1,2,3|VerbForm=Part|Voice=Pass 0 root _ _
11-13 והמארגנים _ _ _ _ _ _ _ _
11 ו ו CCONJ CCONJ _ 14 cc _ _
12 ה ה DET DET Definite=Def|PronType=Art 13 det _ _
13 מארגנים ארגן VERB VERB Gender=Masc|HebBinyan=PIEL|Number=Plur|Person=1,2,3|VerbForm=Part|Voice=Act 14 nsubj _ _
14 מצפים ציפה VERB VERB Gender=Masc|HebBinyan=PIEL|Number=Plur|Person=1,2,3|VerbForm=Part|Voice=Act 10 conj _ _
15-16 לך _ _ _ _ _ _ _ SpaceAfter=No
15 ל ל ADP ADP _ 19 case _ _
16 כ כ ADV ADV _ 19 advmod _ _
17 - - PUNCT PUNCT _ 16 punct _ SpaceAfter=No
18 7,000 7,000 NUM NUM _ 19 nummod _ _
19 צופים צופה NOUN NOUN Gender=Masc|Number=Plur 14 obl _ SpaceAfter=No
20 . . PUNCT PUNCT _ 10 punct _ _ |
OK, in case anyone is following these issues or finds them while looking for a fix: we now have a first draft of a new, cleaned-up version of the dataset, available at https://github.com/IAHLT/UD_Hebrew and resolving most of the issues mentioned here. It's a work in progress compiled by the Israeli Association for Human Language Technology, and it aims to be up to date and valid according to the UD validator. I should add a warning, though, that it makes some different annotation decisions. Most notably, it has no inserted tokens beyond what is in the text, so it is more similar to UD Arabic: all MWTs are textually identical to the sum of their component tokens. Hope this is useful; feedback and contributions are welcome! |
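The property described above (every MWT surface form equals the concatenation of its component tokens) can be checked mechanically. The following is a minimal sketch, not part of either treebank's tooling: the function name `mwt_concat_mismatches` is hypothetical, and it assumes CoNLL-U style lines whose first two whitespace-separated columns are ID and FORM. It would report many legitimate cases on a treebank that inserts tokens not present in the text, so it only makes sense for data following the convention described in this comment.

```python
def mwt_concat_mismatches(lines):
    """Return (id_range, mwt_form, joined_tokens) for every multiword token
    whose surface form differs from the concatenation of its components.

    Assumption: each non-comment line starts with ID and FORM columns,
    separated by whitespace (tabs in real CoNLL-U files).
    """
    mismatches = []
    current = None  # [id_range, surface_form, last_token_id, component_forms]
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split()
        if len(cols) < 2:
            continue
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line such as "15-16": start collecting its components.
            current = [tok_id, form, tok_id.split("-")[1], []]
        elif current is not None:
            current[3].append(form)
            if tok_id == current[2]:
                joined = "".join(current[3])
                if joined != current[1]:
                    mismatches.append((current[0], current[1], joined))
                current = None
    return mismatches


# Tokens 15-16 of sent_id 2687, written with escapes to avoid bidi confusion:
# the MWT is lamed + FINAL kaf, the components are lamed and regular kaf.
sample = [
    "15-16 \u05DC\u05DA _ _ _ _ _ _ _ SpaceAfter=No",
    "15 \u05DC \u05DC ADP ADP _ 19 case _ _",
    "16 \u05DB \u05DB ADV ADV _ 19 advmod _ _",
]
print(mwt_concat_mismatches(sample))
```

On the sample, the MWT is reported because its surface form ends in a final kaf while the joined components end in a regular kaf.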
Some MWTs which look outwardly like complex Hebrew word forms seem to have automatically inserted final letters where they don't belong, for example:
The token text in node 11 is correct, but the sentence text and MWT text are wrong (this is the name of the studio MGM, not "בם")
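A rough way to locate other instances of this pattern is to flag every MWT whose surface form ends in a Hebrew final letter while its last component token ends in the corresponding regular letter. The sketch below is a heuristic written for illustration only: the names `FINAL_TO_REGULAR` and `find_final_letter_mismatches` are hypothetical, it assumes the first two whitespace-separated columns of each line are ID and FORM, and it cannot repair the `# text` line by itself.

```python
# Hebrew final letters mapped to their regular (non-final) forms.
FINAL_TO_REGULAR = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}


def find_final_letter_mismatches(lines):
    """Return (id_range, mwt_form, last_token_form) for MWTs that end in a
    final letter although their last component token ends in the regular
    form of the same letter."""
    hits = []
    pending = None  # (id_range, surface_form, last_token_id)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split()
        if len(cols) < 2:
            continue
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line such as "15-16": remember it until its last token.
            pending = (tok_id, form, tok_id.split("-")[1])
        elif pending is not None and tok_id == pending[2]:
            id_range, surface, _ = pending
            regular = FINAL_TO_REGULAR.get(surface[-1])
            if regular is not None and form[-1] == regular:
                hits.append((id_range, surface, form))
            pending = None
    return hits


# Hypothetical two-token MWT: bet + FINAL mem on the range line,
# bet and regular mem as component tokens.
example = [
    "1-2 \u05D1\u05DD _ _ _ _ _ _ _ _",
    "1 \u05D1 \u05D1 ADP ADP _ 2 case _ _",
    "2 \u05DE \u05DE PROPN PROPN _ 0 root _ _",
]
print(find_final_letter_mismatches(example))
```

MWTs whose last token itself ends in a final letter are not flagged, so ordinary word-final forms pass through untouched.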