MWT and sentence text have incorrect final letters not found in tokens #27

Open
amir-zeldes opened this issue May 24, 2021 · 7 comments

@amir-zeldes

Some MWTs that outwardly look like complex Hebrew word forms seem to have had final letters automatically inserted where they don't belong, for example:

# sent_id = 3488
# text = יש ביוניוורסל תשובה לכל מה שיש בם-ג-מ, ואף יותר.
1	יש	יש	VERB	VERB	HebExistential=Yes	0	root	_	_
2-3	ביוניוורסל	_	_	_	_	_	_	_	_
2	ב	ב	ADP	ADP	_	3	case	_	_
3	יוניוורסל	יוניוורסל	PROPN	PROPN	_	1	obl	_	_
4	תשובה	תשובה	NOUN	NOUN	Gender=Fem|Number=Sing	1	nsubj	_	_
5-6	לכל	_	_	_	_	_	_	_	_
5	ל	ל	ADP	ADP	_	7	case	_	_
6	כל	כול	DET	DET	Definite=Cons	7	det	_	_
7	מה	מה	ADV	ADV	PronType=Int	4	nmod	_	_
8-9	שיש	_	_	_	_	_	_	_	_
8	ש	ש	SCONJ	SCONJ	_	9	mark	_	_
9	יש	יש	VERB	VERB	HebExistential=Yes	7	acl:relcl	_	_
10-11	בם	_	_	_	_	_	_	_	SpaceAfter=No
10	ב	ב	ADP	ADP	_	11	case	_	_
11	מ	מ	PROPN	PROPN	_	9	obl	_	_
12	-	-	PUNCT	PUNCT	_	13	punct	_	SpaceAfter=No
13	ג	ג	PROPN	PROPN	_	11	flat:name	_	SpaceAfter=No
14	-	-	PUNCT	PUNCT	_	15	punct	_	SpaceAfter=No
15	מ	מ	PROPN	PROPN	_	11	flat:name	_	SpaceAfter=No
16	,	,	PUNCT	PUNCT	_	17	punct	_	_
...

The token text in node 11 is correct, but the sentence text and the MWT text are wrong (this is the name of the studio MGM, not "בם").

@nschneid

If the error is in the source text, I assume it should be marked Typo=Yes.

@amir-zeldes
Author

No, I don't believe so. It looks like a conversion error, because it's consistent wherever the pattern is:

PREPm-

where PREP is one of the single-letter prepositions. It happens at least 6 times in this document about M.G.M., and I can't imagine it was spelled (only after prepositions) as:

ם-ג-מ

with a final 'm' only in the first position.
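
A minimal sketch of how the pattern described above could be searched for automatically, assuming standard tab-separated CoNLL-U input. The function name `find_suspect_mwts` and the letter table are illustrative and not part of any treebank tooling. The check only compares the last character of each MWT surface form with the last character of its final component token, so it would not catch other kinds of mismatch.

```python
# Map each Hebrew final-form letter to its regular (non-final) counterpart.
FINAL_TO_REGULAR = {
    "\u05da": "\u05db",  # final kaf   -> kaf
    "\u05dd": "\u05de",  # final mem   -> mem
    "\u05df": "\u05e0",  # final nun   -> nun
    "\u05e3": "\u05e4",  # final pe    -> pe
    "\u05e5": "\u05e6",  # final tsadi -> tsadi
}


def find_suspect_mwts(conllu_text):
    """Return (range_id, mwt_form, last_token_form) for every MWT whose
    surface form ends in a final-form letter while its last component
    token ends in the corresponding regular letter."""
    suspects = []
    pending = None  # (range_id, end_id, surface_form) of the open MWT, if any
    for line in conllu_text.splitlines():
        if not line.strip():
            pending = None  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a token row (e.g. an elided "..." line)
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # MWT range row such as "10-11": remember where the range ends.
            pending = (tok_id, tok_id.split("-")[1], form)
        elif pending and tok_id == pending[1]:
            range_id, _, surface = pending
            pending = None
            # Flag only the exact final-vs-regular pairing at the last letter.
            if FINAL_TO_REGULAR.get(surface[-1:]) == form[-1:]:
                suspects.append((range_id, surface, form))
    return suspects
```

Run over a treebank file's contents, this should list rows like `10-11` in the sentence above. MWTs whose last token genuinely ends in a final letter are left alone.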

@nschneid

Do you have access to the original HTB? I know UD likes to follow the principle that each sentence's text string should exactly match the source text, with paragraph boundaries noted if possible.

@amir-zeldes
Author

Possibly related: the acronym for "member of parliament" appears correctly in the tokens, but incorrectly with a final letter in the MWT throughout the corpus, as in token 15 below.

# sent_id = 1870
# text = בתום ישיבה סוערת, החליטה השבוע הנהלת סיעת הליכוד לצרף את הח"ך החדש, חיים קופמן, לוועדת הכספים.
1-2	בתום	_	_	_	_	_	_	_	_
1	ב	ב	ADP	ADP	_	2	case	_	_
2	תום	תום	NOUN	NOUN	Definite=Cons|Gender=Masc|Number=Sing	6	obl	_	_
3	ישיבה	ישיבה	NOUN	NOUN	Gender=Fem|Number=Sing	2	compound:smixut	_	_
4	סוערת	סוער	ADJ	ADJ	Gender=Fem|Number=Sing	3	amod	_	SpaceAfter=No
5	,	,	PUNCT	PUNCT	_	2	punct	_	_
6	החליטה	החליט	VERB	VERB	Gender=Fem|HebBinyan=HIFIL|Number=Sing|Person=3|Tense=Past|Voice=Act	0	root	_	_
7	השבוע	השבוע	ADV	ADV	_	6	advmod	_	_
8	הנהלת	הנהלה	NOUN	NOUN	Definite=Cons|Gender=Fem|Number=Sing	6	nsubj	_	_
9	סיעת	סיעה	NOUN	NOUN	Definite=Cons|Gender=Fem|Number=Sing	8	compound:smixut	_	_
10-11	הליכוד	_	_	_	_	_	_	_	_
10	ה	ה	DET	DET	Definite=Def|PronType=Art	11	det	_	_
11	ליכוד	ליכוד	NOUN	NOUN	Gender=Masc|Number=Sing	9	compound:smixut	_	_
12	לצרף	צירף	VERB	VERB	HebBinyan=PIEL|VerbForm=Inf|Voice=Act	6	xcomp	_	_
13	את	את	ADP	ADP	Case=Acc	15	case:acc	_	_
14-15	הח"ך	_	_	_	_	_	_	_	_
14	ה	ה	DET	DET	Definite=Def|PronType=Art	15	det	_	_
15	ח"כ	_	NOUN	NOUN	Abbr=Yes|Gender=Masc|Number=Sing	12	obj	_	_
...

@amir-zeldes
Author

Yes, I think I might have it lying around, though I doubt the tokenization still matches 1:1, so much has happened to this corpus since then... And it's a big job to volunteer for, to fix everything based on the original!

@amir-zeldes
Author

Here's another example; I think there is no way this is in the original text:

15-16 לך

in

# sent_id = 2687
# text = למרות שהאיצטדיון ישן, כר הדשא מצויין והמארגנים מצפים לך-7,000 צופים.
1	למרות	למרות	ADP	ADP	_	5	mark	_	_
2-4	שהאיצטדיון	_	_	_	_	_	_	_	_
2	ש	ש	SCONJ	SCONJ	_	1	fixed	_	_
3	ה	ה	DET	DET	Definite=Def|PronType=Art	4	det	_	_
4	איצטדיון	אצטדיון	NOUN	NOUN	Gender=Masc|Number=Sing	5	nsubj	_	_
5	ישן	ישן	ADJ	ADJ	Gender=Masc|Number=Sing	10	advcl	_	SpaceAfter=No
6	,	,	PUNCT	PUNCT	_	5	punct	_	_
7	כר	כר	NOUN	NOUN	Definite=Cons|Gender=Masc|Number=Sing	10	nsubj	_	_
8-9	הדשא	_	_	_	_	_	_	_	_
8	ה	ה	DET	DET	Definite=Def|PronType=Art	9	det	_	_
9	דשא	דשא	NOUN	NOUN	Gender=Masc|Number=Sing	7	compound:smixut	_	_
10	מצויין	צוין	VERB	VERB	Gender=Masc|HebBinyan=PUAL|Number=Sing|Person=1,2,3|VerbForm=Part|Voice=Pass	0	root	_	_
11-13	והמארגנים	_	_	_	_	_	_	_	_
11	ו	ו	CCONJ	CCONJ	_	14	cc	_	_
12	ה	ה	DET	DET	Definite=Def|PronType=Art	13	det	_	_
13	מארגנים	ארגן	VERB	VERB	Gender=Masc|HebBinyan=PIEL|Number=Plur|Person=1,2,3|VerbForm=Part|Voice=Act	14	nsubj	_	_
14	מצפים	ציפה	VERB	VERB	Gender=Masc|HebBinyan=PIEL|Number=Plur|Person=1,2,3|VerbForm=Part|Voice=Act	10	conj	_	_
15-16	לך	_	_	_	_	_	_	_	SpaceAfter=No
15	ל	ל	ADP	ADP	_	19	case	_	_
16	כ	כ	ADV	ADV	_	19	advmod	_	_
17	-	-	PUNCT	PUNCT	_	16	punct	_	SpaceAfter=No
18	7,000	7,000	NUM	NUM	_	19	nummod	_	_
19	צופים	צופה	NOUN	NOUN	Gender=Masc|Number=Plur	14	obl	_	SpaceAfter=No
20	.	.	PUNCT	PUNCT	_	10	punct	_	_

@amir-zeldes
Author

OK, in case someone is following or finds these issues and is looking for a fix, we now have a first draft of a new, cleaned-up version of the dataset that resolves most of the issues mentioned here:

https://github.com/IAHLT/UD_Hebrew

It's a work in progress compiled by the Israeli Association for Human Language Technology, and it aims to be up to date and valid according to the UD validator. I should add a warning, though, that it makes some different annotation decisions: most notably, it has no inserted tokens beyond what is in the text, so it is more similar to UD Arabic, in that all MWTs are textually identical to the sum of their component tokens.
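
The property that every MWT equals the concatenation of its component tokens can be checked mechanically. Below is a rough sketch, assuming tab-separated CoNLL-U with blank lines between sentences; the function name `mwts_not_matching_tokens` is made up for illustration and is not part of the UD validator or the IAHLT repository.

```python
import re


def mwts_not_matching_tokens(conllu_text):
    """Return (range_id, mwt_form, joined_token_forms) for every MWT whose
    surface form differs from the concatenation of its component tokens."""
    mismatches = []
    # Sentences are separated by blank lines; token IDs restart in each one.
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        rows = [line.split("\t") for line in block.splitlines()
                if line.strip() and not line.startswith("#")]
        rows = [r for r in rows if len(r) >= 2]  # skip non-token lines
        # Plain integer IDs are syntactic tokens; "3-4" style IDs are MWT ranges.
        forms = {r[0]: r[1] for r in rows if r[0].isdigit()}
        for r in rows:
            if "-" in r[0]:
                start, end = (int(part) for part in r[0].split("-"))
                joined = "".join(forms.get(str(i), "")
                                 for i in range(start, end + 1))
                if joined != r[1]:
                    mismatches.append((r[0], r[1], joined))
    return mismatches
```

On a treebank following the convention described above, this would be expected to return an empty list. On data with inserted tokens or altered final letters it reports each affected range together with the string the tokens actually add up to.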

Hope this is useful - feedback and contributions are welcome!
