MWT and sentence text have incorrect final letters not found in tokens #27

Open
@amir-zeldes

Some MWTs that outwardly look like complex Hebrew word forms seem to have had final letters automatically inserted where they don't belong, for example:

# sent_id = 3488
# text = יש ביוניוורסל תשובה לכל מה שיש בם-ג-מ, ואף יותר.
1	יש	יש	VERB	VERB	HebExistential=Yes	0	root	_	_
2-3	ביוניוורסל	_	_	_	_	_	_	_	_
2	ב	ב	ADP	ADP	_	3	case	_	_
3	יוניוורסל	יוניוורסל	PROPN	PROPN	_	1	obl	_	_
4	תשובה	תשובה	NOUN	NOUN	Gender=Fem|Number=Sing	1	nsubj	_	_
5-6	לכל	_	_	_	_	_	_	_	_
5	ל	ל	ADP	ADP	_	7	case	_	_
6	כל	כול	DET	DET	Definite=Cons	7	det	_	_
7	מה	מה	ADV	ADV	PronType=Int	4	nmod	_	_
8-9	שיש	_	_	_	_	_	_	_	_
8	ש	ש	SCONJ	SCONJ	_	9	mark	_	_
9	יש	יש	VERB	VERB	HebExistential=Yes	7	acl:relcl	_	_
10-11	בם	_	_	_	_	_	_	_	SpaceAfter=No
10	ב	ב	ADP	ADP	_	11	case	_	_
11	מ	מ	PROPN	PROPN	_	9	obl	_	_
12	-	-	PUNCT	PUNCT	_	13	punct	_	SpaceAfter=No
13	ג	ג	PROPN	PROPN	_	11	flat:name	_	SpaceAfter=No
14	-	-	PUNCT	PUNCT	_	15	punct	_	SpaceAfter=No
15	מ	מ	PROPN	PROPN	_	11	flat:name	_	SpaceAfter=No
16	,	,	PUNCT	PUNCT	_	17	punct	_	_
...

The token text in node 11 is correct, but the sentence text and the MWT text are wrong (this is the name of the studio MGM, not "בם").
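To look for other instances of the same problem, something like the following rough sketch might help. It flags MWT lines whose surface form ends in a Hebrew final-form letter while the last token in the range ends in the corresponding regular letter. The function name `find_suspicious_mwts` and the heuristic itself are my own assumptions, not part of any existing treebank tooling, and the check will miss cases where the MWT legitimately differs from its tokens in other ways.

```python
# Map each Hebrew final-form letter to its regular form.
FINAL_TO_REGULAR = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}


def find_suspicious_mwts(conllu_lines):
    """Return (id_range, mwt_form, last_token_form) for MWTs whose surface
    form ends in a final letter that the last component token lacks.

    Hypothetical helper: a heuristic sketch, not an official validator.
    """
    # Keep only tab-separated token rows; skip comments and blank lines.
    rows = [
        line.rstrip("\n").split("\t")
        for line in conllu_lines
        if line.strip() and not line.startswith("#")
    ]
    # Index FORM (column 2) by ID (column 1).
    forms = {row[0]: row[1] for row in rows if len(row) >= 2}

    flagged = []
    for row in rows:
        # MWT range lines have an ID such as "10-11".
        if len(row) < 2 or "-" not in row[0]:
            continue
        mwt_form = row[1]
        last_id = row[0].split("-")[1]
        last_form = forms.get(last_id)
        if not mwt_form or not last_form:
            continue
        regular = FINAL_TO_REGULAR.get(mwt_form[-1])
        # Flag: MWT ends in a final letter, token ends in its regular form.
        if regular is not None and last_form[-1] == regular:
            flagged.append((row[0], mwt_form, last_form))
    return flagged


# Usage on a fragment of the sentence quoted above (first three columns only).
sample = [
    "# sent_id = 3488",
    "8-9\tשיש\t_",
    "8\tש\tש",
    "9\tיש\tיש",
    "10-11\tבם\t_",
    "10\tב\tב",
    "11\tמ\tמ",
]
print(find_suspicious_mwts(sample))
```

Since the sketch treats each input as a single sentence (token IDs are looked up in one dictionary), a full file would need to be split on blank lines first. The `# text` line would still need a separate fix, as this only inspects MWT rows.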
