Numbers are inverted in MWT and sentence text #26

Open
@amir-zeldes


Tokens containing multiple adjacent digits have their digits inverted (character order reversed) in the MWT form and in the `# text` sentence comment throughout the corpus. For example:

# sent_id = 5930
# text = מ5491 עד 1989 היה זה אזור אסור.
1-2	מ5491	_	_	_	_	_	_	_	_
1	מ	מ	ADP	ADP	_	2	case	_	_
2	1945	1945	NUM	NUM	_	7	nmod	_	_
3	עד	עד	ADP	ADP	_	4	case	_	_
4	1989	1989	NUM	NUM	_	7	nmod	_	_
5	היה	_	AUX	AUX	Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbType=Cop	7	cop	_	_
6	זה	זה	PRON	PRON	Gender=Masc|Number=Sing|Person=3	7	nsubj	_	_
7	אזור	אזור	NOUN	NOUN	Gender=Masc|Number=Sing	0	root	_	_
8	אסור	אסור	ADJ	ADJ	Gender=Masc|Number=Sing	7	amod	_	SpaceAfter=No
9	.	.	PUNCT	PUNCT	_	7	punct	_	_

https://github.com/UniversalDependencies/UD_Hebrew-HTB/blob/master/he_htb-ud-test.conllu#L6229-L6230

The second year in this sentence (1989) is correct in both the token and the sentence text. The first year is inverted in the MWT form and in the sentence text (5491), but not in the token itself (1945). I suspect this happens only(?) when there is an MWT, but without the original underlying text it is hard to be sure for numbers that are not obviously years.
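
A rough way to list candidate cases, as a minimal sketch rather than a tested fix: compare each digit run in an MWT form against the digit runs of the tokens that MWT covers, and flag runs that only match when reversed. The function name `find_inverted_numbers` is hypothetical, and the sketch assumes standard tab-separated CoNLL-U with blank lines between sentences. It checks MWT forms only, not the `# text` comment, and it cannot see palindromic numbers or numbers outside an MWT.

```python
import re

# Runs of two or more digits; a single digit cannot be visibly inverted.
DIGITS = re.compile(r"\d{2,}")


def find_inverted_numbers(conllu_text):
    """Return (sent_id, mwt_form, run_in_mwt, run_in_token) for suspected inversions."""
    hits = []
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        sent_id, words, ranges = None, {}, []
        for line in block.splitlines():
            if line.startswith("#"):
                if line.startswith("# sent_id"):
                    sent_id = line.split("=", 1)[1].strip()
                continue
            cols = line.split("\t")
            if len(cols) < 2:
                continue
            tok_id, form = cols[0], cols[1]
            if "-" in tok_id:
                # MWT range line such as "1-2"
                start, end = map(int, tok_id.split("-"))
                ranges.append((start, end, form))
            elif tok_id.isdigit():
                # Ordinary token; empty nodes like "8.1" are skipped
                words[int(tok_id)] = form
        for start, end, form in ranges:
            # Digit runs found in the tokens covered by this MWT
            part_runs = {
                run
                for i in range(start, end + 1)
                for run in DIGITS.findall(words.get(i, ""))
            }
            for run in DIGITS.findall(form):
                # Flag only if the run is absent as-is but present reversed
                if run not in part_runs and run[::-1] in part_runs:
                    hits.append((sent_id, form, run, run[::-1]))
    return hits
```

Running this over the contents of `he_htb-ud-test.conllu` (read with `encoding="utf-8"`) should report sentence 5930 above with `5491` against `1945`, which would help confirm or rule out the MWT-only hypothesis for the numbers it can see.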
