Tokens containing multiple adjacent digits are inverted (their character order is reversed) in the MWT (multiword token) form and in the sentence `# text` comment throughout the corpus. For example:
# sent_id = 5930
# text = מ5491 עד 1989 היה זה אזור אסור.
1-2 מ5491 _ _ _ _ _ _ _ _
1 מ מ ADP ADP _ 2 case _ _
2 1945 1945 NUM NUM _ 7 nmod _ _
3 עד עד ADP ADP _ 4 case _ _
4 1989 1989 NUM NUM _ 7 nmod _ _
5 היה _ AUX AUX Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbType=Cop 7 cop _ _
6 זה זה PRON PRON Gender=Masc|Number=Sing|Person=3 7 nsubj _ _
7 אזור אזור NOUN NOUN Gender=Masc|Number=Sing 0 root _ _
8 אסור אסור ADJ ADJ Gender=Masc|Number=Sing 7 amod _ SpaceAfter=No
9 . . PUNCT PUNCT _ 7 punct _ _
https://github.com/UniversalDependencies/UD_Hebrew-HTB/blob/master/he_htb-ud-test.conllu#L6229-L6230
The second year number in this sentence (1989) is correct in both the token and the sentence text. The first year number (1945) is inverted to 5491 in the MWT form and in the sentence text, but not in the syntactic word itself. I suspect this happens only(?) when the number is part of an MWT, but that is hard to confirm for numbers that aren't obviously years without access to the original underlying text.
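To get a rough idea of how widespread this is, something like the following Python sketch could scan a CoNLL-U file for MWT lines whose digit runs are the exact reverse of a digit run in their component words. This is only a heuristic and not part of any UD tooling: the function name `find_reversed_digit_mwts` is made up for illustration, it assumes tab-separated columns as in the real `.conllu` files, and it only looks at runs of two or more digits. Palindromic numbers (such as 1991) cannot be detected this way, and it does not check the `# text` comment.

```python
import re

# Runs of two or more adjacent ASCII digits.
DIGIT_RUN = re.compile(r"[0-9]{2,}")
# ID of a multiword token range line, e.g. "1-2".
MWT_ID = re.compile(r"^(\d+)-(\d+)$")
SENT_ID = re.compile(r"^#\s*sent_id\s*=\s*(.+)$")


def find_reversed_digit_mwts(conllu_text):
    """Return (sent_id, mwt_id, mwt_form, word_digits) for each MWT whose
    form contains a digit run that is the reverse of a digit run found in
    the forms of the words it spans."""
    hits = []
    sent_id = None
    pending = None  # the MWT range currently being collected, if any

    for line in conllu_text.splitlines():
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the sentence.
            sent_id, pending = None, None
            continue
        if line.startswith("#"):
            m = SENT_ID.match(line)
            if m:
                sent_id = m.group(1).strip()
            continue

        cols = line.split("\t")
        if len(cols) < 2:
            continue
        tok_id, form = cols[0], cols[1]

        m = MWT_ID.match(tok_id)
        if m:
            pending = {
                "id": tok_id,
                "form": form,
                "start": int(m.group(1)),
                "end": int(m.group(2)),
                "words": [],
            }
            continue

        # Plain word lines only; empty nodes such as "1.1" are skipped.
        if pending is None or not tok_id.isdigit():
            continue

        idx = int(tok_id)
        if pending["start"] <= idx <= pending["end"]:
            pending["words"].append(form)
        if idx == pending["end"]:
            word_runs = [r for w in pending["words"] for r in DIGIT_RUN.findall(w)]
            for run in DIGIT_RUN.findall(pending["form"]):
                # Flag only if the run is absent as written but present reversed.
                if run not in word_runs and run[::-1] in word_runs:
                    hits.append((sent_id, pending["id"], pending["form"], run[::-1]))
            pending = None

    return hits
```

Running it over the contents of `he_htb-ud-test.conllu` (read with `encoding="utf-8"`) should list sentence 5930 among the hits, assuming the file is still in the state quoted above.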