Commit 215c69e

Don't clobber a Token's text when only a single Word is created for a Token that is supposedly an MWT. This came up while training the Albanian MWT processor.
1 parent f534d73 commit 215c69e

1 file changed, +6 -0 lines changed

stanza/models/common/doc.py

Lines changed: 6 additions & 0 deletions
@@ -366,6 +366,12 @@ def set_mwt_expansions(self, expansions,
                         word.id = idx_w
                 elif perform_mwt_processing == MWTProcessingType.PROCESS:
                     expanded = [x for x in expansions[idx_e].split(' ') if len(x) > 0]
+                    # in the event the MWT annotator only split the
+                    # Token into a single Word, we preserve its text
+                    # otherwise the Token's text is different from its
+                    # only Word's text
+                    if len(expanded) == 1:
+                        expanded = [token.text]
                     idx_e += 1
                     idx_w_end = idx_w + len(expanded) - 1
                     if token.misc: # None can happen when using a prebuilt doc
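
The effect of the added guard can be sketched in isolation. This is a minimal illustration, not code from stanza: `expand_token` is a hypothetical helper that copies only the split and the new `len(expanded) == 1` check from the diff, and the strings `"Abc"` and `"abd"` are invented inputs rather than real Albanian data.

```python
from typing import List


def expand_token(token_text: str, expansion: str) -> List[str]:
    """Hypothetical stand-in for the patched branch; not part of stanza."""
    # Same split as in the diff: empty pieces from repeated spaces are dropped.
    expanded = [x for x in expansion.split(' ') if len(x) > 0]
    # If the MWT annotator produced a single piece, keep the Token's own text
    # so that the Token and its only Word carry the same text.
    if len(expanded) == 1:
        expanded = [token_text]
    return expanded


# A genuine two-way split is left untouched.
print(expand_token("del", "de el"))   # ['de', 'el']
# A single-piece "expansion" no longer replaces the original text.
print(expand_token("Abc", "abd"))     # ['Abc']
```

Without the guard, the second call would return `['abd']`, which is the mismatch between a Token and its only Word that the commit message describes.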
