Add the auto aligner to make subword tokenizer easier. #505

Merged
merged 23 commits into from
Aug 19, 2021

Conversation

hunterhector
Member

@hunterhector hunterhector commented Aug 16, 2021

This PR fixes #504.

Description of changes

If no word tokenization is provided:

  1. An auto-alignment utility, forte.utils.utils.DiffAligner, is implemented; it recovers token spans even when the tokenizer output does not carry span information.
  2. Use the diff aligner to find the spans of the tokens returned by the BasicTokenizer from texar-pytorch.
  3. Call the wordpiece tokenizer on the aligned tokens.
  4. Store the UNK flag and vocab id on each Subword.

If word tokenization is provided, the provided words are used directly to create the subwords.
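The alignment idea in step 1 can be sketched as follows. This is an illustrative approximation built on Python's standard difflib, not Forte's actual DiffAligner implementation; the function name and return shape are hypothetical. It walks the tokens in order and diffs each one against the remaining text, so it can recover character spans even when the tokenizer slightly alters the surface form:

```python
import difflib
from typing import List, Optional, Tuple

def align_tokens_to_text(
    text: str, tokens: List[str]
) -> List[Optional[Tuple[int, int]]]:
    """Recover character spans for tokens that carry no span info,
    by matching each token against the not-yet-consumed text.

    Hypothetical sketch of the diff-alignment idea only.
    """
    spans: List[Optional[Tuple[int, int]]] = []
    cursor = 0  # position in `text` up to which tokens are aligned
    for token in tokens:
        # An edit-distance match (rather than str.find) also survives
        # tokenizers that alter characters, e.g. accent stripping.
        matcher = difflib.SequenceMatcher(
            None, text[cursor:], token, autojunk=False
        )
        match = matcher.find_longest_match(
            0, len(text) - cursor, 0, len(token)
        )
        if match.size == 0:
            spans.append(None)  # token could not be located at all
            continue
        start = cursor + match.a
        end = start + match.size
        spans.append((start, end))
        cursor = end  # never re-match earlier text
    return spans
```

With exact tokens the recovered spans simply index back into the original string, e.g. aligning `["Hello", ",", "world", "!"]` against `"Hello, world!"` skips over the space that the tokenizer dropped.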

Other changes:

  1. Add two more attributes to Subword (is_unk and vocab_id).
  2. During testing, a bug was found where lowercasing can change the length of a token for certain Unicode characters; this could also cause issues in the LowerCaserProcessor, so it is fixed in this PR as well.
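The length-changing lowercase behavior mentioned above can be reproduced in plain Python. U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is one documented case: it lowercases to two code points, which silently invalidates character offsets computed on the original text. The snippet only demonstrates the problem; the fix applied in LowerCaserProcessor is not shown here:

```python
# U+0130 lowercases to 'i' followed by U+0307 (COMBINING DOT ABOVE),
# so the lowered string is one code point longer than the original.
original = "İstanbul"
lowered = original.lower()

assert len(original) == 8
assert len(lowered) == 9  # spans computed on `original` no longer line up
```

Any processor that lowercases text and then reuses spans computed on the original string must account for such cases.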

Possible influences of this PR:

N/A

Test Conducted

  1. Added test cases for DiffAligner.
  2. Added test cases for SubwordTokenizerProcessor.
  3. Added new test cases for SubwordTokenizerProcessor involving problematic Unicode strings.
  4. Added test cases for LowerCaserProcessor involving problematic Unicode strings.


codecov bot commented Aug 17, 2021

Codecov Report

Merging #505 (0942d84) into master (c90bd1b) will increase coverage by 0.15%.
The diff coverage is 85.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
+ Coverage   79.04%   79.20%   +0.15%     
==========================================
  Files         215      216       +1     
  Lines       15277    15455     +178     
==========================================
+ Hits        12076    12241     +165     
- Misses       3201     3214      +13     
Impacted Files Coverage Δ
forte/data/data_pack.py 79.01% <0.00%> (-0.16%) ⬇️
forte/processors/nlp/subword_tokenizer.py 70.66% <62.50%> (+22.51%) ⬆️
tests/forte/processors/subword_tokenizer_test.py 84.37% <84.37%> (ø)
forte/utils/utils.py 85.95% <98.11%> (+9.13%) ⬆️
forte/processors/misc/lowercaser_processor.py 100.00% <100.00%> (ø)
ft/onto/base_ontology.py 94.81% <100.00%> (+1.06%) ⬆️
ft/onto/race_multi_choice_qa_ontology.py 100.00% <100.00%> (ø)
...orte/data/ontology/ontology_code_generator_test.py 100.00% <100.00%> (ø)
...ests/forte/processors/lowercaser_processor_test.py 96.87% <100.00%> (+2.43%) ⬆️
tests/forte/utils/utils_test.py 97.36% <100.00%> (+0.81%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c90bd1b...0942d84. Read the comment docs.

@hunterhector hunterhector merged commit 1df0951 into asyml:master Aug 19, 2021
@hunterhector hunterhector deleted the test_case branch August 19, 2021 15:17
Development

Successfully merging this pull request may close these issues.

Incorrect behavior in the subword tokenizer
1 participant