Add the auto aligner to make subword tokenizer easier. #505

Merged
merged 23 commits into from
Aug 19, 2021

Conversation

hunterhector
Member

@hunterhector hunterhector commented Aug 16, 2021

This PR fixes #504.

Description of changes

If no word tokenization is provided:

  1. An auto-alignment utility, forte.utils.utils.DiffAligner, is implemented; it recovers token spans even when the tokenizer output does not carry span information.
  2. Use the diff aligner to find the spans of the tokens returned by the BasicTokenizer from texar-pytorch.
  3. Call the wordpiece tokenizer on the aligned tokens.
  4. Store the UNK flag and vocab id on each Subword.

If word tokenization is provided, the provided words are used directly to create the subwords.
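The alignment idea in step 1 can be sketched as follows. This is an illustrative approximation built on Python's standard difflib, not Forte's actual DiffAligner implementation; the function name and return shape are hypothetical. It walks the tokens in order and diffs each one against the remaining text, so it can recover character spans even when the tokenizer slightly alters the surface form:

```python
import difflib
from typing import List, Optional, Tuple

def align_tokens_to_text(
    text: str, tokens: List[str]
) -> List[Optional[Tuple[int, int]]]:
    """Recover character spans for tokens that carry no span info,
    by matching each token against the not-yet-consumed text.

    Hypothetical sketch of the diff-alignment idea only.
    """
    spans: List[Optional[Tuple[int, int]]] = []
    cursor = 0  # position in `text` up to which tokens are aligned
    for token in tokens:
        # An edit-distance match (rather than str.find) also survives
        # tokenizers that alter characters, e.g. accent stripping.
        matcher = difflib.SequenceMatcher(
            None, text[cursor:], token, autojunk=False
        )
        match = matcher.find_longest_match(
            0, len(text) - cursor, 0, len(token)
        )
        if match.size == 0:
            spans.append(None)  # token could not be located at all
            continue
        start = cursor + match.a
        end = start + match.size
        spans.append((start, end))
        cursor = end  # never re-match earlier text
    return spans
```

With exact tokens the recovered spans simply index back into the original string, e.g. aligning `["Hello", ",", "world", "!"]` against `"Hello, world!"` skips over the space that the tokenizer dropped.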

Other changes:

  1. Add two more attributes to Subword (is_unk and vocab_id).
  2. During testing, a bug was found where lowercasing can change the length of a token for certain Unicode characters; this could also cause issues in the LowerCaserProcessor, so it is fixed in this PR as well.
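The length-changing lowercase behavior mentioned above can be reproduced in plain Python. U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is one documented case: it lowercases to two code points, which silently invalidates character offsets computed on the original text. The snippet only demonstrates the problem; the fix applied in LowerCaserProcessor is not shown here:

```python
# U+0130 lowercases to 'i' followed by U+0307 (COMBINING DOT ABOVE),
# so the lowered string is one code point longer than the original.
original = "İstanbul"
lowered = original.lower()

assert len(original) == 8
assert len(lowered) == 9  # spans computed on `original` no longer line up
```

Any processor that lowercases text and then reuses spans computed on the original string must account for such cases.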

Possible influences of this PR:

N/A

Test Conducted

  1. Added test cases for DiffAligner.
  2. Added test cases for SubwordTokenizerProcessor.
  3. Added new test cases for SubwordTokenizerProcessor involving problematic Unicode strings.
  4. Added test cases for LowerCaserProcessor involving problematic Unicode strings.


codecov bot commented Aug 17, 2021

Codecov Report

Merging #505 (0942d84) into master (c90bd1b) will increase coverage by 0.15%.
The diff coverage is 85.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
+ Coverage   79.04%   79.20%   +0.15%     
==========================================
  Files         215      216       +1     
  Lines       15277    15455     +178     
==========================================
+ Hits        12076    12241     +165     
- Misses       3201     3214      +13     
Impacted Files Coverage Δ
forte/data/data_pack.py 79.01% <0.00%> (-0.16%) ⬇️
forte/processors/nlp/subword_tokenizer.py 70.66% <62.50%> (+22.51%) ⬆️
tests/forte/processors/subword_tokenizer_test.py 84.37% <84.37%> (ø)
forte/utils/utils.py 85.95% <98.11%> (+9.13%) ⬆️
forte/processors/misc/lowercaser_processor.py 100.00% <100.00%> (ø)
ft/onto/base_ontology.py 94.81% <100.00%> (+1.06%) ⬆️
ft/onto/race_multi_choice_qa_ontology.py 100.00% <100.00%> (ø)
...orte/data/ontology/ontology_code_generator_test.py 100.00% <100.00%> (ø)
...ests/forte/processors/lowercaser_processor_test.py 96.87% <100.00%> (+2.43%) ⬆️
tests/forte/utils/utils_test.py 97.36% <100.00%> (+0.81%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c90bd1b...0942d84. Read the comment docs.

@hunterhector hunterhector merged commit 1df0951 into asyml:master Aug 19, 2021
@hunterhector hunterhector deleted the test_case branch August 19, 2021 15:17
Development

Successfully merging this pull request may close these issues.

Incorrect behavior in the subword tokenizer
1 participant