[Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens #9144

Merged
merged 3 commits into from
Sep 19, 2024

Contributor

@lvdongyi lvdongyi commented Sep 14, 2024

PR types

Function optimization

PR changes

APIs

Description

  1. add replace_additional_special_tokens parameter to add_special_tokens
  2. add a test
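
To make the intent of the new parameter concrete, here is a minimal, self-contained sketch. It does not use PaddleNLP: `ToyTokenizer` is a hypothetical stand-in, and the semantics shown (replace the stored list when the flag is `True`, extend it otherwise, defaulting to `True`) are an assumption based on the same-named parameter in Hugging Face Transformers. The PR description itself does not spell them out.

```python
class ToyTokenizer:
    """Hypothetical stand-in used only for illustration, not the PaddleNLP class."""

    def __init__(self):
        self.additional_special_tokens = []

    def add_special_tokens(self, special_tokens_dict, replace_additional_special_tokens=True):
        new_tokens = special_tokens_dict.get("additional_special_tokens", [])
        if replace_additional_special_tokens:
            # Assumed behavior: the previous list is discarded.
            self.additional_special_tokens = list(new_tokens)
        else:
            # Assumed behavior: the previous list is kept and extended, skipping duplicates.
            for token in new_tokens:
                if token not in self.additional_special_tokens:
                    self.additional_special_tokens.append(token)


tok = ToyTokenizer()
tok.add_special_tokens({"additional_special_tokens": ["<a>", "<b>"]})

# Extend: "<a>" and "<b>" survive.
tok.add_special_tokens(
    {"additional_special_tokens": ["<c>"]},
    replace_additional_special_tokens=False,
)
extended = list(tok.additional_special_tokens)

# Replace (the assumed default): only "<d>" remains.
tok.add_special_tokens({"additional_special_tokens": ["<d>"]})
replaced = list(tok.additional_special_tokens)
```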

paddle-bot bot commented Sep 14, 2024

Thanks for your contribution!

@DrownFish19 DrownFish19 changed the title Add replace_additional_special_tokens parameter to add_special_tokens [Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens Sep 14, 2024
@lvdongyi lvdongyi closed this Sep 14, 2024
@lvdongyi lvdongyi reopened this Sep 16, 2024

codecov bot commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 92.85714% with 2 lines in your changes missing coverage. Please review.

Project coverage is 53.26%. Comparing base (d906171) to head (9b4607e).
Report is 242 commits behind head on develop.

Files with missing lines | Patch % | Lines
paddlenlp/transformers/tokenizer_utils_base.py | 88.23% | 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #9144   +/-   ##
========================================
  Coverage    53.26%   53.26%           
========================================
  Files          652      652           
  Lines       105581   105606   +25     
========================================
+ Hits         56237    56254   +17     
- Misses       49344    49352    +8     


@lvdongyi lvdongyi force-pushed the dev-20240914-add-special-token branch from 5388c5b to 20f9c6c Compare September 17, 2024 03:50
encoder_dict[token] = len(self.encoder.keys())
decoder_dict[len(self.decoder.keys())] = token
current_encoder_length = len(self.encoder) + len(self.added_tokens_encoder)
current_decoder_length = len(self.decoder) + len(self.added_tokens_decoder)
Collaborator

current_encoder_length and current_decoder_length are both equivalent to len(self), i.e. the size of the vocab including all special tokens.
A __len__() method should be implemented instead:

def __len__(self):
    return len(self.encoder) + len(self.added_tokens_encoder)
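
As a rough illustration of why this helps, the sketch below uses a hypothetical `ToyVocab` class (not the PaddleNLP tokenizer) whose attribute names `encoder` and `added_tokens_encoder` are borrowed from the snippet above. With `__len__` defined this way, the next free id is simply `len(self)`, so no separate length variables are needed.

```python
class ToyVocab:
    """Hypothetical illustration: base vocab plus tokens added afterwards."""

    def __init__(self, base_tokens):
        self.encoder = {token: idx for idx, token in enumerate(base_tokens)}
        self.added_tokens_encoder = {}

    def __len__(self):
        # Full vocab size: base tokens plus every added token.
        return len(self.encoder) + len(self.added_tokens_encoder)

    def add_token(self, token):
        # Skip tokens that already have an id.
        if token in self.encoder or token in self.added_tokens_encoder:
            return False
        # The next free id equals the current total size.
        self.added_tokens_encoder[token] = len(self)
        return True


vocab = ToyVocab(["hello", "world"])
vocab.add_token("<sep>")
vocab.add_token("<cls>")
vocab.add_token("<sep>")  # already present, ignored
```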

Contributor Author

done

for idx, token in enumerate(token_list):
    if token not in self.added_tokens_encoder:
        encoder_dict[token] = current_encoder_length + idx
        decoder_dict[current_decoder_length + idx] = token
Collaborator

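If a token is already in self.added_tokens_encoder, the resulting token ids will not be contiguous, because idx still advances for the skipped token.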
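
The gap can be reproduced in isolation. The sketch below is a hypothetical reconstruction, not the code that was eventually merged (the thread only records "done"): `assign_ids_with_gaps` follows the loop under review, and `assign_ids_contiguous` shows one possible fix that advances a counter only when an id is actually assigned.

```python
def assign_ids_with_gaps(token_list, added_tokens_encoder, start):
    """Pattern under review: idx advances even when a token is skipped."""
    new_ids = {}
    for idx, token in enumerate(token_list):
        if token not in added_tokens_encoder:
            new_ids[token] = start + idx
    return new_ids


def assign_ids_contiguous(token_list, added_tokens_encoder, start):
    """One possible fix: bump the counter only on assignment."""
    new_ids = {}
    next_id = start
    for token in token_list:
        if token in added_tokens_encoder or token in new_ids:
            continue
        new_ids[token] = next_id
        next_id += 1
    return new_ids


existing = {"<b>": 10}          # "<b>" was added earlier
tokens = ["<a>", "<b>", "<c>"]  # "<b>" will be skipped

gappy = assign_ids_with_gaps(tokens, existing, start=11)
dense = assign_ids_contiguous(tokens, existing, start=11)
# gappy leaves id 12 unused; dense does not.
```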

Contributor Author

done

Collaborator

@DrownFish19 DrownFish19 left a comment

LGTM

Collaborator

@ZHUI ZHUI left a comment

LGTM

@ZHUI ZHUI merged commit 90cef20 into PaddlePaddle:develop Sep 19, 2024
6 of 12 checks passed
3 participants