[Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens #9144
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9144    +/-   ##
=========================================
  Coverage    53.26%   53.26%
=========================================
  Files          652      652
  Lines       105581   105606      +25
=========================================
+ Hits         56237    56254      +17
- Misses       49344    49352       +8

☔ View full report in Codecov by Sentry.
Force-pushed from 5388c5b to 20f9c6c
encoder_dict[token] = len(self.encoder.keys())
decoder_dict[len(self.decoder.keys())] = token
current_encoder_length = len(self.encoder) + len(self.added_tokens_encoder)
current_decoder_length = len(self.decoder) + len(self.added_tokens_decoder)
current_encoder_length and current_decoder_length are equivalent to len(self): the vocab size including all special tokens. A __len__ method should be implemented instead:
def __len__(self):
    # Full vocab size, including all added special tokens.
    return len(self.encoder) + len(self.added_tokens_encoder)
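For illustration, with __len__ in place the size check behaves as expected. A minimal sketch (the tokenizer instance and the token strings are hypothetical, and it assumes both tokens are new to the vocab):

# Adding two new special tokens should grow len(tokenizer) by exactly 2.
old_size = len(tokenizer)
tokenizer.add_special_tokens({"additional_special_tokens": ["<tok_a>", "<tok_b>"]})
assert len(tokenizer) == old_size + 2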
done
for idx, token in enumerate(token_list):
    if token not in self.added_tokens_encoder:
        encoder_dict[token] = current_encoder_length + idx
        decoder_dict[current_decoder_length + idx] = token
If a token is already in self.added_tokens_encoder, the assigned token_ids will not be contiguous.
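One way to keep the ids contiguous is to count only the tokens that are genuinely new, rather than reusing the loop index. A sketch using the same variable names as the snippet above (illustrative, not the exact patch):

added = 0
for token in token_list:
    # Tokens that already have an id are skipped entirely, so they do not
    # create gaps in the ids assigned to the new tokens.
    if token in self.encoder or token in self.added_tokens_encoder:
        continue
    new_id = len(self) + added  # len(self) == base vocab + previously added tokens
    encoder_dict[token] = new_id
    decoder_dict[new_id] = token
    added += 1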
done
LGTM
LGTM
PR types
Function optimization
PR changes
APIs
Description
Adds a replace_additional_special_tokens parameter to add_special_tokens.
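A sketch of how the new parameter might be used, assuming it mirrors the parameter of the same name in Hugging Face tokenizers; the checkpoint name and the exact default are assumptions, so the merged code should be checked for the actual behaviour:

from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Keep the additional special tokens already registered and append the new one.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<mark>"]},
    replace_additional_special_tokens=False,
)

# Overwrite the additional_special_tokens list with the new one
# (assumed: previously added tokens remain in the vocab, only the list changes).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<sep2>"]},
    replace_additional_special_tokens=True,
)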