How to detokenize a BertTokenizer output? #36
I was wondering if there's a proper way of detokenizing the output tokens, i.e., constructing the sentence back from the tokens? Considering the fact that the word-piece tokenisation introduces lots of ##s.

Comments
You can remove ' ##', but you cannot know whether there was a space around punctuation tokens, or recover uppercase words.
Yes. I don't plan to include a reverse conversion of tokens in the tokenizer.
In my case, I do:
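A minimal sketch of this approach, assuming `tokens` is the list returned by `BertTokenizer.tokenize`; as noted above, spacing around punctuation and the original casing are not recoverable this way:

```python
# Minimal sketch: join wordpieces back into words by dropping the " ##" markers.
# Assumes `tokens` is the output of BertTokenizer.tokenize.
def detokenize(tokens):
    return " ".join(tokens).replace(" ##", "")

print(detokenize(["john", "johan", "##son", "'", "s", "house"]))
# -> "john johanson ' s house"
```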
Apostrophe is considered a punctuation mark, but often it is an integrated part of the word. Regular […] now, if you fix the original, you will be able to restore the original words.
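For illustration, a hedged sketch of that kind of fix-up: a heuristic that re-attaches a free-standing apostrophe to its neighbours after the naive join. It is not part of BertTokenizer, and it will also merge genuinely separate quote marks:

```python
import re

def fix_apostrophes(text):
    # Heuristic: glue a free-standing apostrophe back onto its neighbours,
    # e.g. "john johanson ' s house" -> "john johanson's house".
    return re.sub(r"\s+'\s+", "'", text)

print(fix_apostrophes("john johanson ' s house"))
# -> "john johanson's house"
```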
@thomwolf could you point to the specific section of the code? EDIT: is it this bit from run_squad.py?

```python
tok_to_orig_index = []   # wordpiece index -> original token index
orig_to_tok_index = []   # original token index -> index of its first wordpiece
all_doc_tokens = []      # flat list of wordpiece tokens
for (i, token) in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)
```
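Those mappings are what make a lossless round trip possible: every wordpiece position can be traced back to the original whitespace-delimited word. A sketch of the reverse lookup, reusing the snippet's variables (`wordpiece_span_to_text` is a hypothetical helper name, not part of the repo):

```python
# Sketch: map a span of wordpiece positions [start, end] (inclusive) back to
# the original words, via the tok_to_orig_index built above.
def wordpiece_span_to_text(doc_tokens, tok_to_orig_index, start, end):
    orig_start = tok_to_orig_index[start]
    orig_end = tok_to_orig_index[end]
    return " ".join(doc_tokens[orig_start:orig_end + 1])

# e.g. wordpiece_span_to_text(example.doc_tokens, tok_to_orig_index, 3, 5)
```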