Tokenizer for our byte-based transformer model. See: https://huggingface.co/SzegedAI/charmen-electra
pip install git+https://github.com/szegedai/byte-offset-tokenizer.git
from byte_offset_tokenizer import ByteOffsetTokenizer
tokenizer = ByteOffsetTokenizer()
tokenizer('Példa mondat!')  # Hungarian for "Example sentence!"
Output:
{'input_ids': [array([3, 3, 3, ..., 0, 0, 0])], 'attention_mask': [array([ True, True, True, ..., False, False, False])], 'token_type_ids': [array([0, 0, 0, ..., 0, 0, 0])]}
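As the output above shows, each field is a list holding a padded NumPy array per input text. Below is a minimal sketch of inspecting those arrays; the variable names and the padding check are illustrative assumptions based on the output shown above, not documented library behaviour.

import numpy as np
from byte_offset_tokenizer import ByteOffsetTokenizer

tokenizer = ByteOffsetTokenizer()
encoding = tokenizer('Példa mondat!')

input_ids = encoding['input_ids'][0]            # padded array of byte-level ids
attention_mask = encoding['attention_mask'][0]  # boolean mask, False marks padding

# Count the non-padding positions produced for this sentence,
# and show the total padded sequence length.
print('real tokens:', int(np.asarray(attention_mask).sum()))
print('padded length:', len(input_ids))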