HanBert on 🤗 Huggingface Transformers 🤗
- Converted the HanBert TensorFlow ckpt to PyTorch.
  - The optimizer-related parameters were removed, reducing the checkpoint from 1.43GB to 488MB (a sketch of which variables get skipped follows the conversion command below).
  - The conversion script had an issue where the optimizer-related parameters were not skipped, so that part was fixed before converting. (PR for the related issue)
```bash
# transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT
$ transformers bert HanBert-54kN/model.ckpt-3000000 \
                    HanBert-54kN/bert_config.json \
                    HanBert-54kN/pytorch_model.bin
```
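As an illustration of where the size reduction comes from, here is a hedged sketch of how the optimizer slots in the TensorFlow checkpoint could be identified and skipped. The variable-name patterns (`adam_m`, `adam_v`, `global_step`) are assumptions about typical BERT checkpoints, not details taken from the HanBert release or the actual conversion script.

```python
# Illustrative only: list which checkpoint variables are optimizer state and would be skipped.
import tensorflow as tf

reader = tf.train.load_checkpoint("HanBert-54kN/model.ckpt-3000000")
shapes = reader.get_variable_to_shape_map()
optimizer_vars = [name for name in shapes
                  if any(tag in name for tag in ("adam_m", "adam_v", "global_step"))]
model_vars = [name for name in shapes if name not in optimizer_vars]
print(f"{len(model_vars)} model variables kept, {len(optimizer_vars)} optimizer variables skipped")
```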
- Created a new `tokenization_hanbert.py` file for the Tokenizer.
  - Supports the Transformers tokenization-related functions (`convert_tokens_to_ids`, `convert_tokens_to_string`, `encode_plus`, ...); a usage sketch follows the tokenizer example below.
- Install the required libraries (a quick version check follows).
  - torch>=1.1.0
  - transformers>=2.2.2
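A minimal, purely illustrative way to confirm the installed versions meet these minimums:

```python
# Illustrative sanity check of the installed library versions.
import torch
import transformers

print(torch.__version__)         # expected >= 1.1.0
print(transformers.__version__)  # expected >= 2.2.2
```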
- Download the model and unzip it.
  - In the original HanBert, the tokenization-related files had to be copied to `/usr/local/moran`, but when this folder is used they work as-is.
  - Download link (Pretrained weight + Tokenizer tool)
- Prepare `tokenization_hanbert.py`.
  - The Tokenizer only works in an Ubuntu environment.
  - The directory must be laid out as shown below (a file-existence check sketch follows the listing).
```
.
├── ...
├── HanBert-54kN-torch
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── vocab_54k.txt
│   ├── libmoran4dnlp.so
│   ├── moran.db
│   ├── udict.txt
│   └── uentity.txt
├── tokenization_hanbert.py
└── ...
```
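Before loading anything, it can be worth checking that the files from the listing above are actually in place. This is only a sketch; the directory name `HanBert-54kN-torch` and the file names are taken from the listing.

```python
# Illustrative check that the files listed above exist before loading the model/tokenizer.
import os

model_dir = "HanBert-54kN-torch"
required = ["config.json", "pytorch_model.bin", "vocab_54k.txt",
            "libmoran4dnlp.so", "moran.db", "udict.txt", "uentity.txt"]
missing = [name for name in required if not os.path.exists(os.path.join(model_dir, name))]
print("missing files:", missing or "none")
```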
```python
>>> import torch
>>> from transformers import BertModel
>>> model = BertModel.from_pretrained('HanBert-54kN-torch')
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 0], [0, 0, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output
tensor([[[-0.0938, -0.5030,  0.3765,  ..., -0.4880, -0.4486,  0.3600],
         [-0.6036, -0.1008, -0.2344,  ..., -0.6606, -0.5762,  0.1021],
         [-0.4353,  0.0970, -0.0781,  ..., -0.7686, -0.4418,  0.4109]],

        [[-0.7117,  0.2479, -0.8164,  ...,  0.1509,  0.8337,  0.4054],
         [-0.7867, -0.0443, -0.8754,  ...,  0.0952,  0.5044,  0.5125],
         [-0.8613,  0.0138, -0.9315,  ...,  0.1651,  0.6647,  0.5509]]],
       grad_fn=<AddcmulBackward>)
```
```python
>>> from tokenization_hanbert import HanBertTokenizer
>>> tokenizer = HanBertTokenizer.from_pretrained('HanBert-54kN-torch')
>>> text = "나는 걸어가고 있는 중입니다. 나는걸어 가고 있는 중입니다. 잘 분류되기도 한다. 잘 먹기도 한다."
>>> tokenizer.tokenize(text)
['나', '~~는', '걸어가', '~~고', '있', '~~는', '중', '~~입', '~~니다', '.', '나', '##는걸', '##어', '가', '~~고', '~있', '~~는', '중', '~~입', '~~니다', '.', '잘', '분류', '~~되', '~~기', '~~도', '한', '~~다', '.', '잘', '먹', '~~기', '~~도', '한', '~~다', '.']
```
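For the other Transformers-style methods mentioned above (`convert_tokens_to_ids`, `convert_tokens_to_string`, `encode_plus`), the following is a minimal sketch of how they would plug into the model example. The method names follow the Transformers 2.x API; the sentence is reused from the example above, and the call is not an actual recorded run.

```python
# Illustrative use of the Transformers-style tokenizer methods together with the model.
# Assumes the directory layout shown earlier; outputs are not real values.
import torch
from transformers import BertModel
from tokenization_hanbert import HanBertTokenizer

tokenizer = HanBertTokenizer.from_pretrained('HanBert-54kN-torch')
model = BertModel.from_pretrained('HanBert-54kN-torch')

text = "나는 걸어가고 있는 중입니다."
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)          # tokens -> vocab ids
restored = tokenizer.convert_tokens_to_string(tokens)  # tokens -> plain string
encoded = tokenizer.encode_plus(text, add_special_tokens=True, max_length=50)

input_ids = torch.LongTensor([encoded["input_ids"]])
token_type_ids = torch.LongTensor([encoded["token_type_ids"]])
sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)
```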
```bash
$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-torch
$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-IP-torch
```
All results below were measured with `max_seq_len = 50`.
|                   | NSMC (acc) | Naver-NER (F1) |
|-------------------|------------|----------------|
| HanBert-54kN      | 90.16      | 87.31          |
| HanBert-54kN-IP   | 88.72      | 86.57          |
| KoBERT            | 89.63      | 86.11          |
| Bert-multilingual | 87.07      | 84.20          |
- NSMC (Naver Sentiment Movie Corpus) (Implementation of HanBert-nsmc)
- Naver NER (NER task on Naver NLP Challenge 2018) (Implementation of HanBert-NER)