Commit
added byte level bpe for gpt2 (initial version)
Sergei Alonichau committed Mar 4, 2021
1 parent 66adfa8, commit 9d0c9dd
Showing 33 changed files with 50,703 additions and 100,613 deletions.
@@ -1,10 +1,33 @@
-ruby export_vocab.rb
+# For details of the original implementation see https://huggingface.co/transformers/_modules/transformers/models/gpt2/tokenization_gpt2.html#GPT2Tokenizer
+# Note: we will not use UTF-8 as the input encoding for the dictionary; instead, each symbol is represented as an integer,
+# because GPT2 uses a byte-level alphabet. We need to specify --input-enc=DEC when we build the dictionary
+# (see fa_line2chain_unicode --help for details).

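The bytes-as-integers point above can be sketched in Python. This is an illustration of the idea only: with --input-enc=DEC each symbol of a dictionary entry is read as a decimal integer, so a token can be written as the integer values of its raw bytes rather than as UTF-8 text. The `token_to_dec` helper is hypothetical; the exact on-disk layout is defined by the fa_line2chain_unicode tooling.

```python
def token_to_dec(token: str) -> str:
    """Render a token as space-separated decimal byte values."""
    return " ".join(str(b) for b in token.encode("utf-8"))

print(token_to_dec("hello"))  # -> 104 101 108 108 111
print(token_to_dec("é"))      # -> 195 169 (two raw bytes; no UTF-8 decoding needed)
```

Note that a multi-byte character like "é" simply becomes two symbols, which is exactly why a byte-level BPE never needs an unknown-character fallback.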
-# produce pos.dict.utf8 file and tagset.txt:
-cat spiece.model.exportvocab.txt | awk 'BEGIN {FS="\t"} NF == 2 { if (NR > 1) { print $1 "\tWORD_ID_" NR-1 "\t" ($2 == 0 ? "-0.00001" : $2); } print "WORD_ID_" NR " " NR > "tagset.txt"; }' > pos.dict.utf8
+# run this python script to generate tagset.txt and pos.dict.utf8 from vocab.json
+python export_vocab.py

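The export_vocab.py shipped in this commit is the authoritative script. As a rough, hypothetical sketch of the kind of transformation it performs (reading a GPT-2 vocab.json-style token-to-id map and emitting one tagged, weighted dictionary entry per token plus a tagset), one might write:

```python
def export_vocab(vocab):
    """Build pos.dict.utf8 / tagset.txt contents from a {token: id} map.

    The constant fallback weight and the WORD_ID_<n> tag naming mirror the
    old awk one-liner; the real export_vocab.py may differ in details.
    """
    dict_lines, tagset_lines = [], []
    for n, (token, _tok_id) in enumerate(sorted(vocab.items(), key=lambda kv: kv[1]), start=1):
        dict_lines.append(f"{token}\tWORD_ID_{n}\t-0.00001")
        tagset_lines.append(f"WORD_ID_{n} {n}")
    return "\n".join(dict_lines) + "\n", "\n".join(tagset_lines) + "\n"

# toy stand-in for the contents of vocab.json
pos_dict, tagset = export_vocab({"hello": 0, "world": 1})
print(tagset)  # WORD_ID_1 1 \n WORD_ID_2 2
```

Since the dictionary is later built with --input-enc=DEC, the real script presumably writes each token as decimal byte values rather than the literal text used here.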
-# zip it:
+# zip the dictionary file
 zip pos.dict.utf8.zip pos.dict.utf8

-# build as usual
-make -f Makefile.gnu lang=gpt2 all
+# make sure the tools are compiled and are in the path, see the wiki for details

+# build LDB as usual; you should get output like the one below and no error messages (run the "clean" target if you encounter errors)
+~/BlingFire/ldbsrc$ make -f Makefile.gnu lang=gpt2 all

+fa_build_conf \
+  --in=gpt2/ldb.conf.small \
+  --out=gpt2/tmp/ldb.mmap.small.txt
+fa_fsm2fsm_pack --type=mmap \
+  --in=gpt2/tmp/ldb.mmap.small.txt \
+  --out=gpt2/tmp/ldb.conf.small.dump \
+  --auto-test
+unzip -p gpt2/pos.dict.utf8.zip | \
+fa_build_dict --input-enc=DEC --type=mph --raw --tagset=gpt2/tagset.txt --float-nums \
+  --out-fsm=gpt2/tmp/pos.dict.fsm.txt \
+  --out-k2i=gpt2/tmp/pos.dict.k2i.txt \
+  --out-i2info=gpt2/tmp/pos.dict.i2t.txt
+fa_fsm2fsm_pack --alg=triv --type=mealy-dfa --in=gpt2/tmp/pos.dict.fsm.txt --out=gpt2/tmp/pos.dict.fsm.small.dump --auto-test
+fa_fsm2fsm_pack --alg=triv --type=arr --force-flat --in=gpt2/tmp/pos.dict.k2i.txt --out=gpt2/tmp/pos.dict.k2i.small.dump --auto-test
+fa_fsm2fsm_pack --alg=fixed --type=mmap --in=gpt2/tmp/pos.dict.i2t.txt --out=gpt2/tmp/pos.dict.i2t.small.dump --auto-test
+fa_merge_dumps --out=ldb/gpt2.bin gpt2/tmp/ldb.conf.small.dump gpt2/tmp/pos.dict.fsm.small.dump gpt2/tmp/pos.dict.k2i.small.dump gpt2/tmp/pos.dict.i2t.small.dump

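Once the merged ldb/gpt2.bin exists, it can be consumed from the blingfire Python package (load_model / text_to_ids / free_model). A sketch under stated assumptions: the package must be installed, the model path and the unk_id / max_seq_len values below are illustrative placeholders, and the function degrades to None when the runtime or model is missing.

```python
import os

def tokenize(text, model_path="ldb/gpt2.bin", max_seq_len=128, unk_id=100):
    """Return token ids for `text` via blingfire, or None when the
    package or the compiled model file is not available."""
    try:
        from blingfire import load_model, text_to_ids, free_model
    except ImportError:
        return None  # blingfire not installed
    if not os.path.exists(model_path):
        return None  # model has not been built yet
    h = load_model(model_path)
    try:
        # ids for up to max_seq_len symbols; unknown input maps to unk_id
        return list(text_to_ids(h, text, max_seq_len, unk_id))
    finally:
        free_model(h)

print(tokenize("Hello world"))
```

Freeing the model handle in a finally block matters because load_model allocates native memory outside the Python garbage collector.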