Commit 3f6e2d2

fix bug for vocab size
1 parent 1cd2772 commit 3f6e2d2

1 file changed (+2 -2 lines)

datastore/get_datastore_code.py

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@
 writer = draftretriever.Writer(
     index_file_path=datastore_path,
     max_chunk_len=512 * 1024 * 1024,
-    vocab_size=tokenizer.vocab_size,
+    vocab_size=tokenizer.vocab_size + len(tokenizer.get_added_vocab()),
 )

 total_length = len(dataset)
@@ -51,4 +51,4 @@
     token_list = tokenizer.encode(sample['content'])
     writer.add_entry(token_list)

-writer.finalize()
+writer.finalize()
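
The sketch below illustrates the likely motivation for the first hunk. It assumes the tokenizer follows the Hugging Face convention in which vocab_size counts only the base vocabulary and get_added_vocab() returns the tokens added on top of it. ToyTokenizer and full_vocab_size are hypothetical stand-ins written for this illustration; they are not part of draftretriever or of the repository.

```python
class ToyTokenizer:
    """Hypothetical stand-in exposing the two tokenizer members used in the diff."""

    def __init__(self, base_vocab, added_tokens):
        self._base = {tok: i for i, tok in enumerate(base_vocab)}
        # Assumption: added tokens receive ids directly after the base vocabulary.
        self._added = {
            tok: len(self._base) + i for i, tok in enumerate(added_tokens)
        }

    @property
    def vocab_size(self):
        # Base vocabulary only, excluding added tokens.
        return len(self._base)

    def get_added_vocab(self):
        return dict(self._added)

    def encode(self, text):
        # Whitespace split is enough for the illustration.
        table = {**self._base, **self._added}
        return [table[tok] for tok in text.split()]


def full_vocab_size(tokenizer):
    """Same expression as the '+' line in the diff."""
    return tokenizer.vocab_size + len(tokenizer.get_added_vocab())


tokenizer = ToyTokenizer(["def", "return", "x"], ["<fim_prefix>", "<fim_suffix>"])
ids = tokenizer.encode("def x <fim_suffix>")

old_size = tokenizer.vocab_size        # value passed before the commit
new_size = full_vocab_size(tokenizer)  # value passed after the commit

# Ids that would fall outside a writer sized with each value.
out_of_range_old = [i for i in ids if i >= old_size]
out_of_range_new = [i for i in ids if i >= new_size]
print(old_size, new_size, out_of_range_old, out_of_range_new)
```

With the old value, the id of an added token lies outside the declared vocabulary range. With the new value, every encoded id fits. If added tokens can overlap the base vocabulary in a given tokenizer version, the sum over-counts slightly, which only makes the declared range larger than needed.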
