Add tokenizer test + revert to C++11 #355

ggerganov · 2023-03-21T15:29:02Z

Add test-tokenizer-0 to do a few tokenizations - feel free to expand
Added option to convert-pth-to-ggml.py script to dump just the vocabulary
Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
Added utility to load vocabulary file from previous point (temporary implementation)
Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
Rename gpt_vocab -> llama_vocab
All CMake binaries go into ./bin/ now

Fabio3rs · 2023-03-21T16:09:19Z

What are the motivations for using C++11 over C++17?

It's possible to reduce memory usage a little with std::string_view. For example, taking my PR (#305) as a base, id_to_token (std::vector<token_score> id_to_token;) could serve as the single owner of the token strings: resize the vector before taking pointers into it, and declare token_to_id as std::unordered_map<std::string_view, id> token_to_id;

Something like:

    vocab.id_to_token.resize(n_vocab);
    /*...*/
        auto &tok_score = vocab.id_to_token[i];
        tok_score.tok = word;
        tok_score.score = score;

        vocab.token_to_id[tok_score.tok] = i;
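
A minimal, self-contained sketch of the same idea, assuming C++17. The names vocab_sketch and fill_vocab are hypothetical and only illustrate the layout; they are not the actual llama.cpp definitions. The vector is sized once before any views are taken, so the std::string_view keys stay valid as long as id_to_token is not resized or copied afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the types discussed above.
struct token_score {
    std::string tok;
    float score = 0.0f;
};

struct vocab_sketch {
    using id = std::int32_t;

    std::vector<token_score> id_to_token;                 // owns the token strings
    std::unordered_map<std::string_view, id> token_to_id; // keys are views into id_to_token
};

// Fill `vocab` in place from (word, score) pairs.
// Filling in place avoids copying the struct, which would leave the
// string_view keys pointing at the strings of the original object.
void fill_vocab(vocab_sketch &vocab,
                const std::vector<std::pair<std::string, float>> &entries) {
    // Drop the views first, then the strings they refer to.
    vocab.token_to_id.clear();
    vocab.id_to_token.clear();

    // Size the owning vector once, before taking any views into it.
    vocab.id_to_token.resize(entries.size());

    for (std::size_t i = 0; i < entries.size(); ++i) {
        auto &tok_score = vocab.id_to_token[i];
        tok_score.tok = entries[i].first;
        tok_score.score = entries[i].second;

        // The key is a view of the string stored in the vector, not a second copy.
        vocab.token_to_id[tok_score.tok] = static_cast<vocab_sketch::id>(i);
    }

    assert(vocab.id_to_token.size() == entries.size());
}
```

The trade-off is lifetime coupling: any later push_back or resize on id_to_token, or a copy of the whole struct, can leave the map keys dangling, which is one reason to prefer owning std::string keys when staying on C++11.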


ggerganov added 2 commits March 21, 2023 11:32

Add tokenizer unit test + vocab-only data for tests

ecd982d

Revert back to C++11, avoid std::string_view in the tokenizer

a0d00bd

ggerganov merged commit eb34620 into master Mar 21, 2023

ggerganov deleted the tokenizer-test branch March 21, 2023 15:29

hristo-dinkov mentioned this pull request Mar 22, 2023

Keep getting an error about "no such file or directory", I dont understand? cocktailpeanut/dalai#209

Open

FNsi mentioned this pull request Mar 30, 2023

Make loading weights 10-100x faster #613

Merged

Piezoid mentioned this pull request Apr 22, 2023

Sample interface, new samplers, #1126

Merged
