mpt: utf-8 support, perplexity testing, repeat penalty sampling #184

klosax · 2023-05-21T18:15:43Z

Follow-up to PR #179. Fixes issues #170 and #55.

A better solution that does not affect the other examples that use gpt_tokenize:

Revert encoding of input to utf-8 in gpt_tokenize
Add decoding of utf-8 tokens on load in mpt main.
Small change to mpt import script
Add perplexity testing
Add repeat penalty sampling
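
As a rough illustration of the repeat penalty item above, the sketch below rescales the logits of recently generated tokens before sampling. It follows the commonly used rule (divide positive logits by the penalty, multiply negative ones) and is not taken from this PR's diff. The names `token_id` and `apply_repeat_penalty` are hypothetical stand-ins for `gpt_vocab::id` and whatever helper common.cpp actually exposes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = int32_t; // assumption: stand-in for gpt_vocab::id

// Make every token found in the recent window `last_n_tokens` less likely.
// A penalty of 1.0 leaves the logits unchanged; values above 1.0 discourage repeats.
void apply_repeat_penalty(std::vector<float> & logits,
                          const std::vector<token_id> & last_n_tokens,
                          float repeat_penalty) {
    assert(repeat_penalty > 0.0f);
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const bool seen = std::find(last_n_tokens.begin(), last_n_tokens.end(),
                                    static_cast<token_id>(i)) != last_n_tokens.end();
        if (!seen) {
            continue;
        }
        if (logits[i] < 0.0f) {
            logits[i] *= repeat_penalty; // pushes a negative logit further down
        } else {
            logits[i] /= repeat_penalty; // shrinks a positive logit toward zero
        }
    }
}
```

The sign check matters: dividing a negative logit by a penalty greater than 1 would raise it, which would reward the repetition instead of discouraging it.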

examples/common.cpp

zenixls2 · 2023-05-22T10:15:56Z

examples/mpt/main.cpp

@@ -64,6 +64,105 @@ struct mpt_model {
    std::map<std::string, struct ggml_tensor *> tensors;
 };

+struct mpt_params {


Since this is C++, would it be possible to extend the struct gpt_params instead?
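
A minimal sketch of what this suggestion could look like, assuming plain struct inheritance. The `gpt_params` shown here is a simplified stand-in, not the real definition from examples/common.h, and the extra `mpt_params` fields, their defaults, and the `effective_top_k` helper are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Assumption: simplified stand-in for the shared struct in examples/common.h.
struct gpt_params {
    int32_t seed      = -1;
    int32_t n_threads = 4;
    int32_t n_predict = 200;
    int32_t top_k     = 40;
    float   top_p     = 0.9f;
    float   temp      = 0.9f;
    std::string model;
    std::string prompt;
};

// Hypothetical MPT-specific options layered on top of the shared ones.
struct mpt_params : gpt_params {
    int32_t repeat_last_n  = 64;    // window of recent tokens to penalize
    float   repeat_penalty = 1.0f;  // 1.0 disables the penalty
    bool    perplexity     = false; // run perplexity testing instead of generation
};

// Any helper written against the base struct also accepts the derived one.
int32_t effective_top_k(const gpt_params & params) {
    return params.top_k > 0 ? params.top_k : 1;
}
```

With this layout the shared fields are declared once, and code that only needs the common options keeps taking `const gpt_params &`.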

zenixls2 · 2023-05-22T10:22:13Z

examples/common.cpp

+    }
+
+
+    std::vector<std::pair<double, gpt_vocab::id>> logits_id;


The logits are floats. Maybe use std::pair<float, gpt_vocab::id>?
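
To illustrate the suggestion, here is a small top-k selection sketch that stores the logits as `float`, matching the precision they already have. On typical platforms a `std::pair<float, int32_t>` is also half the size of a `std::pair<double, int32_t>`. The names `token_id` and `top_k_logits` are hypothetical and are not taken from common.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using token_id = int32_t; // assumption: stand-in for gpt_vocab::id

// Return the k largest logits paired with their token ids, largest first.
std::vector<std::pair<float, token_id>> top_k_logits(const std::vector<float> & logits,
                                                     std::size_t k) {
    std::vector<std::pair<float, token_id>> logits_id;
    logits_id.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        logits_id.emplace_back(logits[i], static_cast<token_id>(i));
    }

    k = std::min(k, logits_id.size());
    // Only the first k entries need to be ordered, so a partial sort is enough.
    std::partial_sort(logits_id.begin(),
                      logits_id.begin() + static_cast<std::ptrdiff_t>(k),
                      logits_id.end(),
                      [](const std::pair<float, token_id> & a,
                         const std::pair<float, token_id> & b) {
                          return a.first > b.first;
                      });
    logits_id.resize(k);
    return logits_id;
}
```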

zenixls2 · 2023-05-22T10:31:39Z

Also, llama.cpp/examples/main/main.cpp (line 423) and llama.cpp/llama.cpp (line 1833) implement a similar approach that uses the sorted flag to speed things up.
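
The idea behind a sorted flag can be sketched as follows: the candidate list remembers whether it has already been ordered, so chained sampling steps sort at most once. This is an illustration under assumptions, not the llama.cpp code at the referenced lines. `token_data`, `candidates`, `sort_once`, `top_k`, and `softmax` are hypothetical names loosely modeled on that design.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = int32_t; // assumption: stand-in for gpt_vocab::id

struct token_data {
    token_id id;
    float    logit;
    float    p; // probability, filled in by softmax()
};

struct candidates {
    std::vector<token_data> data;
    bool sorted = false; // true once data is ordered by descending logit
};

// Sort by descending logit, skipping the work if a previous step already did it.
void sort_once(candidates & c) {
    if (c.sorted) {
        return;
    }
    std::sort(c.data.begin(), c.data.end(),
              [](const token_data & a, const token_data & b) { return a.logit > b.logit; });
    c.sorted = true;
}

// Keep only the k highest-logit candidates.
void top_k(candidates & c, std::size_t k) {
    sort_once(c);
    if (k < c.data.size()) {
        c.data.resize(k);
    }
}

// Convert logits to probabilities; reuses the existing order when available.
void softmax(candidates & c) {
    sort_once(c);
    if (c.data.empty()) {
        return;
    }
    const float max_logit = c.data[0].logit; // first element is the maximum after sorting
    float sum = 0.0f;
    for (token_data & t : c.data) {
        t.p = std::exp(t.logit - max_logit);
        sum += t.p;
    }
    for (token_data & t : c.data) {
        t.p /= sum;
    }
}
```

Calling `top_k` and then `softmax` on the same `candidates` sorts the list only in the first call.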

ggerganov · 2023-05-22T13:24:06Z

examples/mpt/perplexity.cpp

+    ggml_free(model.ctx);
+
+    return 0;
+}


No need to duplicate main.cpp in perplexity.cpp.
Just add a --perplexity CLI argument to main.cpp.
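
For context on what a perplexity mode has to compute, the sketch below uses the standard definition: the exponential of the average negative log-likelihood of the observed tokens. It is a generic illustration, not the code from this PR. `log_prob` and `perplexity` are hypothetical helper names, and the per-position logits are assumed to come from evaluating the model on the test text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Log-probability of `target` under a softmax over `logits`,
// computed in a numerically stable way by subtracting the maximum logit.
double log_prob(const std::vector<float> & logits, std::size_t target) {
    assert(target < logits.size());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float l : logits) {
        sum += std::exp(static_cast<double>(l) - max_logit);
    }
    return static_cast<double>(logits[target]) - max_logit - std::log(sum);
}

// Perplexity = exp(mean negative log-likelihood of the observed tokens).
// logits_per_pos[i] holds the logits that predict targets[i].
double perplexity(const std::vector<std::vector<float>> & logits_per_pos,
                  const std::vector<std::size_t> & targets) {
    assert(!targets.empty());
    assert(logits_per_pos.size() == targets.size());
    double nll = 0.0;
    for (std::size_t i = 0; i < targets.size(); ++i) {
        nll -= log_prob(logits_per_pos[i], targets[i]);
    }
    return std::exp(nll / static_cast<double>(targets.size()));
}
```

A model that spreads its probability uniformly over a vocabulary of size V has a perplexity of V, which makes a convenient sanity check.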

klosax added 8 commits May 21, 2023 19:31

common: utf-8 decoder, reverted gpt_tokenize utf-8 convert

b93e3d7

Update common.h

cee8202

main: decode utf-8 tokens on load

4e097cf

mpt import: bug fix

7818f2f

common: style fixes

5652025

common: style fix

3e49cc0

Update common.h

6b1479e

common: revert gpt_tokenize utf-8 convert

1531091

ggerganov reviewed May 21, 2023


examples/common.cpp (outdated)

klosax added 7 commits May 21, 2023 20:28

Update common.cpp

a1975e2

Update common.cpp

9fda519

Update common.cpp

2b6b4d1

Add perplexity to mpt

e41930d

Update CMakeLists: perplexity

b1b3231

mpt-perplexity: fixes

f2b1fc8

Update perplexity.cpp

1235380

klosax changed the title ~~mpt utf-8 support, revert utf-8 encoding of input to gpt_tokenize~~ mpt utf-8 support and perplexity tool May 21, 2023

klosax added 2 commits May 22, 2023 00:21

common: add sampling with repeat penalty

1dda6c4

mpt-main: add repeat penalty sampling, add commandline parameters

011f0f3

klosax changed the title ~~mpt utf-8 support and perplexity tool~~ mpt: utf-8 support, perplexity tool, repeat penalty sampling May 21, 2023

klosax added 3 commits May 22, 2023 00:36

Update common.h

76c2e8d

mpt-main: style fixes

79d240b

Update perplexity.cpp

fc95b21

zenixls2 reviewed May 22, 2023


ggerganov requested changes May 22, 2023


klosax mentioned this pull request May 22, 2023

Replit + MPT #145

Merged

klosax added 2 commits May 22, 2023 22:40

Delete perplexity.cpp

d5a6fd9

mpt: move perplexity to main

12824f7

mpt: move perplexity to main

e916a9a

klosax changed the title ~~mpt: utf-8 support, perplexity tool, repeat penalty sampling~~ mpt: utf-8 support, perplexity testing, repeat penalty sampling May 22, 2023

klosax added 2 commits May 23, 2023 00:06

common.cpp: Use codecvt utf-8 converter

ede162f

main.cpp: Use codecvt utf-8 converter

17d3e0c

This was referenced May 23, 2023

Add utf-8 support: gpt_tokenize / mpt model import #179

Merged

examples : add tokenization tests and refactor codes #186

Merged

mpt : code style changes

a4f72e8

ggerganov approved these changes May 24, 2023


ggerganov merged commit 9276285 into ggerganov:master May 24, 2023

MonkiesDance mentioned this pull request May 26, 2023

mpt - Add flags for prompt context size (-c/--ctx_size) #174

Closed

klosax mentioned this pull request May 29, 2023

gpt-neox : utf-8 support #207

Closed

NeonBohdan mentioned this pull request Jul 5, 2023

How to get the perplexity of the sequence marella/ctransformers#48

Closed
