mpt-7b-ggml generating garbled characters #272

Open
@taiyou2000

Description

I tried to use mpt-7b-ggml-q5_1 (https://huggingface.co/TheBloke/MPT-7B-GGML) with koboldcpp (commit hash: e6ddb15) on Ubuntu 22.04. English text generates fine, but characters in languages other than English come out garbled, like this:

������
2.
��都市
3.
����都

And the terminal is showing: gpt_tokenize: unknown token '�'

I also tried running MPT with PyTorch in Colab and on my own computer, but both hit an OOM error, so I can't tell whether this is a ggml issue or a PyTorch/transformers issue. I think it is on the ggml side, though.
I suspected the terminal's encoding was misconfigured, but it is UTF-8 (ja_JP.UTF-8), so terminal encoding is unlikely to be the cause.
Running the model with https://github.com/ggerganov/ggml gave the same result.

A similar issue seems to have been discussed early on in the llama.cpp repository: ggerganov#73
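
One possible explanation, not confirmed for koboldcpp or ggml, is that a token boundary falls in the middle of a multi-byte UTF-8 character and each token's bytes get decoded or printed separately. The sketch below only illustrates that effect in plain Python. The byte split in `pieces` is a made-up stand-in for token boundaries, not the actual MPT tokenization.

```python
import codecs

text = "都市"
raw = text.encode("utf-8")  # 6 bytes: 3 bytes per character

# Hypothetical token boundaries that cut through both characters.
pieces = [raw[0:2], raw[2:4], raw[4:6]]

# Decoding each piece on its own: incomplete or stray bytes become U+FFFD.
naive = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# An incremental decoder holds back incomplete sequences until the
# remaining bytes arrive, so the original characters survive.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
buffered = "".join(decoder.decode(p) for p in pieces)
buffered += decoder.decode(b"", final=True)

print(repr(naive))
print(repr(buffered))
```

If this is what is happening, `naive` corresponds to the garbled output above and `buffered` to the expected text. Output where only some characters are broken would be consistent with it, since only characters split across a boundary would be affected.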

Labels: bug