Support DeepSeek Coder Model #278

Closed
@jonastemplestein

Description

Hey folks, I'm trying to use the deepseek-coder-1.3b-base model with Bumblebee. I was delighted to find that the model, tokenizer, and generation config all load, but when I try to run inference I get the following error, which is a bit hard for me to debug:

repo = {:hf, "deepseek-ai/deepseek-coder-1.3b-base"}
{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
prompt = "hello world"
Nx.Serving.run(serving, prompt)
** (ErlangError) Erlang error: "Could not decode field on position 1"
    (tokenizers 0.4.0) Tokenizers.Native.encoding_transform(#Tokenizers.Encoding<[length: 2, ids: [31702, 1835]]>, [pad: {2, [pad_id: nil, pad_token: "</s>", direction: :left]}])
    (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (bumblebee 0.4.2) lib/bumblebee/utils/tokenizers.ex:51: Bumblebee.Utils.Tokenizers.apply/4
    (nx 0.6.2) lib/nx.ex:4510: Nx.with_default_backend/2
    (bumblebee 0.4.2) lib/bumblebee/text/generation.ex:882: anonymous fn/4 in Bumblebee.Text.Generation.generation/4
    (nx 0.6.2) lib/nx/serving.ex:1704: anonymous fn/3 in Nx.Serving.handle_preprocessing/2
    (telemetry 1.2.1) /Users/jonas/Library/Caches/mix/installs/elixir-1.15.7-erts-14.1.1/f67c01eefcd351fd5b5511a96e61c42d/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    #cell:776q3ifvc2hexaoavrvlcde7ehfkvusl:7: (file)
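
Note that the failing call shows pad_id: nil even though pad_token is "</s>", so my guess is that the pad token doesn't resolve to an id in this vocab. A quick way to confirm that (just a sketch, assuming the tokenizers 0.4 token_to_id/2 API and the :tokenizer key visible in the tokenizer struct further down):

# Sketch: look up the id the padding step would need. The native
# tokenizer sits under the :tokenizer key of the Bumblebee struct.
# A nil result here would line up with the pad_id: nil in the
# failing encoding_transform call above.
Tokenizers.Tokenizer.token_to_id(tokenizer.tokenizer, "</s>")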

I'm using Bumblebee 0.4.2.

Here's the model spec:

spec: %Bumblebee.Text.Llama{
    architecture: :for_causal_language_modeling,
    vocab_size: 32256,
    max_positions: 16384,
    hidden_size: 2048,
    intermediate_size: 5504,
    num_blocks: 24,
    num_attention_heads: 16,
    activation: :silu,
    layer_norm_epsilon: 1.0e-6,
    initializer_scale: 0.02,
    output_hidden_states: false,
    output_attentions: false,
    num_labels: 2,
    id_to_label: %{},
    pad_token_id: 0
  }
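
If some spec fields are indeed off, one way to experiment (a sketch I haven't verified, assuming Bumblebee 0.4's load_spec/2, configure/2, and the :spec option to load_model/2; note this wouldn't necessarily touch the tokenizer padding failure above) would be to override them before loading:

# Hypothetical spec override, for experimenting with individual fields.
{:ok, spec} = Bumblebee.load_spec(repo, architecture: :for_causal_language_modeling)
# vocab_size: 32256 is the value from the dump above; swap in whatever
# value turns out to be right (assumption: the other fields can stay).
spec = Bumblebee.configure(spec, vocab_size: 32256)
{:ok, model_info} = Bumblebee.load_model(repo, spec: spec, backend: {EXLA.Backend, client: :host})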

And here's the tokenizer:

%Bumblebee.Text.LlamaTokenizer{
  tokenizer: #Tokenizers.Tokenizer<[
    vocab_size: 32022,
    byte_fallback: false,
    continuing_subword_prefix: nil,
    dropout: nil,
    end_of_word_suffix: nil,
    fuse_unk: false,
    model_type: "bpe",
    unk_token: nil
  ]>,
  special_tokens: %{pad: "</s>", eos: "</s>", sep: "</s>", unk: "<unk>"},
  additional_special_tokens: []
}

It looks like the vocab size in the model spec is not correct, for example: the spec says vocab_size: 32256 while the tokenizer reports 32022.

I think the tokenizer uses the correct vocabulary, because I can run this:

Bumblebee.Tokenizer.decode(tokenizer, [32015])

and it correctly returns <|fim▁hole|>, which is a DeepSeek-specific token.
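
The reverse lookup would be a complementary check (again just a sketch against the tokenizers 0.4 API; if the token weren't registered as a special token, I'd expect encode to split it into smaller pieces instead):

# Encode the token string and read back the ids; a single-element
# [32015] would confirm the round trip.
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer.tokenizer, "<|fim▁hole|>")
Tokenizers.Encoding.get_ids(encoding)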

It would be amazing if this model were supported, as deepseek-coder actually seems to be pretty good at Elixir out of the box 🙇

Thank you so much for your help!
