Support DeepSeek Coder Model #278

Closed
@jonastemplestein

Description

Hey folks, I'm trying to use the deepseek-coder-1.3b-base model with Bumblebee. I was delighted to find that the model, tokenizer, and generation config all load, but when I try to run inference I get the following error, which is a bit hard for me to debug:

repo = {:hf, "deepseek-ai/deepseek-coder-1.3b-base"}
{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
prompt = "hello world"
Nx.Serving.run(serving, prompt)
** (ErlangError) Erlang error: "Could not decode field on position 1"
    (tokenizers 0.4.0) Tokenizers.Native.encoding_transform(#Tokenizers.Encoding<[length: 2, ids: [31702, 1835]]>, [pad: {2, [pad_id: nil, pad_token: "</s>", direction: :left]}])
    (elixir 1.15.7) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (bumblebee 0.4.2) lib/bumblebee/utils/tokenizers.ex:51: Bumblebee.Utils.Tokenizers.apply/4
    (nx 0.6.2) lib/nx.ex:4510: Nx.with_default_backend/2
    (bumblebee 0.4.2) lib/bumblebee/text/generation.ex:882: anonymous fn/4 in Bumblebee.Text.Generation.generation/4
    (nx 0.6.2) lib/nx/serving.ex:1704: anonymous fn/3 in Nx.Serving.handle_preprocessing/2
    (telemetry 1.2.1) /Users/jonas/Library/Caches/mix/installs/elixir-1.15.7-erts-14.1.1/f67c01eefcd351fd5b5511a96e61c42d/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    #cell:776q3ifvc2hexaoavrvlcde7ehfkvusl:7: (file)
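
Note that the failing call shows pad_id: nil even though pad_token is "</s>", so my guess is that the pad token doesn't resolve to an id in this vocab. A quick way to confirm that (just a sketch, assuming the tokenizers 0.4 token_to_id/2 API and the :tokenizer key visible in the tokenizer struct further down):

# Sketch: look up the id the padding step would need. The native
# tokenizer sits under the :tokenizer key of the Bumblebee struct.
# A nil result here would line up with the pad_id: nil in the
# failing encoding_transform call above.
Tokenizers.Tokenizer.token_to_id(tokenizer.tokenizer, "</s>")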

I'm using Bumblebee 0.4.2.

Here's the model spec:

spec: %Bumblebee.Text.Llama{
    architecture: :for_causal_language_modeling,
    vocab_size: 32256,
    max_positions: 16384,
    hidden_size: 2048,
    intermediate_size: 5504,
    num_blocks: 24,
    num_attention_heads: 16,
    activation: :silu,
    layer_norm_epsilon: 1.0e-6,
    initializer_scale: 0.02,
    output_hidden_states: false,
    output_attentions: false,
    num_labels: 2,
    id_to_label: %{},
    pad_token_id: 0
  }
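
If some spec fields are indeed off, one way to experiment (a sketch I haven't verified, assuming Bumblebee 0.4's load_spec/2, configure/2, and the :spec option to load_model/2; note this wouldn't necessarily touch the tokenizer padding failure above) would be to override them before loading:

# Hypothetical spec override, for experimenting with individual fields.
{:ok, spec} = Bumblebee.load_spec(repo, architecture: :for_causal_language_modeling)
# vocab_size: 32256 is the value from the dump above; swap in whatever
# value turns out to be right (assumption: the other fields can stay).
spec = Bumblebee.configure(spec, vocab_size: 32256)
{:ok, model_info} = Bumblebee.load_model(repo, spec: spec, backend: {EXLA.Backend, client: :host})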

And here's the tokenizer:

%Bumblebee.Text.LlamaTokenizer{
  tokenizer: #Tokenizers.Tokenizer<[
    vocab_size: 32022,
    byte_fallback: false,
    continuing_subword_prefix: nil,
    dropout: nil,
    end_of_word_suffix: nil,
    fuse_unk: false,
    model_type: "bpe",
    unk_token: nil
  ]>,
  special_tokens: %{pad: "</s>", eos: "</s>", sep: "</s>", unk: "<unk>"},
  additional_special_tokens: []
}

It looks like the vocab size in the model spec is not correct, for example: the spec says vocab_size: 32256 while the tokenizer reports 32022.

I think the tokenizer uses the correct vocabulary, because I can run this:

Bumblebee.Tokenizer.decode(tokenizer, [32015])

and it correctly returns <|fim▁hole|>, which is a DeepSeek-specific token.
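
The reverse lookup would be a complementary check (again just a sketch against the tokenizers 0.4 API; if the token weren't registered as a special token, I'd expect encode to split it into smaller pieces instead):

# Encode the token string and read back the ids; a single-element
# [32015] would confirm the round trip.
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer.tokenizer, "<|fim▁hole|>")
Tokenizers.Encoding.get_ids(encoding)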

It would be amazing if this model were supported, as deepseek-coder actually seems to be pretty good at Elixir out of the box 🙇

Thank you so much for your help!
