Add decode stream #64

Merged 12 commits on Apr 29, 2025
60 changes: 60 additions & 0 deletions lib/tokenizers/decode_stream.ex
@@ -0,0 +1,60 @@
defmodule Tokenizers.DecodeStream do
  @moduledoc """
  Implements streaming decoding functionality for tokenizers.
  """

  @enforce_keys [:resource]
  defstruct [:resource]

  @type t :: %__MODULE__{
          resource: reference()
        }

  @doc """
  Creates a new decode stream.

  ## Options

    * `:skip_special_tokens` - determines whether special tokens should be
      skipped during decoding. By default, it is set to `false`.

  """
  @spec new(keyword()) :: t()
  def new(opts \\ []) when is_list(opts) do
    opts = Keyword.validate!(opts, skip_special_tokens: false)
    Tokenizers.Native.decoder_stream_new(opts[:skip_special_tokens])
  end

  @doc """
  Steps through the decode stream with the given tokenizer and token ID.

  Returns `{:ok, String.t()}` if there's a decoded string, or
  `{:ok, :out_of_range}` if the token ID is out of range.
  Returns `{:error, reason}` if an error occurs during decoding.
  """
  def step(%__MODULE__{} = decode_stream, tokenizer, id) when is_integer(id) do
    case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
      {:ok, decoded} when is_binary(decoded) ->
        {:ok, decoded}

      {:ok, nil} ->
        {:ok, :out_of_range}

      {:error, reason} ->
        {:error, reason}
    end
  end

  @doc """
  Returns information about the decode stream state.
  """
  defdelegate info(decode_stream), to: Tokenizers.Native, as: :decoder_stream_info

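  defimpl Inspect do
    import Inspect.Algebra
    alias Tokenizers.DecodeStream

    def inspect(decode_stream, opts) do
      # Build an algebra document with concat/1. Interpolating the result of
      # to_doc/2 into a string raises whenever it is not a plain binary.
      concat(["#Tokenizers.DecodeStream<", to_doc(DecodeStream.info(decode_stream), opts), ">"])
    end
  end
end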
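The three return shapes documented for `step/3` suggest a simple consumer loop. The following is a minimal sketch of one, not code from this PR: `StubStream` and `StreamConsumer` are hypothetical names, and `StubStream.step/3` only imitates the documented return values so the snippet runs without the native library.

```elixir
defmodule StubStream do
  # Hypothetical stand-in for Tokenizers.DecodeStream.step/3. It mimics the
  # documented return shapes only; it performs no real decoding.
  @vocab %{0 => "Hello", 1 => " world", 2 => "!"}

  def step(_stream, _tokenizer, id) when is_integer(id) and id < 0,
    do: {:error, "negative token ID"}

  def step(_stream, _tokenizer, id) when is_integer(id) do
    case Map.fetch(@vocab, id) do
      {:ok, chunk} -> {:ok, chunk}
      :error -> {:ok, :out_of_range}
    end
  end
end

defmodule StreamConsumer do
  # Feeds IDs one at a time and accumulates the decoded chunks.
  # Skips IDs reported as :out_of_range and halts on the first error.
  def decode_all(step_fun, ids) do
    result =
      Enum.reduce_while(ids, {:ok, []}, fn id, {:ok, acc} ->
        case step_fun.(id) do
          {:ok, :out_of_range} -> {:cont, {:ok, acc}}
          {:ok, chunk} when is_binary(chunk) -> {:cont, {:ok, [chunk | acc]}}
          {:error, reason} -> {:halt, {:error, reason}}
        end
      end)

    case result do
      {:ok, chunks} -> {:ok, chunks |> Enum.reverse() |> IO.iodata_to_binary()}
      {:error, _reason} = error -> error
    end
  end
end

# The atoms :stream and :tokenizer are placeholders that the stub ignores.
step = fn id -> StubStream.step(:stream, :tokenizer, id) end
{:ok, text} = StreamConsumer.decode_all(step, [0, 1, 99, 2])
IO.puts(text)
```

With the real module, the closure would presumably become `fn id -> Tokenizers.DecodeStream.step(stream, tokenizer, id) end`, where `stream` comes from `Tokenizers.DecodeStream.new/1`. Whether a caller should skip or abort on `:out_of_range` is a design choice this sketch makes arbitrarily.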
7 changes: 7 additions & 0 deletions lib/tokenizers/native.ex
@@ -33,6 +33,13 @@ defmodule Tokenizers.Native do
  def decoders_ctc(_options), do: err()
  def decoders_sequence(_decoders), do: err()

  # DecoderStream
  def decoder_stream_step(_decoder_stream, _tokenizer, _id), do: err()
  #
  def decoder_stream_info(_decoder_stream), do: err()
  #
  def decoder_stream_new(_skip_special_tokens), do: err()

  # Encoding
  def encoding_get_length(_encoding), do: err()
  def encoding_get_n_sequences(_encoding), do: err()
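Each function in this hunk is a fallback body that runs only when the native implementation has not replaced it. The diff does not show `err/0`, so the sketch below assumes it wraps `:erlang.nif_error/1`, which is the usual convention for NIF stubs. `StubNative` is a hypothetical module used only to show what a caller would see in that case.

```elixir
defmodule StubNative do
  # Hypothetical module for illustration. No native library is loaded here,
  # so calling the function runs the fallback body.
  def decoder_stream_new(_skip_special_tokens), do: err()

  # Assumed definition of err/0; the diff does not show the real one.
  defp err, do: :erlang.nif_error(:nif_not_loaded)
end

# Elixir surfaces the Erlang error as an ErlangError carrying the original term.
reason =
  try do
    StubNative.decoder_stream_new(false)
  rescue
    e in ErlangError -> e.original
  end

IO.inspect(reason)
```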