
Releases: OpenNMT/CTranslate2

CTranslate2 3.23.0

05 Dec 11:33
83caf67

New features

  • Support the Phi model (conversion sketch below)
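
As a rough illustration, converting a Hugging Face Phi checkpoint with the Transformers converter could look like the sketch below; the model name and output directory are placeholders, not values taken from this release.

    from ctranslate2.converters import TransformersConverter

    # Convert a Hugging Face Phi checkpoint to the CTranslate2 format.
    # Both the model name and the output directory are placeholders.
    converter = TransformersConverter("microsoft/phi-1_5")
    converter.convert("phi-1_5-ct2")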

Fixes and improvements

  • Fix the conversion for Whisper models that do not define "alignment_heads" in "generation_config.json"
  • Fix the forward_batch method

CTranslate2 3.22.0

22 Nov 21:31
d963499

New features

  • Support "sliding window" and "chunking input" for Mistral

Fixes and improvements

  • Take "generation_config.json" into account and fix the "lang_ids" getter in the Whisper converter
  • Accept a callback in the generate_tokens method (see the streaming sketch after this list)
  • Fix iomp5 linking with the latest Intel oneAPI on Ubuntu
  • Fix decoder_start_token_id for T5
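
For reference, a minimal token streaming sketch with generate_tokens, which the callback fix above touches; the model path and prompt tokens are placeholders.

    from ctranslate2 import Generator

    generator = Generator("model-ct2")  # placeholder path to a converted model

    # generate_tokens yields one GenerationStepResult per generated token.
    for step in generator.generate_tokens(["<s>", "▁Hello"], max_length=64):
        print(step.token, end="", flush=True)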

CTranslate2 3.21.0

09 Nov 16:45
1e37b52

New features

  • Minimal support for Mistral: loader and rotary embedding extension for long sequences (sliding window attention is not yet implemented)
  • Support Distil-Whisper
  • Support Whisper-large-v3

CTranslate2 3.20.0

18 Sep 16:13

New features

  • Update the Transformers converter to support more model architectures:
    • MixFormerSequential (used by microsoft/phi-1_5)
  • Accept batch inputs in the generate_tokens methods
  • Add the method Generator.async_generate_tokens, which returns an asynchronous generator compatible with asyncio (see the sketch below)
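
A minimal sketch of consuming this asynchronous generator; the model path and prompt tokens are placeholders.

    import asyncio

    from ctranslate2 import Generator

    async def stream_tokens(generator, prompt):
        # async_generate_tokens cooperates with the asyncio event loop and
        # yields one GenerationStepResult per generated token.
        async for step in generator.async_generate_tokens(prompt, max_length=64):
            print(step.token, end="", flush=True)

    generator = Generator("model-ct2")  # placeholder path to a converted model
    asyncio.run(stream_tokens(generator, ["<s>", "▁Hello"]))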

Fixes and improvements

  • Remove the epsilon value in the softmax CPU kernel for consistency with other implementations
  • Optimize the implementation of the Dynamic Time Warping (DTW) function (used for Whisper alignment)
  • Avoid an unnecessary copy of the input arguments in method Whisper::align

CTranslate2 3.19.0

31 Aug 14:36

Changes

  • Binary wheels for Python 3.7 are no longer built

New features

  • Build wheels for Python 3.12
  • Update the Transformers converter to support more model architectures:
    • Falcon-RW
    • DistilBERT
    • Llama with linear RoPE scaling (e.g. Vicuna v1.5)
    • Llama with a non default RoPE base period (e.g. CodeLlama)
  • Accept the token type IDs as inputs for encoder models
  • Add property GenerationStepResult.hypothesis_id to identify the different hypotheses when running random sampling with num_hypotheses > 1 (see the sketch below)
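
A sketch of reading hypothesis_id from the step callback of generate_batch; this assumes random sampling with beam_size=1, and the model path, prompt tokens, and parameter values are illustrative.

    from ctranslate2 import Generator

    def on_step(step):
        # step is a GenerationStepResult; hypothesis_id identifies which of
        # the sampled hypotheses the current token belongs to.
        print(step.hypothesis_id, step.token)
        return False  # return True to stop the decoding early

    generator = Generator("model-ct2")  # placeholder path to a converted model
    generator.generate_batch(
        [["<s>", "▁Hello"]],  # placeholder prompt tokens
        beam_size=1,
        sampling_topk=10,
        num_hypotheses=2,
        callback=on_step,
    )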

Fixes and improvements

  • Improve performance of 8-bit models on CPU:
    • Vectorize the GEMM output dequantization
    • Fuse the GEMM output dequantization with bias and activation
  • Allow inputs shorter than 30 seconds in Whisper methods
  • Fix incorrect batch_id values passed to the callback function
  • Fix a shape error in models using both MQA and relative positions
  • Fix compilation error related to AVX512 when using GCC 7
  • Call .detach() on PyTorch tensors before getting the Numpy array in converters

CTranslate2 3.18.0

03 Aug 12:25

Changes

Converted models now use the same floating point precision as the original models. For example, a model saved in float16 will be converted to a float16 model. Before this change, the weights were cast to float32 by default.

Similarly, selecting int8 keeps non-quantized weights in their original precision unless a more specific quantization type is selected:

  • int8_float32
  • int8_float16
  • int8_bfloat16
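
A conversion sketch selecting one of these quantization types; the model name and output directory are placeholders.

    from ctranslate2.converters import TransformersConverter

    # With int8_float16, the linear weights are quantized to int8 while the
    # remaining weights are kept in (or converted to) float16.
    converter = TransformersConverter("model-name")  # placeholder model name
    converter.convert("model-ct2", quantization="int8_float16")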

New features

  • Add the property compute_type to model instances (see the sketch after this list)
  • Extend the Python class StorageView with additional methods and properties:
    • to(dtype)
    • device_index
    • device
    • dtype
    • shape
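
A minimal sketch of these introspection additions; the model path is a placeholder, and the to(dtype) cast is only noted in a comment.

    import numpy as np

    import ctranslate2

    generator = ctranslate2.Generator("model-ct2")  # placeholder path
    print(generator.compute_type)  # the compute type the model runs with

    # Wrap a Numpy array in a StorageView and inspect the new properties.
    view = ctranslate2.StorageView.from_array(np.ones((2, 4), dtype=np.float32))
    print(view.shape)   # [2, 4]
    print(view.dtype)   # the storage data type
    print(view.device, view.device_index)  # e.g. "cpu" and 0
    # view.to(dtype) casts the storage to another data type.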

Fixes and improvements

  • Update the function get_supported_compute_types to correctly return bfloat16 when supported
  • Update the HF Llama converter to accept extra tokens in the vocabulary
  • Fix a shape error when enabling return_alternatives with a model using relative positions
  • Fix a conversion error when using torch<1.13
  • Fix a type error when running Whisper models with the bfloat16 type
  • Update pybind11 to 2.11.1

CTranslate2 3.17.1

20 Jul 18:18

Fixes and improvements

  • Fix an error when running models with the new int8_bfloat16 computation type
  • Fix a vocabulary error when converting Llama 2 models with the Transformers converter
  • Update the Transformers converter to correctly convert Llama models using GQA
  • Stop the decoding when the generator returned by the method generate_tokens is closed
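
A short sketch of the generator-closing behavior mentioned in the last item; the model path and prompt tokens are placeholders.

    from ctranslate2 import Generator

    generator = Generator("model-ct2")  # placeholder path to a converted model
    stream = generator.generate_tokens(["<s>", "▁Hello"], max_length=512)

    first_step = next(stream)  # consume one token...
    stream.close()             # ...then close the generator to stop the decoding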

CTranslate2 3.17.0

18 Jul 10:26

New features

  • Add new computation types: bfloat16 and int8_bfloat16 (require a GPU with Compute Capability 8.0 or above; see the sketch after this list)
  • Support multi-query attention for encoder-decoder models
  • Allow converters to register weights as PyTorch tensors instead of Numpy arrays
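
A minimal sketch of selecting the new compute type when loading a model; the model path is a placeholder.

    from ctranslate2 import Translator

    # bfloat16 execution requires a GPU with Compute Capability 8.0 or above.
    translator = Translator("model-ct2", device="cuda", compute_type="bfloat16")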

Fixes and improvements

  • Pass the flag trust_remote_code when loading the tokenizer in the Transformers converter
  • Improve the performance of T5 models by reusing the same relative position bias across all layers
  • Whisper: disable the first timestamp decoding rule when a prefix is used
  • Install the CMake configuration in the correct library directory (e.g. some platforms use lib64 instead of lib)

CTranslate2 3.16.1

03 Jul 19:02

Fixes and improvements

  • Fix repeated outputs in version 3.16.0 when using include_prompt_in_result=False and a batch input with variable lengths: a typo in the code led to min_length being incorrectly applied
  • Update the Transformers converter to accept extra tokens for Falcon models
  • Release the Python GIL when loading the model
  • Initialize the rotary embeddings on the GPU instead of the CPU
  • Avoid a copy for the input features passed to the Whisper methods
  • Vectorize copy in the Tile CUDA operator

CTranslate2 3.16.0

15 Jun 15:01

New features

  • Update the Transformers converter to support more architectures:
    • Falcon-40B
    • XLM-RoBERTa
  • Add the generation option sampling_topp to enable top-p (nucleus) sampling (see the sketch after this list)
  • Save vocabulary files in the JSON format to better support tokens containing newlines or carriage returns
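
A sketch of enabling nucleus sampling through generate_batch; the model path, prompt tokens, and parameter values are illustrative.

    from ctranslate2 import Generator

    generator = Generator("model-ct2")  # placeholder path to a converted model
    results = generator.generate_batch(
        [["<s>", "▁Hello"]],      # placeholder prompt tokens
        beam_size=1,
        sampling_temperature=0.8,
        sampling_topk=50,
        sampling_topp=0.9,  # sample from the smallest token set whose
                            # cumulative probability exceeds 0.9
    )
    print(results[0].sequences[0])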

Fixes and improvements

  • Fix the application of min_length and max_length when using include_prompt_in_result=False and a batch input with variable lengths: the length constraint should only apply to the sequence after the prompt
  • Update oneDNN to 3.1.1