From 27092e4f3c8f23534a3ba2b5d2d35b048fa40ddf Mon Sep 17 00:00:00 2001
From: Minh-Thuc <46375464+minhthuc2502@users.noreply.github.com>
Date: Mon, 11 Mar 2024 17:12:05 +0100
Subject: [PATCH] Bump version 4.1.0 (#1638)

---
 CHANGELOG.md                  | 11 +++++++++++
 CONTRIBUTING.md               |  6 +++---
 README.md                     |  2 +-
 docs/installation.md          |  6 ++++--
 python/ctranslate2/version.py |  2 +-
 5 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 19d5893b6..02887b43d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,17 @@

### Fixes and improvements

+## [v4.1.0](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.1.0) (2024-03-11)
+
+### New features
+* Support the Gemma model (#1631)
+* Support tensor parallelism (#1599)
+
+### Fixes and improvements
+* Avoid initializing unused GPUs (#1633)
+* Read very large tensors in chunks when the size exceeds the maximum value of int (#1636)
+* Update the README
+
## [v4.0.0](https://github.com/OpenNMT/CTranslate2/releases/tag/v4.0.0) (2024-02-15)

This major version introduces a breaking change by updating to CUDA 12.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d5cb05659..5f2f2d18e 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -137,14 +137,14 @@ Python wheels for Linux and Windows are compiled against NVIDIA libraries to sup
To limit the size of the packages pushed to PyPI, some libraries are not included in the package and are dynamically loaded at runtime with `dlopen` (or `LoadLibraryA` on Windows).

* `libcudart_static.a` (statically linked)
-* `libcublas.so.11` (dlopened at runtime in [`cublas_stub.cc`](https://github.com/OpenNMT/CTranslate2/blob/master/src/cuda/cublas_stub.cc))
+* `libcublas.so.12` (dlopened at runtime in [`cublas_stub.cc`](https://github.com/OpenNMT/CTranslate2/blob/master/src/cuda/cublas_stub.cc))
* `libcudnn.so.8` (dynamically linked)
* `libcudnn_ops_infer.so.8` (dlopened at runtime by `libcudnn.so.8`)
* `libcudnn_cnn_infer.so.8` (dlopened at runtime by `libcudnn.so.8`)

-One of the benefits of this dynamic loading is that multiple versions of cuBLAS and cuDNN are supported by the same binary. In particular, users can install any CUDA 11.x version as long as it provides `libcublas.so.11`.
+One of the benefits of this dynamic loading is that multiple versions of cuBLAS and cuDNN are supported by the same binary. In particular, users can install any CUDA 12.x version as long as it provides `libcublas.so.12`.

-However, supporting a new major CUDA version (e.g. CUDA 11 to 12) requires updating the CUDA libraries used during compilation. This will be a breaking change for existing users since they would need to update their cuBLAS/cuDNN libraries and possibly [update their GPU driver](https://docs.nvidia.com/deploy/cuda-compatibility/).
+The Python wheels only support CUDA 12.x. The C++ source code remains compatible with CUDA 11, so it is possible to compile against the CUDA 11 libraries to build wheels that support CUDA 11.x.

### Updating other dependencies
diff --git a/README.md b/README.md
index 53fd07430..7ce65486b 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ The project is production-oriented and comes with [backward compatibility guaran
* **Lightweight on disk**
Quantization can make the models 4 times smaller on disk with minimal accuracy loss. * **Simple integration**
The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs. * **Configurable and interactive decoding**
[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence. -* **Support tensor parallelism for distributed inference. +* **Support tensor parallelism for distributed inference**
Very large models can be split across multiple GPUs. Follow this [documentation](docs/parallel.md#model-and-tensor-parallelism) to set up the required environment.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

diff --git a/docs/installation.md b/docs/installation.md
index 0ac0e129e..792f9f8bd 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -15,9 +15,9 @@ The Python wheels have the following requirements:
* pip version: >= 19.3 to support `manylinux2014` wheels

```{admonition} GPU support
-The Linux and Windows Python wheels support GPU execution. Install [CUDA](https://developer.nvidia.com/cuda-toolkit) 11.x to use the GPU.
+The Linux and Windows Python wheels support GPU execution. Install [CUDA](https://developer.nvidia.com/cuda-toolkit) 12.x to use the GPU.

-If you plan to run models with convolutional layers (e.g. for speech recognition), you should also install [cuDNN 8](https://developer.nvidia.com/cudnn) for CUDA 11.x.
+If you plan to run models with convolutional layers (e.g. for speech recognition), you should also install [cuDNN 8](https://developer.nvidia.com/cudnn) for CUDA 12.x.
```

```{note}
@@ -43,6 +43,8 @@ The images include:
docker run --rm ghcr.io/opennmt/ctranslate2:latest-ubuntu20.04-cuda11.2 --help
```

+Update to the latest image version to get CUDA 12 support.
+
```{admonition} GPU support
The Docker image supports GPU execution. Install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html) to use GPUs from Docker.
```
diff --git a/python/ctranslate2/version.py b/python/ctranslate2/version.py
index 1e4b744c1..8b519f976 100644
--- a/python/ctranslate2/version.py
+++ b/python/ctranslate2/version.py
@@ -1,3 +1,3 @@
"""Version information."""

-__version__ = "4.0.0"
+__version__ = "4.1.0"
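The CONTRIBUTING.md hunk above describes how some NVIDIA libraries are resolved lazily with `dlopen` instead of being linked at build time. Below is a minimal Python sketch of the same idea using `ctypes` (which wraps `dlopen` on Linux); it is an illustration only, not the actual `cublas_stub.cc` implementation, and the error message is an assumption:

```python
# Illustration of the dynamic-loading pattern described in CONTRIBUTING.md;
# the real stub is implemented in C++ in src/cuda/cublas_stub.cc.
import ctypes

def load_cublas():
    # ctypes.CDLL calls dlopen under the hood, so any installed CUDA 12.x
    # runtime that provides libcublas.so.12 satisfies this lookup.
    try:
        return ctypes.CDLL("libcublas.so.12")
    except OSError as exc:
        # Hypothetical error message; the wheel only needs the library
        # when GPU execution is actually requested.
        raise RuntimeError(
            "libcublas.so.12 not found: install a CUDA 12.x runtime."
        ) from exc

cublas = load_cublas()
print("cuBLAS loaded from:", cublas._name)
```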
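The tensor parallelism feature referenced in the README and CHANGELOG hunks is documented in docs/parallel.md. Here is a minimal usage sketch, assuming the `tensor_parallel` constructor flag from that documentation; the model path and input tokens are hypothetical, and the script is expected to be launched through an MPI runner:

```python
# run_tp.py -- sketch of tensor-parallel inference, assuming the
# tensor_parallel flag described in docs/parallel.md.
# Launch with an MPI runner, e.g.: mpirun -np 2 python run_tp.py
import ctranslate2

translator = ctranslate2.Translator(
    "ende_ctranslate2/",   # hypothetical path to a converted model
    device="cuda",
    tensor_parallel=True,  # shard the model weights across the visible GPUs
)

# Inputs are pre-tokenized; the tokens below are placeholders.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```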
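After installing the wheel together with a CUDA 12.x runtime, as required by the docs/installation.md hunk, GPU visibility can be checked from Python with the `ctranslate2.get_cuda_device_count()` helper:

```python
import ctranslate2

# Returns 0 when no compatible CUDA runtime or driver is visible,
# which usually indicates an incomplete CUDA 12.x installation.
num_gpus = ctranslate2.get_cuda_device_count()
if num_gpus == 0:
    print("No GPU detected: check the CUDA 12.x runtime and the driver.")
else:
    print(f"{num_gpus} GPU(s) available to CTranslate2.")
```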