Commit

add docs

minhthuc2502 committed Mar 4, 2024
1 parent 1da6b31 commit ac8f7ae
Showing 2 changed files with 44 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -34,6 +34,7 @@ The project is production-oriented and comes with backward compatibility guarantees
* **Lightweight on disk**<br/>Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
* **Simple integration**<br/>The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
* **Configurable and interactive decoding**<br/>[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
* **Tensor parallelism**<br/>Large models can be split across multiple GPUs for distributed inference.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

44 changes: 43 additions & 1 deletion docs/parallel.md
@@ -42,8 +42,50 @@ Parallelization with multiple Python threads is possible because all computation

## Model and tensor parallelism
Models such as the [`Translator`](python/ctranslate2.Translator.rst) and [`Generator`](python/ctranslate2.Generator.rst) can be split across multiple GPUs. This is very useful when the model is too big to be loaded on a single GPU.

Tensor parallelism is enabled when loading the model:
```python
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)
```
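The same flag should apply when loading a [`Generator`](python/ctranslate2.Generator.rst); a minimal sketch, assuming a converted generation model directory (the path is hypothetical):

```python
import ctranslate2

# Hypothetical model directory; tensor_parallel shards the weights across ranks.
generator = ctranslate2.Generator("gpt2_ctranslate2/", device="cuda", tensor_parallel=True)
```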

Set up the environment:
* Install [Open MPI](https://www.open-mpi.org/)
* Configure Open MPI by creating a host file, for example ``hostfile``:
```bash
[ip address or dns] slots=nbGPU1
[other ip address or dns] slots=nbGPU2
```
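For example, a hypothetical host file for two machines contributing 2 GPUs and 1 GPU respectively (the addresses are placeholders):

```bash
# hostfile: one machine per line, slots = number of GPUs to use on it
192.168.1.10 slots=2
node2.example.com slots=1
```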

Run:
* Run the application in multiprocess to use tensor parallelism:
```bash
mpirun -np nbGPUExpected -hostfile hostfile python3 script
```
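For instance, with the hypothetical host file above declaring 3 GPUs and a script named `translate.py`:

```bash
mpirun -np 3 -hostfile hostfile python3 translate.py
```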

If you're trying to use tensor parallelism across multiple machines, some additional configuration is needed:

* Make sure the master and slave machines can connect to each other over SSH using public key authentication.
* Export all necessary environment variables from the master to the slaves, as in the example below:
```bash
mpirun -x VIRTUAL_ENV_PROMPT -x PATH -x VIRTUAL_ENV -x _ -x LD_LIBRARY_PATH -np nbGPUExpected -hostfile hostfile python3 script
```
See the [Open MPI docs](https://www.open-mpi.org/doc/) for more information.

* In this mode, the application will run in multiple processes. You can limit output to the master process by checking the current rank:
```python
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(...)
```
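Putting the pieces together, a minimal sketch of a `translate.py` that could be launched with the `mpirun` command above (the model directory and the pre-tokenized input are hypothetical):

```python
import ctranslate2

# Every rank loads its shard of the model and takes part in the computation.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", tensor_parallel=True)

# The translation API is the same as in the single-GPU case.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])

# Only the master process prints the result.
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(results[0].hypotheses[0])
```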

```{note}
Running a model in tensor parallel mode on a single machine can boost performance, but running a model shared between multiple machines could be slower because of the network latency between them.

```

```{note}
In tensor parallel mode, `inter_threads` is still supported to run multiple workers. However, `device_index` no longer has any effect, because tensor parallel mode only checks for the available GPUs on the system and the number of GPUs you want to use.

```
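For instance, a sketch requesting two workers while letting tensor parallel mode pick the GPUs (the model path is a placeholder):

```python
# inter_threads=2 runs two workers; device_index is intentionally omitted.
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True, inter_threads=2)
```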

## Asynchronous execution


3 comments on commit ac8f7ae

@panosk (Contributor) commented on ac8f7ae, Mar 4, 2024:

Hello @minhthuc2502,

Some minor improvements in wording.

@minhthuc2502 (Collaborator, Author) commented:

Thank you @panosk for your help. Sorry for my poor English.

@panosk (Contributor) commented:

Np, thanks for your work on this!