Commit

add docs

minhthuc2502 committed Mar 4, 2024
1 parent 1da6b31 commit ac8f7ae
Showing 2 changed files with 44 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -34,6 +34,7 @@ The project is production-oriented and comes with backward compatibility guarantees
* **Lightweight on disk**<br/>Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
* **Simple integration**<br/>The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
* **Configurable and interactive decoding**<br/>[Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
* **Tensor parallelism**<br/>Large models can be split across multiple GPUs for distributed inference.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

44 changes: 43 additions & 1 deletion docs/parallel.md
@@ -42,8 +42,50 @@ Parallelization with multiple Python threads is possible because all computation

## Model and tensor parallelism
Models such as the [`Translator`](python/ctranslate2.Translator.rst) and [`Generator`](python/ctranslate2.Generator.rst) can be split across multiple GPUs. This is very useful when the model is too big to be loaded on a single GPU.

Tensor parallelism is enabled when loading the model:
```python
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)
```
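The same flag should apply when loading a [`Generator`](python/ctranslate2.Generator.rst); a minimal sketch, assuming a converted generation model directory (the path is hypothetical):

```python
import ctranslate2

# Hypothetical model directory; tensor_parallel shards the weights across ranks.
generator = ctranslate2.Generator("gpt2_ctranslate2/", device="cuda", tensor_parallel=True)
```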

Set up the environment:
* Install [Open MPI](https://www.open-mpi.org/)
* Configure Open MPI by creating a host file, for example ``hostfile``:
```bash
[ip address or dns] slots=nbGPU1
[other ip address or dns] slots=nbGPU2
```
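For example, a hypothetical host file for two machines contributing 2 GPUs and 1 GPU respectively (the addresses are placeholders):

```bash
# hostfile: one machine per line, slots = number of GPUs to use on it
192.168.1.10 slots=2
node2.example.com slots=1
```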

Run:
* Run the application in multiprocess to use tensor parallelism:
```bash
mpirun -np nbGPUExpected -hostfile hostfile python3 script
```
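For instance, with the hypothetical host file above declaring 3 GPUs and a script named `translate.py`:

```bash
mpirun -np 3 -hostfile hostfile python3 translate.py
```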

If you're trying to use tensor parallelism across multiple machines, some additional configuration is needed:

* Make sure the master and slave machines can connect to each other over SSH using public key authentication.
* Export all necessary environment variables from the master to the slaves, as in the example below:
```bash
mpirun -x VIRTUAL_ENV_PROMPT -x PATH -x VIRTUAL_ENV -x _ -x LD_LIBRARY_PATH -np nbGPUExpected -hostfile hostfile python3 script
```
See the [Open MPI docs](https://www.open-mpi.org/doc/) for more information.

* In this mode, the application will run in multiple processes. You can limit output to the master process by checking the current rank:
```python
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(...)
```
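Putting the pieces together, a minimal sketch of a `translate.py` that could be launched with the `mpirun` command above (the model directory and the pre-tokenized input are hypothetical):

```python
import ctranslate2

# Every rank loads its shard of the model and takes part in the computation.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", tensor_parallel=True)

# The translation API is the same as in the single-GPU case.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])

# Only the master process prints the result.
if ctranslate2.MpiInfo.getCurRank() == 0:
    print(results[0].hypotheses[0])
```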

```{note}
Running a model in tensor parallel mode on a single machine can boost performance, but running a model shared between multiple machines could be slower because of the network latency between them.

```

```{note}
In tensor parallel mode, `inter_threads` is still supported to run multiple workers. However, `device_index` no longer has any effect, because tensor parallel mode only checks for the available GPUs on the system and the number of GPUs you want to use.

```
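For instance, a sketch requesting two workers while letting tensor parallel mode pick the GPUs (the model path is a placeholder):

```python
# inter_threads=2 runs two workers; device_index is intentionally omitted.
translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True, inter_threads=2)
```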

## Asynchronous execution


3 comments on commit ac8f7ae

@panosk (Contributor) commented on ac8f7ae, Mar 4, 2024:

Hello @minhthuc2502,

Some minor improvements in wording.

@minhthuc2502 (Collaborator, Author) commented:

Thank you @panosk for your help. Sorry for my poor English.

@panosk (Contributor) commented:

Np, thanks for your work on this!