Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache system documenation #44

Merged
merged 10 commits into from
Apr 25, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Move file
  • Loading branch information
michaelbenayoun committed Apr 25, 2023
commit bb97a7b772590afa73d8c8b91cf2cd97f0d73bbe
72 changes: 0 additions & 72 deletions docs/source/cache_system.mdx
michaelbenayoun marked this conversation as resolved.
Outdated
Show resolved Hide resolved

This file was deleted.

79 changes: 66 additions & 13 deletions docs/source/guides/cache_system.mdx
Original file line number Diff line number Diff line change
@@ -1,19 +1,72 @@
<!---
Copyright 2023 The HuggingFace Team. All rights reserved.
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Trainium model cache
# Optimum Neuron Compilation Cache System

TODO:
As you might have noticed while training your models on Trainium instances, Neuron cores work on models compiled to the `neff` format.
It means that training requires one or multiple compilation step(s) that can take a long time.

With [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx), a Neuron-compatible version of PyTorch, the compiled files are stored in a local directory once you performed compilation, usually `/var/tmp/neuron-compile-cache`. This way you do not need to perform compilation again when you start another training with similar parameters.

We go one step further with Optimum Neuron by plugging this local cache directory to the Hugging Face Hub. During training, the [`TrainiumTrainer`] will check if compilation files are available on the Hub and download them if they are found, allowing you to save both time and cost by skipping the compilation phase.

## How the caching system works

### Hash computation

Many factors can trigger compilation among which:

- The model
- The input shapes
- The precision of the model, full-precision or bf16
- The version of the compiler
- The number of Neuron cores used

Using all these parameters, we can compute a hash, and we can then compare hashes we have locally for our training session against hashes stored on the Hugging Face Hub, and act accordingly (download or push).


### Using a custom cache repo on the Hugging Face Hub

The official repo on the Hub storing compilation files is `aws-neuron/optimum-neuron-cache`. In this repo, we (the Optimum Neuron team) push compilation files for commonly used models so that it is publicly available and free to use for everyone. But there are two limitations:

1. You will not be able to push your own compiled files on this repo
2. It is public and you might want to use a private repo for private models

To alleviate that [you can create your own model repo on the Hub](https://huggingface.co/new) and set the environment variable `CUSTOM_CACHE_REPO`. For example, if you cache repo is called `michaelbenayoun/my_custom_cache_repo`, you just need to do:

```bash
CUSTOM_CACHE_REPO="michaelbenayoun/my_custom_cache_repo" torchrun ...
```

or:

```bash
export CUSTOM_CACHE_REPO="michaelbenayoun/my_custom_cache_repo"
torchrun ...
```

If you are [logged in the Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/quick-start#login) on the Trainium instance, you will then be able to fetch files both from your custom cache repo, which can be public or private, and from the official cache repo, as well as pushing your compiled files to your custom cache repo.

### Cache system flow

<p align="center">
<img alt="Cache system flow" src="https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/cache_system_flow.jpg">
<br>
<em style="color: grey">Cache system flow</em>
</p>


At each the beginning of each training step, the [`TrainiumTrainer`] computes a `NeuronHash` and checks the cache repo(s) (official and custom) on the Hugging Face Hub to see if there are compiled files associated to this hash.
If that is the case, the files are downloaded directly to the local cache directory and no compilation is needed. Otherwise compilation is performed.


Just as for downloading compiled files, the [`TrainiumTrainer`] will keep track of the newly created compilation files at each training step, and upload them to the Hugging Face Hub at save time or when training ends. This assumes that you have writing access to the cache repo, otherwise nothing will be pushed.