[tnx] version bump Neuron SDK and Optimum (deepjavalibrary#1826)
tosterberg authored Apr 28, 2024
1 parent 63f2f45 commit 1139bde
Showing 3 changed files with 22 additions and 5 deletions.
11 changes: 11 additions & 0 deletions engines/python/src/main/java/ai/djl/python/engine/Connection.java
@@ -186,6 +186,10 @@ static String[] getPythonStartCmd(PyEnv pyEnv, Model model, int workerId, int po
            // TODO: re-map logic device once neuron fixed bug
            pyEnv.addEnv("NEURON_RT_VISIBLE_CORES", visibleCores);
            logger.info("Set NEURON_RT_VISIBLE_CORES={}", visibleCores);
+
+           String neuronThreads = getNeuronThreads(tensorParallelDegree);
+           pyEnv.addEnv("OMP_NUM_THREADS", neuronThreads);
+           logger.info("Set OMP_NUM_THREADS={}", neuronThreads);
        }
        boolean uds = Epoll.isAvailable() || KQueue.isAvailable();
        String[] args = new String[12];
@@ -231,6 +235,13 @@ private static String getNeuronVisibleCores(int deviceId, int tensorParallelDegr
        return String.valueOf(deviceId);
    }
+
+   private static String getNeuronThreads(int tensorParallelDegree) {
+       if (tensorParallelDegree > 0) {
+           return String.valueOf(tensorParallelDegree * 2);
+       }
+       return String.valueOf(1);
+   }

    void connect() throws InterruptedException {
        EventLoopGroup group = PyEnv.getEventLoopGroup();

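To make the new heuristic easy to scan outside the diff, here is a minimal, self-contained Java sketch (illustrative only: the class wrapper and `main` are not part of the commit) of how the `OMP_NUM_THREADS` value is derived:

```java
// Mirrors the getNeuronThreads logic added above: allot two host threads per
// tensor-parallel rank, and fall back to a single thread when no tensor
// parallel degree is configured.
public class NeuronThreadsSketch {

    static String getNeuronThreads(int tensorParallelDegree) {
        if (tensorParallelDegree > 0) {
            return String.valueOf(tensorParallelDegree * 2);
        }
        return String.valueOf(1);
    }

    public static void main(String[] args) {
        System.out.println(getNeuronThreads(8)); // "16" -> OMP_NUM_THREADS=16
        System.out.println(getNeuronThreads(0)); // "1"  -> no tensor parallelism
    }
}
```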
8 changes: 4 additions & 4 deletions serving/docker/pytorch-inf2.Dockerfile
@@ -14,17 +14,17 @@ ARG djl_version=0.28.0~SNAPSHOT
ARG torch_version=2.1.2
ARG torchvision_version=0.16.2
ARG python_version=3.9
-ARG neuronsdk_version=2.18.1
+ARG neuronsdk_version=2.18.2
ARG torch_neuronx_version=2.1.2.2.1.0
ARG transformers_neuronx_version=0.10.0.360
ARG neuronx_distributed_version=0.7.0
-ARG neuronx_cc_version=2.13.68.0
+ARG neuronx_cc_version=2.13.72.0
ARG protobuf_version=3.19.6
ARG transformers_version=4.36.2
ARG accelerate_version=0.23.0
ARG diffusers_version=0.26.1
ARG pydantic_version=2.6.1
-ARG optimum_neuron_version=0.0.20
+ARG optimum_neuron_version=0.0.21
ARG vllm_wheel="https://publish.djl.ai/neuron_vllm/vllm-nightly-py3-none-any.whl"
EXPOSE 8080

@@ -75,7 +75,7 @@ RUN mkdir -p /opt/djl/bin && cp scripts/telemetry.sh /opt/djl/bin && \
    neuronx-cc==${neuronx_cc_version} torch-neuronx==${torch_neuronx_version} transformers-neuronx==${transformers_neuronx_version} \
    neuronx_distributed==${neuronx_distributed_version} protobuf==${protobuf_version} sentencepiece jinja2 \
    diffusers==${diffusers_version} opencv-contrib-python-headless Pillow --extra-index-url=https://pip.repos.neuron.amazonaws.com \
-   pydantic==${pydantic_version} optimum optimum-neuron==${optimum_neuron_version} tiktoken blobfile && \
+   pydantic==${pydantic_version} optimum optimum-neuron==${optimum_neuron_version} tiktoken blobfile \
    torchvision==${torchvision_version} && \
    scripts/install_s5cmd.sh x64 && \
    scripts/patch_oss_dlc.sh python && \
8 changes: 7 additions & 1 deletion serving/docs/lmi/user_guides/tnx_user_guide.md
@@ -19,6 +19,7 @@ The model architectures that are tested daily for LMI Transformers-NeuronX (in C

- LLAMA
- Mistral
+- Mixtral
- GPT-NeoX
- GPT-J
- Bloom
@@ -32,8 +33,9 @@ The model architectures that are tested daily for LMI Transformers-NeuronX (in C
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
-- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
+- LLaMA, LLaMA-2, LLaMA-3 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `meta-llama/Meta-Llama-3-70B`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
+- Mixtral (`mistralai/Mixtral-8x7B-Instruct-v0.1`)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)

We will add support for more models in future versions and include them in our daily testing. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) to request additional model coverage in CI.
@@ -99,3 +101,7 @@ In that situation, there is nothing LMI can do until the issue is fixed in the b
| option.group_query_attention | >= 0.26.0 | Pass Through | Enable K/V cache sharding for llama and mistral model types based on various [strategies](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta) | `shard-over-heads` Default: `None` |
| option.enable_mixed_precision_accumulation | >= 0.26.0 | Pass Through | Turn this on for the LLAMA 70B model to achieve better accuracy. | `true` Default: `None` |

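These options are normally supplied through a `serving.properties` file. A hypothetical fragment using the two rows above (values illustrative, not taken from this commit):

```
option.group_query_attention=shard-over-heads
option.enable_mixed_precision_accumulation=true
```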
+## Advanced Multi-Model Inference Considerations
+
+When using LMI Transformers-NeuronX for multi-model inference endpoints, you may need to limit the number of threads available to each model.
+Follow this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html?highlight=omp_num#running-inference-with-multiple-models) to set the correct number of threads and avoid race conditions. In its standard configuration, LMI Transformers-NeuronX sets `OMP_NUM_THREADS` to two times the tensor parallel degree.
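As a concrete illustration of that default (values assumed for the example): with a tensor parallel degree of 4, each worker would be launched with an environment like the following, where the core range is hypothetical and depends on which NeuronCores the worker is assigned.

```
NEURON_RT_VISIBLE_CORES=0-3   # hypothetical core assignment for this worker
OMP_NUM_THREADS=8             # 2 x tensor parallel degree, per the rule above
```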
