This repository was archived by the owner on Aug 7, 2025. It is now read-only.

Commit 2cad147

Merge branch 'master' into nproc
2 parents 21b569a + f57240f commit 2cad147

11 files changed (+91 / -12 lines)

README.md

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ Refer to [torchserve docker](docker/README.md) for details.
## 🏆 Highlighted Examples
+ * [Serving Llama 2 with TorchServe](examples/LLM/llama2/README.md)
* [Chatbot with Llama 2 on Mac 🦙💬](examples/LLM/llama2/chat_app)
* [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration / Flash Attention & Xformer Memory Efficient](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
* [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)

examples/LLM/llama2/README.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Llama 2: Next generation of Meta's Language Model

![Llama 2](./images/llama.png)

TorchServe supports serving Llama 2 in a number of ways. The examples covered in this document range from someone new to TorchServe learning how to serve Llama 2 with an app, to an advanced user of TorchServe using micro batching and streaming responses with Llama 2.
5+
6+
## 🦙💬 Llama 2 Chatbot
7+
8+
### [Example Link](https://github.com/pytorch/serve/tree/master/examples/LLM/llama2/chat_app)
9+
10+
This example shows how to deploy a llama2 chat app using TorchServe.
11+
We use [streamlit](https://github.com/streamlit/streamlit) to create the app
12+
13+
This example is using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
14+
15+
You can run this example on your laptop to understand how to use TorchServe, how to scale up/down TorchServe backend workers and play around with batch_size to see its effect on inference time
16+
17+
![Chatbot Architecture](./chat_app/screenshots/architecture.png)
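For illustration (not part of this commit), a minimal sketch of how a client such as the Streamlit app can scale backend workers and send a prompt through TorchServe's REST APIs; the model name `llamacpp` and the default ports 8080/8081 are assumptions.

```python
import requests

MODEL = "llamacpp"  # hypothetical registered model name

# Scale backend workers up or down via the management API (default port 8081).
requests.put(f"http://localhost:8081/models/{MODEL}", params={"min_worker": 2})

# Send a prompt to the inference API (default port 8080) and print the response.
resp = requests.post(f"http://localhost:8080/predictions/{MODEL}", data="What is TorchServe?")
print(resp.text)
```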
## Llama 2 with HuggingFace

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2)

This example shows how to serve the Llama 2 70B model with limited resources using [HuggingFace](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf). It shows the following optimizations:

1) HuggingFace `accelerate`. This option can be activated with `low_cpu_mem_usage=True`.
2) Quantization from [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) using `load_in_8bit=True`.

The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint); see the sketch below.
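For illustration (not part of this commit), a minimal sketch of the two optimizations above using the standard `transformers` API; the exact arguments in the linked example may differ, and `device_map="auto"` is an added assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# low_cpu_mem_usage=True: build the model on the meta device (empty weights) via accelerate,
# then load the checkpoint shard by shard.
# load_in_8bit=True: quantize weights to 8-bit with bitsandbytes at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    load_in_8bit=True,
    device_map="auto",
)
```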
## Llama 2 on Inferentia

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2/llama2)

### [PyTorch Blog](https://pytorch.org/blog/high-performance-llama/)

This example shows how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1) support.

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.

![Inferentia 2 Software Stack](./images/software_stack_inf2.jpg)
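For illustration (not part of this commit), a minimal sketch of consuming a streaming response from the TorchServe inference API with `requests`; the registered model name `llama-2-13b` is an assumption, and the linked docs show the equivalent `curl` call.

```python
import requests

MODEL = "llama-2-13b"  # hypothetical registered model name

# TorchServe streams intermediate results as chunked HTTP; print text as it arrives.
with requests.post(
    f"http://localhost:8080/predictions/{MODEL}",
    data="Today the weather is really nice and I am planning on",
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)
```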

examples/LLM/llama2/chat_app/client_app.py

Lines changed: 0 additions & 1 deletion
@@ -6,7 +6,6 @@
# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

- # Replicate Credentials
with st.sidebar:
    st.title("🦙💬 Llama 2 Chatbot")

examples/LLM/llama2/images/llama.png

1.79 MB (new image file)

(second new image file, 39.4 KB)

examples/pt2/README.md

Lines changed: 47 additions & 6 deletions
@@ -13,7 +13,7 @@ python ts_scripts/install_dependencies.py --cuda=cu118
pip install torchserve torch-model-archiver
```

- ## Package your model
+ ## torch.compile

PyTorch 2.0 supports several compiler backends; you pick the one you want by passing an optional `model_config.yaml` file during model packaging.
@@ -34,10 +34,10 @@ The exact same approach works with any other model, what's going on is the below
opt_mod = torch.compile(mod)
# 2. Train the optimized module
# ....
- # 3. Save the original module (weights are shared)
- torch.save(model, "model.pt")
+ # 3. Save the optimized module's state dict
+ torch.save(opt_mod.state_dict(), "model.pt")

- # 4. Load the non optimized model
+ # 4. Reload the model
mod = torch.load("model.pt")

# 5. Compile the module and then run inferences with it
@@ -46,6 +46,47 @@ opt_mod = torch.compile(mod)
TorchServe takes care of steps 4 and 5 for you, while the remaining steps are your responsibility. You can do the exact same thing with the vast majority of TIMM or HuggingFace models.

- ## Next steps
+ ## torch.export.export

Export your model from a training script; keep in mind that an exported model cannot have graph breaks.

```python
import io
import torch

class MyModule(torch.nn.Module):
    def forward(self, x):
        return x + 10

ep = torch.export.export(MyModule(), (torch.randn(5),))

# Save to file
# torch.export.save(ep, 'exported_program.pt2')
extra_files = {'foo.txt': b'bar'.decode('utf-8')}
torch.export.save(ep, 'exported_program.pt2', extra_files=extra_files)

# Save to io.BytesIO buffer
buffer = io.BytesIO()
torch.export.save(ep, buffer)
```

Serve your exported model from a custom handler:

```python
# from initialize()
ep = torch.export.load('exported_program.pt2')

with open('exported_program.pt2', 'rb') as f:
    buffer = io.BytesIO(f.read())
buffer.seek(0)
extra_files = {'foo.txt': ''}  # filled in by torch.export.load
ep = torch.export.load(buffer, extra_files=extra_files)

# Make sure everything looks good
print(ep)
print(extra_files['foo.txt'])

# from inference()
print(ep(torch.randn(5)))
```

- For now PyTorch 2.0 has mostly been focused on accelerating training so production grade applications should instead opt for TensorRT for accelerated inference performance which is also natively supported in torchserve. We just wanted to make it really easy for users to experiment with the PyTorch 2.0 stack. You can learn more here https://github.com/pytorch/serve/blob/master/docs/performance_guide.md
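For illustration (not part of this commit), a minimal sketch of a custom handler that places the `initialize()` and `inference()` fragments above into TorchServe's `BaseHandler` shape; the file name, input format, and pre/post-processing are assumptions.

```python
import io

import torch
from ts.torch_handler.base_handler import BaseHandler


class ExportedProgramHandler(BaseHandler):
    """Serves a program saved with torch.export.save as exported_program.pt2 (assumed name)."""

    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        with open(f"{model_dir}/exported_program.pt2", "rb") as f:
            buffer = io.BytesIO(f.read())
        buffer.seek(0)
        self.ep = torch.export.load(buffer)
        self.initialized = True

    def preprocess(self, data):
        # Assumes each request body is a JSON list of floats, e.g. [1.0, 2.0, 3.0, 4.0, 5.0].
        row = data[0].get("data") or data[0].get("body")
        return torch.tensor(row, dtype=torch.float32)

    def inference(self, x):
        # An ExportedProgram can be called directly on example-shaped inputs (torch 2.1).
        return self.ep(x)

    def postprocess(self, output):
        return [output.tolist()]
```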

requirements/torch_cu121_linux.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
#pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
- --extra-index-url https://download.pytorch.org/whl/test/cu121
+ --extra-index-url https://download.pytorch.org/whl/cu121
-r torch_common.txt
torch==2.1.0+cu121; sys_platform == 'linux'
torchvision==0.16.0+cu121; sys_platform == 'linux'

requirements/torch_darwin.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
#pip install torch torchvision torchaudio
- --extra-index-url https://download.pytorch.org/whl/test/cpu
+ --extra-index-url https://download.pytorch.org/whl/cpu
-r torch_common.txt
torch==2.1.0; sys_platform == 'darwin'
torchvision==0.16.0; sys_platform == 'darwin'

requirements/torch_linux.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
#pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
- --extra-index-url https://download.pytorch.org/whl/test/cpu
+ --extra-index-url https://download.pytorch.org/whl/cpu
-r torch_common.txt
torch==2.1.0+cpu; sys_platform == 'linux'
torchvision==0.16.0+cpu; sys_platform == 'linux'

requirements/torch_windows.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
#pip install torch torchvision torchaudio
- --extra-index-url https://download.pytorch.org/whl/test/cpu
+ --extra-index-url https://download.pytorch.org/whl/cpu
-r torch_common.txt
torch==2.1.0; sys_platform == 'win32'
torchvision==0.16.0; sys_platform == 'win32'

0 commit comments
