[Docs] Update example for Llama3 #2169

Merged · 1 commit · Apr 19, 2024
18 changes: 8 additions & 10 deletions docs/deploy/cli.rst
@@ -54,13 +54,13 @@ To run a model with MLC LLM in any platform, you can either:
**Option 1: Use model prebuilts**

To run ``mlc_llm``, you can specify the Huggingface MLC prebuilt model repo path with the prefix ``HF://``.
-For example, to run the MLC Llama 2 7B Q4F16_1 model (`Repo link <https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC>`_),
-simply use ``HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC``. The model weights and library will be downloaded
+For example, to run the MLC Llama 3 8B Q4F16_1 model (`Repo link <https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC>`_),
+simply use ``HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC``. The model weights and library will be downloaded
automatically from Huggingface.

.. code:: shell

-mlc_llm chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC --device "cuda:0" --overrides context_window_size=1024
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device "cuda:0" --overrides context_window_size=1024

.. code:: shell

@@ -74,13 +74,11 @@
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.

-[INST]: What's the meaning of life
-[/INST]:
-Ah, a question that has puzzled philosophers and theologians for centuries! The meaning
-of life is a deeply personal and subjective topic, and there are many different
-perspectives on what it might be. However, here are some possible answers that have been
-proposed by various thinkers and cultures:
-...
+user: What's the meaning of life
+assistant:
+What a profound and intriguing question! While there's no one definitive answer, I'd be happy to help you explore some perspectives on the meaning of life.
+
+The concept of the meaning of life has been debated and...


**Option 2: Use locally compiled model weights and libraries**
10 changes: 5 additions & 5 deletions docs/get_started/introduction.rst
@@ -37,7 +37,7 @@ You can run MLC chat through a one-liner command:

.. code:: bash

-mlc_llm chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

It may take 1-2 minutes the first time you run this command.
After the download completes, the command launches a chat interface where you can enter your prompt and chat with the model.
@@ -91,7 +91,7 @@ You can save the code below into a Python file and run it.
from mlc_llm import LLMEngine

# Create engine
-model = "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC"
+model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Run chat completion in OpenAI API.
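The rest of this snippet is collapsed in the diff view above. Purely as a reference, a minimal sketch of how the OpenAI-style chat completion call against this engine might continue is shown below; the exact method names (``engine.chat.completions.create``, ``engine.terminate``) and the streaming response shape are assumptions about the ``mlc_llm`` API of this era, not part of this PR.

.. code:: python

   # Hypothetical continuation (not shown in this diff): stream a chat
   # completion from the engine via its OpenAI-compatible interface.
   for response in engine.chat.completions.create(
       messages=[{"role": "user", "content": "What is the meaning of life?"}],
       model=model,
       stream=True,
   ):
       for choice in response.choices:
           # Guard against empty deltas at the end of the stream.
           if choice.delta.content:
               print(choice.delta.content, end="", flush=True)
   print()

   # Release engine resources when done (assumed API).
   engine.terminate()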
@@ -142,7 +142,7 @@ for OpenAI chat completion requests. The server can be launched in command line

.. code:: bash

-mlc_llm serve HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
+mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

The server is hooked at ``http://127.0.0.1:8000`` by default, and you can use ``--host`` and ``--port``
to set a different host and port.
@@ -154,7 +154,7 @@ we can open a new shell and send a cURL request via the following command:
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC",
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"messages": [
{"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
]
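The closing lines of this cURL command (the final brace and the endpoint URL) are collapsed in the diff above. For illustration only, a Python equivalent of the complete request could look roughly like the sketch below; the ``requests`` dependency, the ``/v1/chat/completions`` endpoint path, and the OpenAI-style response shape are assumptions here, not part of this PR.

.. code:: python

   # Illustrative only: send the same chat completion request to the
   # locally running MLC LLM server on the default host and port.
   import requests

   payload = {
       "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
       "messages": [
           {
               "role": "user",
               "content": "Hello! Our project is MLC LLM. What is the name of our project?",
           }
       ],
   }
   # Assumes the server exposes an OpenAI-compatible chat completions endpoint.
   response = requests.post(
       "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=120
   )
   print(response.json()["choices"][0]["message"]["content"])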
@@ -280,7 +280,7 @@ environments (e.g. SteamDeck).

.. code:: bash

-mlc_llm chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC --device vulkan
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan

The same core LLM runtime engine powers all the backends, enabling the same model to be deployed across backends as
long as they fit within the memory and computing budget of the corresponding hardware backend.
8 changes: 4 additions & 4 deletions docs/get_started/quick_start.rst
@@ -23,7 +23,7 @@ It is recommended to have at least 6GB free VRAM to run it.
from mlc_llm import LLMEngine

# Create engine
-model = "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC"
+model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Run chat completion in OpenAI API.
@@ -57,7 +57,7 @@ It is recommended to have at least 6GB free VRAM to run it.

.. code:: shell

-mlc_llm serve HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
+mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

**Send requests to server.** When the server is ready (showing ``INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)``),
open a new shell and send a request via the following command:
@@ -67,7 +67,7 @@ It is recommended to have at least 6GB free VRAM to run it.
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC",
"model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
"messages": [
{"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
]
@@ -94,7 +94,7 @@ It is recommended to have at least 6GB free VRAM to run it.

.. code:: bash

-mlc_llm chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC


If you are using Windows/Linux/SteamDeck and would like to use Vulkan,
2 changes: 1 addition & 1 deletion docs/prebuilt_models.rst
@@ -68,7 +68,7 @@ For more, please see :ref:`the CLI page <deploy-cli>`, and the :ref:`the Python

.. code:: shell

-mlc_llm chat HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
+mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC


To run the model with Python API, see :ref:`the Python page <deploy-python-chat-module>` (all other downloading steps are the same as CLI).
2 changes: 1 addition & 1 deletion examples/python/sample_mlc_engine.py
@@ -1,7 +1,7 @@
from mlc_llm import LLMEngine

# Create engine
-model = "HF://mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC"
+model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = LLMEngine(model)

# Run chat completion in OpenAI API.