Commit 8696139

Update the README (#112)
Update the README to reflect the current state of the CLI. Additionally, let's temporarily (until v0.2.0) direct users to install from the main branch.
1 parent 09655a4 commit 8696139

File tree: 1 file changed (+12 −13)


README.md

Lines changed: 12 additions & 13 deletions
@@ -36,12 +36,12 @@ Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inf
 Before installing, ensure you have the following prerequisites:

 - OS: Linux or MacOS
-- Python: 3.8 – 3.12
+- Python: 3.9 – 3.13

-GuideLLM is available on PyPI and is installed using `pip`:
+GuideLLM can be installed using pip:

 ```bash
-pip install guidellm
+pip install git+https://github.com/neuralmagic/guidellm.git
 ```

 For detailed installation instructions and requirements, see the [Installation Guide](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md).
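Since installs from the main branch are not pinned to a release, it can help to record exactly what was installed. A minimal sketch, using only generic pip behavior (the `@main` ref pin and `pip show` are standard pip features, not GuideLLM-specific):

```bash
# Pin the install to an explicit ref for reproducibility;
# pip's VCS syntax accepts a branch, tag, or commit SHA after '@'
pip install "git+https://github.com/neuralmagic/guidellm.git@main"

# Record the resolved version of what was installed
pip show guidellm
```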
@@ -79,11 +79,11 @@ For more information on starting a TGI server, see the [TGI Documentation](https
 To run a GuideLLM evaluation, use the `guidellm` command with the appropriate model name and options, on the server hosting the model or on one with network access to the deployment server. For example, to evaluate the full performance range of the previously deployed Llama 3.1 8B model, run the following command:

 ```bash
-guidellm \
-  --target "http://localhost:8000/v1" \
+guidellm benchmark \
+  --target "http://localhost:8000" \
   --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
-  --data-type emulated \
-  --data "prompt_tokens=512,generated_tokens=128"
+  --rate-type sweep \
+  --data "prompt_tokens=256,output_tokens=128"
 ```

 The above command will begin the evaluation and output progress updates similar to the following (if running on a different server, be sure to update the target!): <img src="https://raw.githubusercontent.com/neuralmagic/guidellm/main/docs/assets/sample-benchmarks.gif"/>
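For context, the vLLM server targeted above can be started roughly as follows (a sketch assuming a recent vLLM with the `vllm serve` entrypoint; the README's own deployment section is the authoritative reference):

```bash
# Start an OpenAI-compatible vLLM server for the model under test;
# vLLM listens on port 8000 by default, matching the --target above
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```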
@@ -92,7 +92,8 @@ Notes:

 - The `--target` flag specifies the server hosting the model. In this case, it is a local vLLM server.
 - The `--model` flag specifies the model to evaluate. The model name should match the name of the model deployed on the server.
-- By default, GuideLLM will run a `sweep` of performance evaluations across different request rates, each lasting 120 seconds, and print the results to the terminal.
+- The `--rate-type` flag specifies the load-generation pattern GuideLLM uses when sending requests to the server. If `sweep` is specified, GuideLLM will run multiple performance evaluations across different request rates.
+- By default, GuideLLM runs over a fixed workload of 1000 requests, configurable with `--max-requests`. If `--max-seconds` is set, GuideLLM will instead run over a fixed amount of time.
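To make the fixed-requests vs. fixed-time distinction above concrete, a time-bounded variant of the earlier command might look like this (illustrative only; the flags are the ones named in this commit, but the combination and values are assumptions):

```bash
# Run each benchmark for a fixed 120 seconds rather than the default
# 1000-request workload (--max-seconds switches the run to a time budget)
guidellm benchmark \
  --target "http://localhost:8000" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --rate-type sweep \
  --max-seconds 120 \
  --data "prompt_tokens=256,output_tokens=128"
```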

 #### 3. Analyze the Results

@@ -126,11 +127,9 @@ Some typical configurations for the CLI include:
 - `--rate-type throughput`: Throughput runs requests in a throughput manner, sending requests as fast as possible.
 - `--rate-type constant`: Constant runs requests at a constant rate. Specify the request rate per second with the `--rate` argument. For example, `--rate 10`, or multiple rates with `--rate 10 --rate 20 --rate 30`.
 - `--rate-type poisson`: Poisson draws from a Poisson distribution with the mean at the specified rate, adding some real-world variance to the runs. Specify the request rate per second with the `--rate` argument. For example, `--rate 10`, or multiple rates with `--rate 10 --rate 20 --rate 30`.
-- `--data-type`: The data to use for the benchmark. Options include `emulated`, `transformers`, and `file`.
-- `--data-type emulated`: Emulated supports an EmulationConfig in string or file format for the `--data` argument to generate fake data. Specify the number of prompt tokens at a minimum, and optionally the number of output tokens and other parameters for variance in the length. For example, `--data "prompt_tokens=128"`, `--data "prompt_tokens=128,generated_tokens=128"`, or `--data "prompt_tokens=128,prompt_tokens_variance=10"`.
-- `--data-type file`: File supports a file path or URL to a file for the `--data` argument. The file should contain data encoded as CSV, JSONL, TXT, or JSON/YAML, with a single prompt per line for CSV, JSONL, and TXT, or a list of prompts for JSON/YAML. For example, `--data "data.txt"` where data.txt contents are `"prompt1\nprompt2\nprompt3"`.
-- `--data-type transformers`: Transformers supports a dataset name or file path for the `--data` argument. For example, `--data "neuralmagic/LLM_compression_calibration"`.
-- `--max-seconds`: The maximum number of seconds to run each benchmark. The default is 120 seconds.
+- `--rate-type concurrent`: Concurrent runs requests at a fixed concurrency. When a request completes, it is immediately replaced with a new request to maintain the set concurrency. Specify the request concurrency with the `--rate` argument.
+- `--data`: A Hugging Face dataset name, or arguments for generating a synthetic dataset.
+- `--max-seconds`: The maximum number of seconds to run each benchmark.
 - `--max-requests`: The maximum number of requests to run in each benchmark.
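Combining the options above, a fixed-concurrency run against the same local server might look like the following sketch (the flag names come from this commit; the specific combination and values are assumptions):

```bash
# Keep 8 requests in flight at all times; each completed request is
# immediately replaced, per the --rate-type concurrent description above
guidellm benchmark \
  --target "http://localhost:8000" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --rate-type concurrent \
  --rate 8 \
  --data "prompt_tokens=256,output_tokens=128"
```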

 For a complete list of supported CLI arguments, run the following command:
