
Commit 3be2b5d

v0.2.0 Version Update and Docs Expansions (#118)
Update version to 0.2.0 and expand docs to account for the latest changes along with backfill of missing docs
1 parent 3b18458 · commit 3be2b5d

18 files changed: +662, -86 lines

README.md

Lines changed: 71 additions & 62 deletions
Large diffs are not rendered by default.

docs/architecture.md

Lines changed: 97 additions & 1 deletion
@@ -1 +1,97 @@
-# Coming Soon

# GuideLLM Architecture

GuideLLM is designed to evaluate and optimize large language model (LLM) deployments by simulating real-world inference workloads. The architecture is modular, enabling flexibility and scalability. Below is an overview of the core components and their interactions.

```
+------------------+      +------------------+      +------------------+
|  DatasetCreator  | ---> |  RequestLoader   | ---> |    Scheduler     |
+------------------+      +------------------+      +------------------+
                                                       /      |      \
                                                      /       |       \
                                                     /        |        \
                                                    v         v         v
                                     +------------------+    +------------------+
                                     |  RequestsWorker  |    |  RequestsWorker  |
                                     +------------------+    +------------------+
                                              |                        |
                                              v                        v
                                     +------------------+    +------------------+
                                     |     Backend      |    |     Backend      |
                                     +------------------+    +------------------+
                                              |                        |
                                              v                        v
                                     +---------------------------------------+
                                     |          BenchmarkAggregator          |
                                     +---------------------------------------+
                                                         |
                                                         v
                                               +------------------+
                                               |   Benchmarker    |
                                               +------------------+
```
## Core Components

### 1. **Backend**

The `Backend` is an abstract interface for interacting with generative AI backends. It is responsible for processing requests and generating results. GuideLLM supports OpenAI-compatible HTTP servers, such as vLLM, as backends.

- **Responsibilities:**
  - Accept requests from the `RequestsWorker`.
  - Generate responses for text or chat completions.
  - Validate backend readiness and available models.
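To make the idea of an "abstract interface" concrete, the sketch below shows roughly what such a backend contract can look like. The class and method names here are hypothetical and chosen for illustration; they are not GuideLLM's actual classes.

```python
# Hypothetical sketch of a backend interface; the names are illustrative only
# and do not mirror GuideLLM's real API.
from abc import ABC, abstractmethod
from typing import AsyncIterator


class GenerativeBackend(ABC):
    """Contract a worker relies on when talking to an inference server."""

    @abstractmethod
    async def check_ready(self) -> None:
        """Raise if the server is unreachable or the model is unavailable."""

    @abstractmethod
    async def available_models(self) -> list[str]:
        """Return the model identifiers the server exposes."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        """Stream generated tokens for a text or chat completion request."""
```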
### 2. **RequestLoader**

The `RequestLoader` handles sourcing data from an iterable and generating requests for the backend. It ensures that data is properly formatted and ready for processing.

- **Responsibilities:**
  - Load data from datasets or synthetic sources.
  - Generate requests in a format compatible with the backend.
### 3. **DatasetCreator**

The `DatasetCreator` is responsible for loading data sources and converting them into Hugging Face (HF) dataset items. These items can then be streamed by the `RequestLoader`.

- **Responsibilities:**
  - Load datasets from local files, Hugging Face datasets, or synthetic data.
  - Convert data into a format compatible with the `RequestLoader`.
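For intuition, the kind of Hugging Face dataset items this stage works with can be produced directly with the `datasets` library. The dataset name and column below are arbitrary examples, not defaults that GuideLLM ships with.

```python
# Illustrative only: load a Hugging Face dataset and read out prompt text.
# "openai/gsm8k" and its "question" column are example choices, not GuideLLM defaults.
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main", split="train")
for item in dataset.select(range(3)):
    print(item["question"])
```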
### 4. **Scheduler**

The `Scheduler` manages the scheduling of requests to the backend. It uses multiprocessing and multithreading with asyncio to minimize overhead and maximize throughput.

- **Responsibilities:**
  - Schedule requests to the backend.
  - Manage queues for requests and results.
  - Ensure efficient utilization of resources.
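The general pattern (worker processes pulling requests from a shared queue, each resolving several requests concurrently with asyncio) looks roughly like the sketch below. This is a simplified illustration of the technique, not GuideLLM's actual scheduler code.

```python
# Simplified illustration of the scheduling pattern (not GuideLLM's code):
# worker processes pull requests from a shared queue, and each process
# resolves several requests concurrently on its own asyncio event loop.
import asyncio
import multiprocessing as mp


async def handle_request(request: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an async call to the backend
    return f"result for {request}"


def worker(request_queue, result_queue) -> None:
    async def run() -> None:
        tasks = []
        while True:
            # Pull the next request without blocking the event loop.
            request = await asyncio.to_thread(request_queue.get)
            if request is None:  # sentinel: no more work
                break
            tasks.append(asyncio.create_task(handle_request(request)))
        for result in await asyncio.gather(*tasks):
            result_queue.put(result)

    asyncio.run(run())


if __name__ == "__main__":
    request_queue = mp.Queue()
    result_queue = mp.Queue()

    for i in range(8):
        request_queue.put(f"request-{i}")

    workers = [mp.Process(target=worker, args=(request_queue, result_queue)) for _ in range(2)]
    for process in workers:
        process.start()
    for _ in workers:
        request_queue.put(None)  # one sentinel per worker
    for process in workers:
        process.join()

    while not result_queue.empty():
        print(result_queue.get())
```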
### 5. **RequestsWorker**

The `RequestsWorker` is a worker process that pulls requests from a queue, processes them using the backend, and sends the results back to the scheduler.

- **Responsibilities:**
  - Process requests from the scheduler.
  - Interact with the backend to generate results.
  - Return results to the scheduler.
### 6. **Benchmarker**

The `Benchmarker` wraps around multiple invocations of the `Scheduler`, one for each benchmark. It aggregates results using a `BenchmarkAggregator` and compiles them into a `Benchmark` once complete.

- **Responsibilities:**
  - Manage multiple benchmarks.
  - Aggregate results from the scheduler.
  - Compile results into a final benchmark report.
### 7. **BenchmarkAggregator**

The `BenchmarkAggregator` is responsible for storing and compiling results from the benchmarks.

- **Responsibilities:**
  - Aggregate results from multiple benchmarks.
  - Compile results into a `Benchmark` object.
## Component Interactions

The diagram above illustrates the relationships between the core components and how requests and results flow between them.

docs/assets/sample-benchmarks.gif

-2.07 MB

docs/assets/sample-output-end.png

-412 KB
Binary file not shown.

docs/assets/sample-output-start.png

-490 KB
Binary file not shown.

docs/assets/sample-output.png

165 KB

docs/backends.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Supported Backends for GuideLLM

GuideLLM is designed to work with OpenAI-compatible HTTP servers, enabling seamless integration with a variety of generative AI backends. This compatibility ensures that users can evaluate and optimize their large language model (LLM) deployments efficiently. While the current focus is on OpenAI-compatible servers, we welcome contributions to expand support for other backends, including additional server implementations and Python interfaces.

## Supported Backends

### OpenAI-Compatible HTTP Servers

GuideLLM supports OpenAI-compatible HTTP servers, which provide a standardized API for interacting with LLMs. This includes popular implementations such as [vLLM](https://github.com/vllm-project/vllm) and [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference). These servers allow GuideLLM to perform evaluations, benchmarks, and optimizations with minimal setup.
## Examples for Spinning Up Compatible Servers

### 1. vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-performance OpenAI-compatible server designed for efficient LLM inference. It supports a variety of models and provides a simple interface for deployment.

First, ensure you have vLLM installed (`pip install vllm`), then run the following command to start a vLLM server with a quantized Llama 3.1 8B model:

```bash
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```

For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
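Once the server is up, it can be useful to confirm it is reachable and see which model identifiers it exposes before pointing GuideLLM at it. The snippet below is a minimal sketch using the `requests` library against vLLM's standard OpenAI-style `/v1/models` endpoint; it assumes vLLM's default address of `http://localhost:8000`.

```python
# Minimal readiness check for an OpenAI-compatible server (assumes vLLM's
# default address of http://localhost:8000; adjust the URL for your setup).
import requests

response = requests.get("http://localhost:8000/v1/models", timeout=10)
response.raise_for_status()

for model in response.json().get("data", []):
    print(model["id"])  # model identifiers the server will accept
```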
### 2. Text Generation Inference (TGI)

[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is another OpenAI-compatible server that supports a wide range of models, including those hosted on Hugging Face. TGI is optimized for high-throughput and low-latency inference.

To start a TGI server with a Llama 3.1 8B model using Docker, run the following command:

```bash
docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
  -e MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct \
  -e NUM_SHARD=1 \
  -e MAX_INPUT_TOKENS=4096 \
  -e MAX_TOTAL_TOKENS=6000 \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/text-generation-inference:2.2.0
```

For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
## Expanding Backend Support

GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](https://github.com/neuralmagic/guidellm/blob/main/CONTRIBUTING.md) file.

docs/install.md

Lines changed: 87 additions & 1 deletion
@@ -1 +1,87 @@
-# Coming Soon

# Installation Guide for GuideLLM

GuideLLM can be installed using several methods depending on your requirements. Below are the detailed instructions for each installation pathway.

## Prerequisites

Before installing GuideLLM, ensure you have the following prerequisites:

- **Operating System:** Linux or macOS

- **Python Version:** 3.9 – 3.13

- **Pip Version:** Ensure you have the latest version of pip installed. You can upgrade pip using the following command:

  ```bash
  python -m pip install --upgrade pip
  ```
## Installation Methods

### 1. Install the Latest Release from PyPI

The simplest way to install GuideLLM is via pip from the Python Package Index (PyPI):

```bash
pip install guidellm
```

This will install the latest stable release of GuideLLM.

### 2. Install a Specific Version from PyPI

If you need a specific version of GuideLLM, you can specify the version number during installation. For example, to install version `0.2.0`:

```bash
pip install guidellm==0.2.0
```

### 3. Install from Source on the Main Branch

To install the latest development version of GuideLLM from the main branch, use the following command:

```bash
pip install git+https://github.com/neuralmagic/guidellm.git
```

This will clone the repository and install GuideLLM directly from the main branch.

### 4. Install from a Specific Branch

If you want to install GuideLLM from a specific branch (e.g., `feature-branch`), use the following command:

```bash
pip install git+https://github.com/neuralmagic/guidellm.git@feature-branch
```

Replace `feature-branch` with the name of the branch you want to install.

### 5. Install from a Local Clone

If you have cloned the GuideLLM repository locally and want to install it, navigate to the repository directory and run:

```bash
pip install .
```

Alternatively, for development purposes, you can install it in editable mode:

```bash
pip install -e .
```

This allows you to make changes to the source code and have them reflected immediately without reinstalling.
## Verifying the Installation

After installation, you can verify that GuideLLM is installed correctly by running:

```bash
guidellm --help
```

This should print the GuideLLM command-line help, confirming that the package and its CLI entry point are installed correctly.
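To check the installed version specifically, you can query the package metadata from Python using only the standard library (no GuideLLM-specific API is assumed here):

```python
# Print the installed GuideLLM version using standard-library package metadata.
from importlib.metadata import version

print(version("guidellm"))  # e.g. "0.2.0"
```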
## Troubleshooting

If you encounter any issues during installation, ensure that your Python and pip versions meet the prerequisites. For further assistance, please refer to the [GitHub Issues](https://github.com/neuralmagic/guidellm/issues) page or consult the [Documentation](https://github.com/neuralmagic/guidellm/tree/main/docs).

docs/metrics.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Metrics Documentation

GuideLLM provides a comprehensive set of metrics to evaluate and optimize the performance of large language model (LLM) deployments. These metrics are designed to help users understand the behavior of their models under various conditions, identify bottlenecks, and make informed decisions about scaling and resource allocation. Below, we outline the key metrics measured by GuideLLM, their definitions, use cases, and how they can be interpreted.

## Request Status Metrics

### Successful, Incomplete, and Error Requests

- **Successful Requests**: The number of requests that were completed successfully without any errors.
- **Incomplete Requests**: The number of requests that were started but not completed, often due to timeouts or interruptions.
- **Error Requests**: The number of requests that failed due to errors, such as invalid inputs or server issues.

These metrics provide a breakdown of the overall request statuses, helping users identify the reliability and stability of their LLM deployment.

### Requests Made

- **Definition**: The total number of requests made during a benchmark run, broken down by status (successful, incomplete, error).
- **Use Case**: Helps gauge the workload handled by the system and identify the proportion of requests that were successful versus those that failed or were incomplete.
## Token Metrics

### Prompt Tokens and Counts

- **Definition**: The number of tokens in the input prompts sent to the LLM.
- **Use Case**: Useful for understanding the complexity of the input data and its impact on model performance.

### Output Tokens and Counts

- **Definition**: The number of tokens generated by the LLM in response to the input prompts.
- **Use Case**: Helps evaluate the model's output length and its correlation with latency and resource usage.
## Performance Metrics

### Request Rate (Requests Per Second)

- **Definition**: The number of requests processed per second.
- **Use Case**: Indicates the throughput of the system and its ability to handle concurrent workloads.

### Request Concurrency

- **Definition**: The number of requests being processed simultaneously.
- **Use Case**: Helps evaluate the system's capacity to handle parallel workloads.

### Output Tokens Per Second

- **Definition**: The average number of output tokens generated per second as a throughput metric across all requests.
- **Use Case**: Provides insights into the server's performance and efficiency in generating output tokens.

### Total Tokens Per Second

- **Definition**: The combined rate of prompt and output tokens processed per second as a throughput metric across all requests.
- **Use Case**: Provides insights into the server's overall performance and efficiency in processing both prompt and output tokens.
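As a rough sketch of how these two throughput metrics can be read (one consistent interpretation of the definitions above; GuideLLM's exact accounting of warm-up, cool-down, and incomplete requests may differ):

$$
\text{Output TPS} \approx \frac{\sum_{i} \text{output\_tokens}_i}{T_{\text{benchmark}}},
\qquad
\text{Total TPS} \approx \frac{\sum_{i} \left(\text{prompt\_tokens}_i + \text{output\_tokens}_i\right)}{T_{\text{benchmark}}}
$$

where the sums run over the requests in the benchmark and $T_{\text{benchmark}}$ is the measured benchmark duration in seconds.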
### Request Latency

- **Definition**: The time taken to process a single request, from start to finish.
- **Use Case**: A critical metric for evaluating the responsiveness of the system.

### Time to First Token (TTFT)

- **Definition**: The time taken to generate the first token of the output.
- **Use Case**: Indicates the initial response time of the model, which is crucial for user-facing applications.

### Inter-Token Latency (ITL)

- **Definition**: The average time between generating consecutive tokens in the output, excluding the first token.
- **Use Case**: Helps assess the smoothness and speed of token generation.

### Time Per Output Token

- **Definition**: The average time taken to generate each output token, including the first token.
- **Use Case**: Provides a detailed view of the model's token generation efficiency.
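For a single streamed request with $n$ output tokens, one consistent reading of these latency definitions is sketched below; the exact formulas GuideLLM uses internally may differ in edge cases such as $n \le 1$:

$$
\text{ITL} \approx \frac{T_{\text{request}} - \text{TTFT}}{n - 1},
\qquad
\text{Time per output token} \approx \frac{T_{\text{request}}}{n},
\qquad
T_{\text{request}} \approx \text{TTFT} + (n - 1)\,\text{ITL}
$$

where $T_{\text{request}}$ is the end-to-end request latency.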
## Statistical Summaries

GuideLLM provides detailed statistical summaries for each of the above metrics using the `StatusDistributionSummary` and `DistributionSummary` models. These summaries include the following statistics:

### Summary Statistics

- **Mean**: The average value of the metric.
- **Median**: The middle value of the metric when sorted.
- **Mode**: The most frequently occurring value of the metric.
- **Variance**: The measure of how much the values of the metric vary.
- **Standard Deviation (Std Dev)**: The square root of the variance, indicating the spread of the values.
- **Min**: The minimum value of the metric.
- **Max**: The maximum value of the metric.
- **Count**: The total number of data points for the metric.
- **Total Sum**: The sum of all values for the metric.
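For intuition, the snippet below computes the same summary statistics for a small, made-up list of request latencies using Python's standard `statistics` module; it is illustrative only and is not how GuideLLM computes its `DistributionSummary` internally.

```python
# Illustrative only: summary statistics for a hypothetical list of
# request latencies in seconds (not GuideLLM's implementation).
import statistics

latencies = [0.84, 0.91, 1.02, 1.10, 1.18, 1.25, 1.40, 1.73, 2.05, 3.12]

summary = {
    "mean": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "mode": statistics.mode(latencies),
    "variance": statistics.pvariance(latencies),  # population variance
    "std_dev": statistics.pstdev(latencies),      # population std dev
    "min": min(latencies),
    "max": max(latencies),
    "count": len(latencies),
    "total_sum": sum(latencies),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```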
### Percentiles

GuideLLM calculates a comprehensive set of percentiles for each metric, including:

- **0.1th Percentile (p001)**: The value below which 0.1% of the data falls.
- **1st Percentile (p01)**: The value below which 1% of the data falls.
- **5th Percentile (p05)**: The value below which 5% of the data falls.
- **10th Percentile (p10)**: The value below which 10% of the data falls.
- **25th Percentile (p25)**: The value below which 25% of the data falls.
- **75th Percentile (p75)**: The value below which 75% of the data falls.
- **90th Percentile (p90)**: The value below which 90% of the data falls.
- **95th Percentile (p95)**: The value below which 95% of the data falls.
- **99th Percentile (p99)**: The value below which 99% of the data falls.
- **99.9th Percentile (p999)**: The value below which 99.9% of the data falls.
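Continuing the illustrative snippet above, a handful of these percentile cut points can be computed with the standard library as well; GuideLLM's own percentile calculation may differ in interpolation details.

```python
# Illustrative only: percentile cut points for the same hypothetical latencies.
import statistics

latencies = [0.84, 0.91, 1.02, 1.10, 1.18, 1.25, 1.40, 1.73, 2.05, 3.12]

# statistics.quantiles with n=100 returns the 1st through 99th percentile cut points.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
for label, index in [("p25", 24), ("p75", 74), ("p90", 89), ("p95", 94), ("p99", 98)]:
    print(f"{label}: {cuts[index]:.3f} s")
```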
### Use Cases for Statistical Summaries

- **Mean and Median**: Provide a central tendency of the metric values.
- **Variance and Std Dev**: Indicate the variability and consistency of the metric.
- **Min and Max**: Highlight the range of the metric values.
- **Percentiles**: Offer a detailed view of the distribution, helping identify outliers and performance at different levels of service.

By combining these metrics and statistical summaries, GuideLLM enables users to gain a deep understanding of their LLM deployments, optimize performance, and ensure scalability and cost-effectiveness.
