
Commit 3be2b5d

v0.2.0 Version Update and Docs Expansions (#118)
Update version to 0.2.0 and expand docs to account for the latest changes along with backfill of missing docs
1 parent 3b18458 · commit 3be2b5d

18 files changed: +662, -86 lines

README.md

Lines changed: 71 additions & 62 deletions
Large diffs are not rendered by default.

docs/architecture.md

Lines changed: 97 additions & 1 deletion
@@ -1 +1,97 @@
-# Coming Soon

# GuideLLM Architecture

GuideLLM is designed to evaluate and optimize large language model (LLM) deployments by simulating real-world inference workloads. The architecture is modular, enabling flexibility and scalability. Below is an overview of the core components and their interactions.

```
+------------------+      +------------------+      +------------------+
|  DatasetCreator  | ---> |  RequestLoader   | ---> |    Scheduler     |
+------------------+      +------------------+      +------------------+
                                                       /      |      \
                                                      /       |       \
                                                     /        |        \
                                                    v         v         v
                                     +------------------+    +------------------+
                                     |  RequestsWorker  |    |  RequestsWorker  |
                                     +------------------+    +------------------+
                                              |                        |
                                              v                        v
                                     +------------------+    +------------------+
                                     |     Backend      |    |     Backend      |
                                     +------------------+    +------------------+
                                              |                        |
                                              v                        v
                                     +---------------------------------------+
                                     |          BenchmarkAggregator          |
                                     +---------------------------------------+
                                                         |
                                                         v
                                               +------------------+
                                               |   Benchmarker    |
                                               +------------------+
```
## Core Components

### 1. **Backend**

The `Backend` is an abstract interface for interacting with generative AI backends. It is responsible for processing requests and generating results. GuideLLM supports OpenAI-compatible HTTP servers, such as vLLM, as backends.

- **Responsibilities:**
  - Accept requests from the `RequestsWorker`.
  - Generate responses for text or chat completions.
  - Validate backend readiness and available models.
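To make the idea of an "abstract interface" concrete, the sketch below shows roughly what such a backend contract can look like. The class and method names here are hypothetical and chosen for illustration; they are not GuideLLM's actual classes.

```python
# Hypothetical sketch of a backend interface; the names are illustrative only
# and do not mirror GuideLLM's real API.
from abc import ABC, abstractmethod
from typing import AsyncIterator


class GenerativeBackend(ABC):
    """Contract a worker relies on when talking to an inference server."""

    @abstractmethod
    async def check_ready(self) -> None:
        """Raise if the server is unreachable or the model is unavailable."""

    @abstractmethod
    async def available_models(self) -> list[str]:
        """Return the model identifiers the server exposes."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        """Stream generated tokens for a text or chat completion request."""
```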
### 2. **RequestLoader**

The `RequestLoader` handles sourcing data from an iterable and generating requests for the backend. It ensures that data is properly formatted and ready for processing.

- **Responsibilities:**
  - Load data from datasets or synthetic sources.
  - Generate requests in a format compatible with the backend.
### 3. **DatasetCreator**

The `DatasetCreator` is responsible for loading data sources and converting them into Hugging Face (HF) dataset items. These items can then be streamed by the `RequestLoader`.

- **Responsibilities:**
  - Load datasets from local files, Hugging Face datasets, or synthetic data.
  - Convert data into a format compatible with the `RequestLoader`.
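For intuition, the kind of Hugging Face dataset items this stage works with can be produced directly with the `datasets` library. The dataset name and column below are arbitrary examples, not defaults that GuideLLM ships with.

```python
# Illustrative only: load a Hugging Face dataset and read out prompt text.
# "openai/gsm8k" and its "question" column are example choices, not GuideLLM defaults.
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main", split="train")
for item in dataset.select(range(3)):
    print(item["question"])
```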
### 4. **Scheduler**

The `Scheduler` manages the scheduling of requests to the backend. It uses multiprocessing and multithreading with asyncio to minimize overhead and maximize throughput.

- **Responsibilities:**
  - Schedule requests to the backend.
  - Manage queues for requests and results.
  - Ensure efficient utilization of resources.
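The general pattern (worker processes pulling requests from a shared queue, each resolving several requests concurrently with asyncio) looks roughly like the sketch below. This is a simplified illustration of the technique, not GuideLLM's actual scheduler code.

```python
# Simplified illustration of the scheduling pattern (not GuideLLM's code):
# worker processes pull requests from a shared queue, and each process
# resolves several requests concurrently on its own asyncio event loop.
import asyncio
import multiprocessing as mp


async def handle_request(request: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an async call to the backend
    return f"result for {request}"


def worker(request_queue, result_queue) -> None:
    async def run() -> None:
        tasks = []
        while True:
            # Pull the next request without blocking the event loop.
            request = await asyncio.to_thread(request_queue.get)
            if request is None:  # sentinel: no more work
                break
            tasks.append(asyncio.create_task(handle_request(request)))
        for result in await asyncio.gather(*tasks):
            result_queue.put(result)

    asyncio.run(run())


if __name__ == "__main__":
    request_queue = mp.Queue()
    result_queue = mp.Queue()

    for i in range(8):
        request_queue.put(f"request-{i}")

    workers = [mp.Process(target=worker, args=(request_queue, result_queue)) for _ in range(2)]
    for process in workers:
        process.start()
    for _ in workers:
        request_queue.put(None)  # one sentinel per worker
    for process in workers:
        process.join()

    while not result_queue.empty():
        print(result_queue.get())
```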
### 5. **RequestsWorker**

The `RequestsWorker` is a worker process that pulls requests from a queue, processes them using the backend, and sends the results back to the scheduler.

- **Responsibilities:**
  - Process requests from the scheduler.
  - Interact with the backend to generate results.
  - Return results to the scheduler.
### 6. **Benchmarker**

The `Benchmarker` wraps around multiple invocations of the `Scheduler`, one for each benchmark. It aggregates results using a `BenchmarkAggregator` and compiles them into a `Benchmark` once complete.

- **Responsibilities:**
  - Manage multiple benchmarks.
  - Aggregate results from the scheduler.
  - Compile results into a final benchmark report.
### 7. **BenchmarkAggregator**

The `BenchmarkAggregator` is responsible for storing and compiling results from the benchmarks.

- **Responsibilities:**
  - Aggregate results from multiple benchmarks.
  - Compile results into a `Benchmark` object.
## Component Interactions

The diagram above illustrates the relationships between the core components and how requests and results flow between them.

docs/assets/sample-benchmarks.gif

-2.07 MB

docs/assets/sample-output-end.png

-412 KB
Binary file not shown.

docs/assets/sample-output-start.png

-490 KB
Binary file not shown.

docs/assets/sample-output.png

165 KB

docs/backends.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Supported Backends for GuideLLM

GuideLLM is designed to work with OpenAI-compatible HTTP servers, enabling seamless integration with a variety of generative AI backends. This compatibility ensures that users can evaluate and optimize their large language model (LLM) deployments efficiently. While the current focus is on OpenAI-compatible servers, we welcome contributions to expand support for other backends, including additional server implementations and Python interfaces.

## Supported Backends

### OpenAI-Compatible HTTP Servers

GuideLLM supports OpenAI-compatible HTTP servers, which provide a standardized API for interacting with LLMs. This includes popular implementations such as [vLLM](https://github.com/vllm-project/vllm) and [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference). These servers allow GuideLLM to perform evaluations, benchmarks, and optimizations with minimal setup.
## Examples for Spinning Up Compatible Servers

### 1. vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-performance OpenAI-compatible server designed for efficient LLM inference. It supports a variety of models and provides a simple interface for deployment.

First, ensure you have vLLM installed (`pip install vllm`), then run the following command to start a vLLM server with a quantized Llama 3.1 8B model:

```bash
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```

For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
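Once the server is up, it can be useful to confirm it is reachable and see which model identifiers it exposes before pointing GuideLLM at it. The snippet below is a minimal sketch using the `requests` library against vLLM's standard OpenAI-style `/v1/models` endpoint; it assumes vLLM's default address of `http://localhost:8000`.

```python
# Minimal readiness check for an OpenAI-compatible server (assumes vLLM's
# default address of http://localhost:8000; adjust the URL for your setup).
import requests

response = requests.get("http://localhost:8000/v1/models", timeout=10)
response.raise_for_status()

for model in response.json().get("data", []):
    print(model["id"])  # model identifiers the server will accept
```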
### 2. Text Generation Inference (TGI)

[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is another OpenAI-compatible server that supports a wide range of models, including those hosted on Hugging Face. TGI is optimized for high-throughput and low-latency inference.

To start a TGI server with a Llama 3.1 8B model using Docker, run the following command:

```bash
docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
  -e MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct \
  -e NUM_SHARD=1 \
  -e MAX_INPUT_TOKENS=4096 \
  -e MAX_TOTAL_TOKENS=6000 \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/text-generation-inference:2.2.0
```

For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
## Expanding Backend Support

GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](https://github.com/neuralmagic/guidellm/blob/main/CONTRIBUTING.md) file.

docs/install.md

Lines changed: 87 additions & 1 deletion
@@ -1 +1,87 @@
-# Coming Soon

# Installation Guide for GuideLLM

GuideLLM can be installed using several methods depending on your requirements. Below are the detailed instructions for each installation pathway.

## Prerequisites

Before installing GuideLLM, ensure you have the following prerequisites:

- **Operating System:** Linux or macOS

- **Python Version:** 3.9 – 3.13

- **Pip Version:** Ensure you have the latest version of pip installed. You can upgrade pip using the following command:

  ```bash
  python -m pip install --upgrade pip
  ```
## Installation Methods

### 1. Install the Latest Release from PyPI

The simplest way to install GuideLLM is via pip from the Python Package Index (PyPI):

```bash
pip install guidellm
```

This will install the latest stable release of GuideLLM.

### 2. Install a Specific Version from PyPI

If you need a specific version of GuideLLM, you can specify the version number during installation. For example, to install version `0.2.0`:

```bash
pip install guidellm==0.2.0
```

### 3. Install from Source on the Main Branch

To install the latest development version of GuideLLM from the main branch, use the following command:

```bash
pip install git+https://github.com/neuralmagic/guidellm.git
```

This will clone the repository and install GuideLLM directly from the main branch.

### 4. Install from a Specific Branch

If you want to install GuideLLM from a specific branch (e.g., `feature-branch`), use the following command:

```bash
pip install git+https://github.com/neuralmagic/guidellm.git@feature-branch
```

Replace `feature-branch` with the name of the branch you want to install.

### 5. Install from a Local Clone

If you have cloned the GuideLLM repository locally and want to install it, navigate to the repository directory and run:

```bash
pip install .
```

Alternatively, for development purposes, you can install it in editable mode:

```bash
pip install -e .
```

This allows you to make changes to the source code and have them reflected immediately without reinstalling.
## Verifying the Installation

After installation, you can verify that GuideLLM is installed correctly by running:

```bash
guidellm --help
```

This should print the GuideLLM command-line help, confirming that the package and its CLI entry point are installed correctly.
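To check the installed version specifically, you can query the package metadata from Python using only the standard library (no GuideLLM-specific API is assumed here):

```python
# Print the installed GuideLLM version using standard-library package metadata.
from importlib.metadata import version

print(version("guidellm"))  # e.g. "0.2.0"
```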
## Troubleshooting

If you encounter any issues during installation, ensure that your Python and pip versions meet the prerequisites. For further assistance, please refer to the [GitHub Issues](https://github.com/neuralmagic/guidellm/issues) page or consult the [Documentation](https://github.com/neuralmagic/guidellm/tree/main/docs).

docs/metrics.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Metrics Documentation

GuideLLM provides a comprehensive set of metrics to evaluate and optimize the performance of large language model (LLM) deployments. These metrics are designed to help users understand the behavior of their models under various conditions, identify bottlenecks, and make informed decisions about scaling and resource allocation. Below, we outline the key metrics measured by GuideLLM, their definitions, use cases, and how they can be interpreted.

## Request Status Metrics

### Successful, Incomplete, and Error Requests

- **Successful Requests**: The number of requests that were completed successfully without any errors.
- **Incomplete Requests**: The number of requests that were started but not completed, often due to timeouts or interruptions.
- **Error Requests**: The number of requests that failed due to errors, such as invalid inputs or server issues.

These metrics provide a breakdown of the overall request statuses, helping users identify the reliability and stability of their LLM deployment.

### Requests Made

- **Definition**: The total number of requests made during a benchmark run, broken down by status (successful, incomplete, error).
- **Use Case**: Helps gauge the workload handled by the system and identify the proportion of requests that were successful versus those that failed or were incomplete.
## Token Metrics

### Prompt Tokens and Counts

- **Definition**: The number of tokens in the input prompts sent to the LLM.
- **Use Case**: Useful for understanding the complexity of the input data and its impact on model performance.

### Output Tokens and Counts

- **Definition**: The number of tokens generated by the LLM in response to the input prompts.
- **Use Case**: Helps evaluate the model's output length and its correlation with latency and resource usage.
## Performance Metrics

### Request Rate (Requests Per Second)

- **Definition**: The number of requests processed per second.
- **Use Case**: Indicates the throughput of the system and its ability to handle concurrent workloads.

### Request Concurrency

- **Definition**: The number of requests being processed simultaneously.
- **Use Case**: Helps evaluate the system's capacity to handle parallel workloads.

### Output Tokens Per Second

- **Definition**: The average number of output tokens generated per second as a throughput metric across all requests.
- **Use Case**: Provides insights into the server's performance and efficiency in generating output tokens.

### Total Tokens Per Second

- **Definition**: The combined rate of prompt and output tokens processed per second as a throughput metric across all requests.
- **Use Case**: Provides insights into the server's overall performance and efficiency in processing both prompt and output tokens.
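As a rough sketch of how these two throughput metrics can be read (one consistent interpretation of the definitions above; GuideLLM's exact accounting of warm-up, cool-down, and incomplete requests may differ):

$$
\text{Output TPS} \approx \frac{\sum_{i} \text{output\_tokens}_i}{T_{\text{benchmark}}},
\qquad
\text{Total TPS} \approx \frac{\sum_{i} \left(\text{prompt\_tokens}_i + \text{output\_tokens}_i\right)}{T_{\text{benchmark}}}
$$

where the sums run over the requests in the benchmark and $T_{\text{benchmark}}$ is the measured benchmark duration in seconds.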
### Request Latency

- **Definition**: The time taken to process a single request, from start to finish.
- **Use Case**: A critical metric for evaluating the responsiveness of the system.

### Time to First Token (TTFT)

- **Definition**: The time taken to generate the first token of the output.
- **Use Case**: Indicates the initial response time of the model, which is crucial for user-facing applications.

### Inter-Token Latency (ITL)

- **Definition**: The average time between generating consecutive tokens in the output, excluding the first token.
- **Use Case**: Helps assess the smoothness and speed of token generation.

### Time Per Output Token

- **Definition**: The average time taken to generate each output token, including the first token.
- **Use Case**: Provides a detailed view of the model's token generation efficiency.
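For a single streamed request with $n$ output tokens, one consistent reading of these latency definitions is sketched below; the exact formulas GuideLLM uses internally may differ in edge cases such as $n \le 1$:

$$
\text{ITL} \approx \frac{T_{\text{request}} - \text{TTFT}}{n - 1},
\qquad
\text{Time per output token} \approx \frac{T_{\text{request}}}{n},
\qquad
T_{\text{request}} \approx \text{TTFT} + (n - 1)\,\text{ITL}
$$

where $T_{\text{request}}$ is the end-to-end request latency.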
## Statistical Summaries

GuideLLM provides detailed statistical summaries for each of the above metrics using the `StatusDistributionSummary` and `DistributionSummary` models. These summaries include the following statistics:

### Summary Statistics

- **Mean**: The average value of the metric.
- **Median**: The middle value of the metric when sorted.
- **Mode**: The most frequently occurring value of the metric.
- **Variance**: The measure of how much the values of the metric vary.
- **Standard Deviation (Std Dev)**: The square root of the variance, indicating the spread of the values.
- **Min**: The minimum value of the metric.
- **Max**: The maximum value of the metric.
- **Count**: The total number of data points for the metric.
- **Total Sum**: The sum of all values for the metric.
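For intuition, the snippet below computes the same summary statistics for a small, made-up list of request latencies using Python's standard `statistics` module; it is illustrative only and is not how GuideLLM computes its `DistributionSummary` internally.

```python
# Illustrative only: summary statistics for a hypothetical list of
# request latencies in seconds (not GuideLLM's implementation).
import statistics

latencies = [0.84, 0.91, 1.02, 1.10, 1.18, 1.25, 1.40, 1.73, 2.05, 3.12]

summary = {
    "mean": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "mode": statistics.mode(latencies),
    "variance": statistics.pvariance(latencies),  # population variance
    "std_dev": statistics.pstdev(latencies),      # population std dev
    "min": min(latencies),
    "max": max(latencies),
    "count": len(latencies),
    "total_sum": sum(latencies),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```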
### Percentiles

GuideLLM calculates a comprehensive set of percentiles for each metric, including:

- **0.1th Percentile (p001)**: The value below which 0.1% of the data falls.
- **1st Percentile (p01)**: The value below which 1% of the data falls.
- **5th Percentile (p05)**: The value below which 5% of the data falls.
- **10th Percentile (p10)**: The value below which 10% of the data falls.
- **25th Percentile (p25)**: The value below which 25% of the data falls.
- **75th Percentile (p75)**: The value below which 75% of the data falls.
- **90th Percentile (p90)**: The value below which 90% of the data falls.
- **95th Percentile (p95)**: The value below which 95% of the data falls.
- **99th Percentile (p99)**: The value below which 99% of the data falls.
- **99.9th Percentile (p999)**: The value below which 99.9% of the data falls.
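Continuing the illustrative snippet above, a handful of these percentile cut points can be computed with the standard library as well; GuideLLM's own percentile calculation may differ in interpolation details.

```python
# Illustrative only: percentile cut points for the same hypothetical latencies.
import statistics

latencies = [0.84, 0.91, 1.02, 1.10, 1.18, 1.25, 1.40, 1.73, 2.05, 3.12]

# statistics.quantiles with n=100 returns the 1st through 99th percentile cut points.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
for label, index in [("p25", 24), ("p75", 74), ("p90", 89), ("p95", 94), ("p99", 98)]:
    print(f"{label}: {cuts[index]:.3f} s")
```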
### Use Cases for Statistical Summaries

- **Mean and Median**: Provide a central tendency of the metric values.
- **Variance and Std Dev**: Indicate the variability and consistency of the metric.
- **Min and Max**: Highlight the range of the metric values.
- **Percentiles**: Offer a detailed view of the distribution, helping identify outliers and performance at different levels of service.

By combining these metrics and statistical summaries, GuideLLM enables users to gain a deep understanding of their LLM deployments, optimize performance, and ensure scalability and cost-effectiveness.
