forked from theroyallab/tabbyAPI
Commit: Update documentation and configs
Signed-off-by: kingbri <bdashore3@proton.me>
Showing 5 changed files with 88 additions and 95 deletions.
README.md:
# tabbyAPI

A FastAPI based application that allows for generating text with an LLM (large language model) using the [exllamav2 backend](https://github.com/turboderp/exllamav2).

## Disclaimer

This API is still in the alpha phase. There may be bugs and changes down the line. Be aware that you might need to reinstall dependencies as the project updates.
## Prerequisites

To get started, make sure you have the following installed on your system:

- Python 3.x (preferably 3.11) with pip

- CUDA 12.1 or 11.8

NOTE: For Flash Attention 2 to work on Windows, CUDA 12.1 **must** be installed!
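To check which CUDA version your system has, a quick sketch (assuming the Nvidia driver, and optionally the CUDA toolkit, are installed):

```bash
# Driver version and the highest CUDA version it supports
nvidia-smi
# Installed CUDA toolkit version, if nvcc is on your PATH
nvcc --version
```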
## Installing

1. Clone this repository to your machine: `git clone https://github.com/theroyallab/tabbyAPI`

2. Navigate to the project directory: `cd tabbyAPI`

3. Create a virtual environment:

   1. `python -m venv venv`

   2. On Windows: `.\venv\Scripts\activate`. On Linux: `source venv/bin/activate`

4. Install torch using the instructions found [here](https://pytorch.org/get-started/locally/)

5. Install an exllamav2 wheel from [here](https://github.com/turboderp/exllamav2/releases):

   1. Find the version that corresponds to your CUDA and Python versions. For example, a wheel with `cu121` and `cp311` corresponds to CUDA 12.1 and Python 3.11

6. Install the other requirements via `pip install -r requirements.txt` (see the consolidated sketch after this list)
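Putting steps 1 through 6 together, a minimal sketch for Linux with CUDA 12.1 and Python 3.11 (the torch index URL follows pytorch.org's local-install instructions; the exllamav2 wheel filename is a placeholder, so substitute the actual `cu121`/`cp311` asset from the releases page):

```bash
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
python -m venv venv
source venv/bin/activate
# Install torch built against CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install the exllamav2 wheel matching CUDA 12.1 / Python 3.11 (placeholder filename)
pip install exllamav2-X.Y.Z+cu121-cp311-cp311-linux_x86_64.whl
# Install the remaining project dependencies
pip install -r requirements.txt
```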
## Configuration

Copy over `config_sample.yml` to `config.yml`. All the fields are commented, so make sure to read the descriptions and comment out or remove fields that you don't need.
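For example, on Linux or macOS (on Windows, `copy` works the same way):

```bash
# Start from the sample config; edit config.yml afterwards to fit your setup
cp config_sample.yml config.yml
```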
## Launching the Application

1. Make sure you are in the project directory and have activated the venv

2. Run the tabbyAPI application: `python main.py`
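Both steps combined, assuming a Linux shell and the venv created during installation:

```bash
cd tabbyAPI
source venv/bin/activate   # .\venv\Scripts\activate on Windows
python main.py
```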
## API Documentation

Docs can be accessed once you launch the API at `http://<your-IP>:<your-port>/docs`

If you use the default YAML config, they're accessible at `http://localhost:5000/docs`
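A quick sanity check that the server is up, assuming the default host and port from the sample config (FastAPI serves its interactive Swagger UI at `/docs`):

```bash
# Should return the Swagger UI HTML page if the API is running
curl http://localhost:5000/docs
```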
## Authentication

TabbyAPI uses an API key and an admin key to authenticate a user's requests. On first launch of the API, a file called `api_tokens.yml` will be generated with fields for the admin and API keys.

If you feel that the keys have been compromised, delete `api_tokens.yml` and the API will generate new keys for you.

API keys and admin keys can be provided via:

- the `x-api-key` and `x-admin-key` headers respectively

- an `Authorization` header with the `Bearer ` prefix

DO NOT share your admin key unless you want someone else to load/unload a model from your system!
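As a sketch, both header styles against the `/v1/model` route (which an earlier revision of this README describes as retrieving information about the currently loaded model; `YOUR_API_KEY` is a placeholder for a key from `api_tokens.yml`):

```bash
# Using the x-api-key header
curl -H "x-api-key: YOUR_API_KEY" http://localhost:5000/v1/model

# The same request using the Authorization header
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:5000/v1/model
```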
#### Authentication Requirements

All routes require an API key, except for the following, which require an **admin** key:

- `/v1/model/load`

- `/v1/model/unload`
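A hedged sketch of both cases: the request body for `/v1/model/load` is an assumption (field name included), while the model name and sampling parameters are taken from the completions example in an earlier revision of this README:

```bash
# Admin-only route: load a model (body shape is assumed, not confirmed)
curl -X POST \
  -H "Content-Type: application/json" \
  -H "x-admin-key: YOUR_ADMIN_KEY" \
  -d '{"name": "airoboros-mistral2.2-7b-exl2"}' \
  http://localhost:5000/v1/model/load

# Regular route: generate a completion with an API key
curl -X POST \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
        "prompt": "A tabby is",
        "max_tokens": 360,
        "temperature": 0.8,
        "top_p": 0.73
      }' \
  http://localhost:5000/v1/completions
```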
## Contributing

If you have issues with the project:

- Describe the issues in detail

- If you have a feature request, please indicate it as such

If you have a pull request:

- Describe the pull request in detail: what you are changing and why
## Developers and Permissions

Creators/Developers:

- kingbri

- Splice86

- Turboderp
config_sample.yml:
```yml
# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters
  host: "127.0.0.1"

  # The port to host on (default: 5000)
  port: 5000

# Options for model overrides and loading
model:
  # Overrides the directory to look for models (default: "models")
  # Make sure to use forward slashes, even on Windows (or escape your backslashes).
  # model_dir: "your model directory path"

  # An initial model to load. Make sure the model is located in the model directory!
  # A model can be loaded later via the API. This does not have to be specified
  # model_name: "A model name"

  # The below parameters apply only if model_name is set

  # Maximum model context length (default: 4096)
  max_seq_len: 4096

  # Automatically allocate resources to GPUs (default: True)
  gpu_split_auto: True

  # An array of GBs of VRAM to split between GPUs (default: [])
  # gpu_split: [20.6, 24]

  # Rope scaling parameters (default: 1.0)
  rope_scale: 1.0
  rope_alpha: 1.0

  # Disable Flash Attention 2. Recommended for GPUs older than Nvidia's 3000 series (default: False)
  no_flash_attention: False

  # Enable low VRAM optimizations in exllamav2 (default: False)
  low_mem: False
```
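As a sketch, a minimal `config.yml` built from the fields above (the model name is hypothetical; the model directory uses the documented default):

```bash
# Write a minimal config; field names come from config_sample.yml above
cat > config.yml <<'EOF'
network:
  host: "127.0.0.1"
  port: 5000

model:
  model_dir: "models"
  model_name: "my-model-exl2"   # hypothetical model folder inside model_dir
  max_seq_len: 4096
  gpu_split_auto: True
EOF
```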
The remaining changed files in this commit are an empty file and a binary file (not shown).