21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Nillion

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
154 changes: 110 additions & 44 deletions README.md
@@ -1,68 +1,134 @@
# nilAI

## Overview
nilAI is a platform designed to run on Confidential VMs with Trusted Execution Environments (TEEs). It provides a unified API for deploying and accessing multiple AI models across environments, together with user management and model lifecycle handling.

## Prerequisites

- Docker
- Docker Compose
- Hugging Face API Token (for accessing certain models)

## Configuration

1. **Environment Setup**
   - Copy the `.env.sample` file to `.env`
   - Replace `HUGGINGFACE_API_TOKEN` with your Hugging Face API token
   - The token is used to check whether you have access to gated models; for Llama models, you usually need to request access on the model's [Hugging Face page](https://huggingface.co/meta-llama/Llama-3.2-1B) (see the sketch below)
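For reference, a minimal sketch of that setup (the token value is a placeholder):

```shell
cp .env.sample .env
# then edit .env and set your own token, e.g.:
# HUGGINGFACE_API_TOKEN=hf_xxxxxxxxxxxxxxxx
```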

## Deployment Options

### 1. Docker Compose Deployment (Recommended)

#### Development Environment
```shell
docker compose -f docker-compose.yml \
  -f docker-compose.dev.yml \
  -f docker/compose/docker-compose.llama-3b-gpu.yml \
  -f docker/compose/docker-compose.llama-8b-gpu.yml \
  -f docker/compose/docker-compose.dolphin-8b-gpu.yml \
  -f docker/compose/docker-compose.deepseek-14b-gpu.yml \
  up --build
```

#### Production Environment
```shell
docker compose -f docker-compose.yml \
  -f docker-compose.prod.yml \
  -f docker/compose/docker-compose.llama-3b-gpu.yml \
  -f docker/compose/docker-compose.llama-8b-gpu.yml \
  -f docker/compose/docker-compose.dolphin-8b-gpu.yml \
  -f docker/compose/docker-compose.deepseek-14b-gpu.yml \
  up -d --build
```

**Note**: Remove the `-f` lines for any models you do not wish to deploy. Some model compose files declare `depends_on` relationships on other model services (see the compose changes further down), so check those before trimming.
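As a minimal sketch, a development stack running only the Llama 8B model (which, per the compose files in this change, depends only on etcd) could be brought up with:

```shell
docker compose -f docker-compose.yml \
  -f docker-compose.dev.yml \
  -f docker/compose/docker-compose.llama-8b-gpu.yml \
  up --build
```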

### 2. Manual Deployment

#### Components

- **API Frontend**: Receives user requests and forwards model requests to the appropriate backend model.
- **Databases**:
  - **SQLite**: The user registry. It tracks which users are allowed on the platform, their API keys, and their usage.
  - **etcd3**: A distributed key-value store used for model lifecycle management. Models register their address information under a time-limited lease and keep it alive; if a model disconnects, its entry expires and the API Frontend stops advertising it.
- **Models**: Zero or more model deployments, each answering the same `/v1/chat/completions` endpoint (see the example request below). The `Model` class defines how models connect to the database and manage their lifecycle.
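For illustration, a request to a locally running API server might look like the sketch below; the bearer-token header and OpenAI-style request body are assumptions here, so check the API's interactive docs for the actual schema:

```shell
# Assumes the API server from the setup steps below is listening on port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $NILAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```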

#### Setup Steps

1. **Start Infrastructure Services (etcd3, Redis, PostgreSQL)**
```shell
# etcd3 (model registry; see Model Lifecycle Management below)
docker run -d --name etcd-server \
  -p 2379:2379 -p 2380:2380 \
  -e ALLOW_NONE_AUTHENTICATION=yes \
  bitnami/etcd:latest

# Redis
docker run -d --name redis \
  -p 6379:6379 \
  redis:latest

# PostgreSQL (replace <ASECUREPASSWORD> and the database name)
docker run -d --name postgres \
  -e POSTGRES_USER=user \
  -e POSTGRES_PASSWORD=<ASECUREPASSWORD> \
  -e POSTGRES_DB=yourdb \
  -p 5432:5432 \
  postgres:latest
```

2. **Run API Server**
```shell
# Development Environment (auto-reloads on file changes)
uv run fastapi dev nilai-api/src/nilai_api/__main__.py --port 8080

# Production Environment
uv run fastapi run nilai-api/src/nilai_api/__main__.py --port 8080
```

3. **Run Model Instances**
```shell
# Example: Llama 3.2 1B Model (adapt the path for other models)
# Development Environment
uv run fastapi dev nilai-models/src/nilai_models/models/llama_1b_cpu/__init__.py

# Production Environment
uv run fastapi run nilai-models/src/nilai_models/models/llama_1b_cpu/__init__.py
```
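Once all three steps are running, a few optional sanity checks (ports as configured above; `/docs` is FastAPI's default interactive documentation path):

```shell
docker ps                              # etcd-server, redis, and postgres containers are up
curl -s http://localhost:2379/health   # etcd health endpoint
curl -s http://localhost:8080/docs     # API server is serving (FastAPI interactive docs)
```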

## Developer Workflow

### Code Quality and Formatting

Pre-commit hooks run before each commit, automatically formatting your code and running checks so you catch issues locally instead of waiting for CI. Install them with:

```shell
uv run pre-commit install
```
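The hooks can also be run against the whole repository at any time, which is useful before opening a pull request:

```shell
uv run pre-commit run --all-files
```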

## Model Lifecycle Management

- Models register themselves in the etcd3 database
- Registration includes address information with an auto-expiring lifetime
- If a model disconnects, its registration expires and it is automatically removed from the available models (see the sketch below)
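To make the mechanism concrete, here is a hand-rolled sketch of the same pattern using `etcdctl`. The key layout and value format are purely illustrative; the real registration is handled by the `Model` class:

```shell
# Grant a 30-second lease and capture its ID
LEASE_ID=$(etcdctl lease grant 30 | awk '{print $2}')

# Register the model's address under that lease (illustrative key/value schema)
etcdctl put "/models/llama_3b_gpu" '{"host": "llama_3b_gpu", "port": 8000}' --lease="$LEASE_ID"

# Keep the lease alive while the model is healthy; if this stops,
# the key expires and the API Frontend stops advertising the model
etcdctl lease keep-alive "$LEASE_ID"
```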

## Security

- Hugging Face API token controls model access
- SQLite database manages user permissions
- Distributed architecture allows for flexible security configurations

## Troubleshooting

- Ensure your Hugging Face API token is valid and has access to the gated models you deploy
- Check the etcd3 and Docker container logs for connection issues (example commands below)
- Verify that the required ports (e.g. 2379, 6379, 5432, 8080) are not blocked or already in use
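A few generic checks, assuming the container names from the manual-deployment steps above (adjust for a compose deployment):

```shell
docker logs etcd-server                          # etcd container from the manual steps
docker compose logs --tail=100                   # all services in a compose deployment
ss -ltnp | grep -E ':(2379|6379|5432|8080)'      # confirm the expected ports are listening
```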

## Contributing

1. Fork the repository
2. Create a feature branch
3. Install pre-commit hooks
4. Make your changes
5. Submit a pull request

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file added in this change.
6 changes: 4 additions & 2 deletions docker/compose/docker-compose.deepseek-14b-gpu.yml
@@ -20,12 +20,14 @@ services:
depends_on:
etcd:
condition: service_healthy
llama_8b_gpu:
condition: service_healthy
command: >
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
--gpu-memory-utilization 0.4
--gpu-memory-utilization 0.39
--max-model-len 10000
--tensor-parallel-size 1
--uvicorn-log-level WARNING
--uvicorn-log-level warning
environment:
- SVC_HOST=deepseek_14b_gpu
- SVC_PORT=8000
6 changes: 4 additions & 2 deletions docker/compose/docker-compose.dolphin-8b-gpu.yml
@@ -20,14 +20,16 @@ services:
depends_on:
etcd:
condition: service_healthy
llama_3b_gpu:
condition: service_healthy
command: >
--model cognitivecomputations/Dolphin3.0-Llama3.1-8B
--gpu-memory-utilization 0.5
--gpu-memory-utilization 0.21
--max-model-len 10000
--tensor-parallel-size 1
--enable-auto-tool-choice
--tool-call-parser llama3_json
--uvicorn-log-level WARNING
--uvicorn-log-level warning
environment:
- SVC_HOST=dolphin_8b_gpu
- SVC_PORT=8000
8 changes: 5 additions & 3 deletions docker/compose/docker-compose.llama-3b-gpu.yml
@@ -20,14 +20,16 @@ services:
depends_on:
etcd:
condition: service_healthy
deepseek_14b_gpu:
condition: service_healthy
command: >
--model meta-llama/Llama-3.2-3B-Instruct
--gpu-memory-utilization 0.3
--max-model-len 10000
--gpu-memory-utilization 0.085
--max-model-len 4300
--tensor-parallel-size 1
--enable-auto-tool-choice
--tool-call-parser llama3_json
--uvicorn-log-level WARNING
--uvicorn-log-level warning
environment:
- SVC_HOST=llama_3b_gpu
- SVC_PORT=8000
4 changes: 2 additions & 2 deletions docker/compose/docker-compose.llama-8b-gpu.yml
@@ -22,12 +22,12 @@ services:
condition: service_healthy
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--gpu-memory-utilization 0.5
--gpu-memory-utilization 0.21
--max-model-len 10000
--tensor-parallel-size 1
--enable-auto-tool-choice
--tool-call-parser llama3_json
--uvicorn-log-level WARNING
--uvicorn-log-level warning
environment:
- SVC_HOST=llama_8b_gpu
- SVC_PORT=8000
2 changes: 1 addition & 1 deletion nilai-api/gunicorn.conf.py
@@ -4,7 +4,7 @@
bind = ["0.0.0.0:8080", "0.0.0.0:8443"]

# Set the number of workers
workers = 10
workers = 50

# Set the number of threads per worker
threads = 1
5 changes: 3 additions & 2 deletions nilai-api/src/nilai_api/config/mainnet.py
@@ -5,8 +5,9 @@
# there can be 45 + 50 + 30 + 30 + 5 = 160 concurrent requests in the system
MODEL_CONCURRENT_RATE_LIMIT = {
"meta-llama/Llama-3.2-1B-Instruct": 45,
"meta-llama/Llama-3.2-3B-Instruct": 30,
"meta-llama/Llama-3.1-8B-Instruct": 15,
"meta-llama/Llama-3.2-3B-Instruct": 50,
"meta-llama/Llama-3.1-8B-Instruct": 30,
"cognitivecomputations/Dolphin3.0-Llama3.1-8B": 30,
"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B": 5,
}

1 change: 1 addition & 0 deletions nilai-api/src/nilai_api/config/testnet.py
@@ -7,6 +7,7 @@
"meta-llama/Llama-3.2-1B-Instruct": 10,
"meta-llama/Llama-3.2-3B-Instruct": 10,
"meta-llama/Llama-3.1-8B-Instruct": 5,
"cognitivecomputations/Dolphin3.0-Llama3.1-8B": 5,
"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B": 5,
}

Expand Down