---
description: 'DuckDB Data Accelerator Documentation'
sidebar_position: 3
---

The DuckDB Data Accelerator helps improve query performance by using [DuckDB](https://duckdb.org/), an embedded analytical database engine optimized for efficient data processing.

It supports in-memory and file-based operation modes, enabling workloads that exceed available memory and optionally providing persistent storage for datasets.

To enable DuckDB acceleration, set the dataset's `acceleration.engine` to `duckdb`:

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: file
```

## Modes

### Memory Mode

By default, DuckDB acceleration uses `mode: memory`, loading datasets into memory.
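For example, a dataset accelerated entirely in memory (the `mode` line may be omitted, since `memory` is the default):

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: memory # default; shown for clarity
```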

### File Mode

When using `mode: file`, datasets are stored by default in a DuckDB file on disk in the `.spice/data` directory relative to the spicepod.yaml. Specify the `duckdb_file` parameter to store the DuckDB file in a different location. For datasets intended to be joined, set the same `duckdb_file` path for all related datasets.
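As a sketch, two datasets sharing one DuckDB file so they can be joined locally (the dataset paths and file location are illustrative):

```yaml
datasets:
  - from: spice.ai:path.to.customers
    name: customers
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/shared_duckdb_instance.db
  - from: spice.ai:path.to.orders
    name: orders
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/shared_duckdb_instance.db # same file enables local joins
```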

## Configuration Parameters

DuckDB acceleration supports the following optional parameters under `acceleration.params`:

- `duckdb_file` (string, default: `.spice/data/accelerated_duckdb.db`): Path to the DuckDB database file. Applies only when `mode` is set to `file`. If the file does not exist, Spice creates it automatically.
- `duckdb_memory_limit` (string, default: none): Limits memory usage per DuckDB instance. Acceptable units are KB, MB, GB, TB (decimal: 1000^i) or KiB, MiB, GiB, TiB (binary: 1024^i). See the [DuckDB memory limit documentation](https://duckdb.org/docs/stable/configuration/overview).

Refer to the [datasets configuration reference](/docs/reference/spicepod/datasets.md#acceleration) for additional supported fields.

### Example Configuration

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_memory_limit: '2GB'
```

## Limitations

Consider the following limitations when using DuckDB acceleration:

- DuckDB does not support [enum and dictionary field types](https://duckdb.org/docs/sql/data_types/overview).
- DuckDB's maximum decimal precision is 38 digits. `Decimal256` (76 digits) is unsupported.
- Queries using `on_zero_results: use_source` cannot filter binary columns directly (e.g., `WHERE col_blob <> ''`). Instead, cast binary columns to another type (e.g., `WHERE CAST(col_blob AS TEXT) <> ''`).
- DuckDB indexes currently do not support spilling to disk.
- Hot-reloading dataset configurations while the Spice Runtime is active disables DuckDB query federation until the runtime restarts.

## Resource Considerations

Resource requirements depend on workload, dataset size, query complexity, and refresh modes.

### Memory

DuckDB manages memory through streaming execution, intermediate spilling, and buffer management. By default, each DuckDB instance (one per DuckDB file) uses up to 80% of available system memory. To control memory usage, set the `duckdb_memory_limit` parameter:

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: '/data/shared_duckdb_instance.db'
        duckdb_memory_limit: '4GB'
```

Note that `duckdb_memory_limit` only limits the DuckDB instance it is set on, not the entire runtime process. Additionally, it does not cover all DuckDB operations, such as some insert operations. Index creation and scans are limited by `duckdb_memory_limit`, so ensure adequate memory is provisioned.

Allocate at least 30% more container/machine memory for the runtime process.

### Indexes and Memory

DuckDB indexes currently do not support spilling to disk. While index memory usage is registered through the buffer manager, index buffers are not managed by the buffer eviction mechanism. As a result, indexes may consume significant memory, impacting memory-intensive query performance.

Indexes are serialized to disk and loaded lazily upon database reopening, ensuring they do not affect database opening performance. Also consider index serialization when allocating disk storage.

For more details, see DuckDB's [Indexes and Memory documentation](https://duckdb.org/docs/stable/guides/performance/indexing.html#indexes-and-memory).
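Because index memory is not evicted, declare indexes selectively. A sketch of declaring an acceleration index, assuming the dataset has an `id` column (see the [datasets reference](/docs/reference/spicepod/datasets.md) for the exact `indexes` syntax):

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: file
      indexes:
        id: enabled # each index consumes non-evictable memory
```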

### CPU

Query performance, data load, and refresh operations scale with available CPU resources. Allocate sufficient CPU cores based on query complexity and concurrency.

### Storage

Ensure adequate disk space for temporary files, swap files, WAL files, and intermediate spilling. Monitor disk usage regularly and adjust storage capacity based on dataset growth and query patterns.

## Temporary Directory

The Spice runtime supports configuring a temporary directory for query and acceleration operations that spill to disk. By default, this is the directory of the `duckdb_file`.

Set the `runtime.temp_directory` parameter to specify a custom temporary directory. This can help distribute I/O operations across multiple volumes for improved throughput. For example, setting `runtime.temp_directory` to a high-IOPS volume separate from the DuckDB data file can improve performance for workloads exceeding available memory.

Example configuration:

```yaml
runtime:
  temp_directory: /tmp/spice
```

Use this parameter when:

- Handling workloads that frequently spill to disk.
- Distributing swap and data I/O operations across multiple storage volumes.

For more details, refer to the [runtime parameters documentation](/docs/reference/spicepod/index.md#runtimetemp_directory).

For detailed DuckDB limits, see the [DuckDB Memory Management Guide](https://duckdb.org/docs/operations_manual/limits.html).

## Cookbook

For practical examples, see the [DuckDB Data Accelerator Cookbook Recipe](https://github.com/spiceai/cookbook/tree/trunk/duckdb/accelerator#readme).
---
title: 'Managing Memory Usage'
sidebar_label: 'Memory'
sidebar_position: 31
description: 'Guidelines and best practices for managing memory usage and optimizing performance in Spice.ai Open Source deployments.'
keywords:
- memory
pagination_prev: null
pagination_next: null
---

Effective memory management is essential for maintaining optimal performance and stability in Spice.ai Open Source deployments. This guide outlines recommendations and best practices for managing memory usage.

## General Memory Recommendations

Memory requirements vary based on workload characteristics, dataset sizes, query complexity, and refresh modes. Recommended allocations include:

- **Typical workloads**: At least 8 GB RAM.
- **Larger datasets**:
- `refresh_mode: full`: 2.5x dataset size.
- `refresh_mode: append`: 1.5x dataset size.
- `refresh_mode: changes`: Primarily influenced by CDC event volume and frequency; 1.5x dataset size is a reasonable estimate.

When using DuckDB persistent storage and disk-spilling, memory requirements can be reduced. See [DuckDB Data Accelerator](/docs/components/data-accelerators/duckdb.md).

## Refresh Modes and Memory Implications

Refresh modes affect memory usage as follows:

- **Full Refresh**: Loads data into a new table before atomically swapping it with the existing table to maintain consistency. This requires memory for both tables simultaneously, resulting in higher peak usage.
- **Append Refresh**: Incrementally inserts or upserts data, using memory only for the incremental data, which reduces usage.
- **Changes Refresh**: Applies CDC events incrementally. Memory usage depends on event volume and frequency, typically resulting in lower and predictable usage.
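For example, append-based refresh keeps peak memory closer to the incremental data size; a sketch (the interval value is illustrative):

```yaml
datasets:
  - from: spice.ai:path.to.my_dataset
    name: my_dataset
    acceleration:
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 10m # illustrative polling interval
```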

## DataFusion Memory Management

Spice.ai uses DataFusion as its query execution engine. By default, DataFusion does not enforce strict memory limits, which can lead to unbounded usage. Spice.ai addresses this through:

- **Memory Budgeting**: Limits memory per query execution. Queries exceeding the limit return an error. See [Spicepod Configuration](spicepod/index.md) for details.
- **Spill-to-Disk**: Operators such as Sort, Join, and GroupByHash spill intermediate results to disk when memory limits are exceeded, preventing out-of-memory errors.

## Embedded Data Accelerators

Spice.ai integrates with embedded accelerators like [SQLite](/docs/components/data-accelerators/sqlite.md) and [DuckDB](/docs/components/data-accelerators/duckdb.md), each with unique memory considerations:

- **SQLite**: Lightweight and efficient for smaller datasets. Does not support intermediate spilling; datasets must fit in memory or use application-level paging.
- **DuckDB**: Designed for larger datasets and complex queries. Manages memory through streaming execution, intermediate spilling, and buffer management. See [DuckDB Data Accelerator](/docs/components/data-accelerators/duckdb.md) for more details.

## Kubernetes Memory Configuration

Configure appropriate memory requests and limits in Kubernetes pod specifications to ensure resource availability:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spice-ai-pod
spec:
  containers:
    - name: spice-ai-container
      image: spiceai/spiceai:latest-models
      resources:
        requests:
          memory: '8Gi'
          cpu: '4'
        limits:
          memory: '8Gi' # cap usage to match the request
```

## Monitoring and Profiling

Use observability tools to monitor and profile memory usage regularly. This helps identify and resolve potential bottlenecks promptly.

By following these guidelines, developers can manage memory resources effectively, ensuring Spice.ai deployments remain performant, stable, and reliable.
Spice resource requirements, particularly memory, are highly dependent on workload.
| Refresh Mode | Recommended Memory |
| --- | --- |
| `refresh_mode: full` | 2.5x the dataset size |
| `refresh_mode: append` | 1.5x the dataset size |

See [Memory Management and Best Practices](memory.md) for a detailed guide on memory considerations.

## Additional Considerations
