Commit ea55835

lukekim, y-f-u, and phillipleblanc authored
Federated queries docs (#166)
* Federated queries docs WIP
* Update Spicepod
* Update spiceaidocs/docs/federated-queries/index.md
  Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>
* more meat
* nicer looking query
* refer to quickstart for docker/postgres setup
* more limitations
* rewording acceleration
* wording
* spacing

---------

Co-authored-by: yfu <fevin86@gmail.com>
Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>
1 parent cf030ea commit ea55835

1 file changed

+160
-1
lines changed
  • spiceaidocs/docs/federated-queries

Lines changed: 160 additions & 1 deletion
@@ -1,8 +1,167 @@
---
title: 'Federated Queries'
sidebar_label: 'Federated Queries'
-description: ''
+description: 'Learn how to use federated queries in Spice.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

Spice supports federated queries, which let you join and combine data from multiple data sources in a single SQL query. This lets you aggregate and analyze information wherever it is stored.

Spice supports federated queries across databases (PostgreSQL, MySQL, etc.), data warehouses (Databricks, Snowflake, BigQuery, etc.), and data lakes (S3, MinIO, etc.). See [Data Connectors](/data-connectors/index.md) for the full list of supported sources.

### Getting Started

#### Prerequisites

- Install Spice by following the [installation instructions](/getting-started/index.md).
- Run through the [federation quickstart guide](https://github.com/spiceai/quickstarts/blob/trunk/federation/README.md) to complete the required Docker and local PostgreSQL setup.
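
Before starting, it can help to confirm that the tools named in the prerequisites are on your `PATH`. The following is a minimal sketch, not part of the official setup: `have_tool` is a hypothetical helper defined here for illustration, and the check assumes the binaries are named `spice` and `docker`.

```bash
# Report whether a command-line tool is available on the PATH.
# `have_tool` is a hypothetical helper, not part of the Spice CLI.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Assumed binary names: `spice` (Spice CLI) and `docker` (for the quickstart setup).
for tool in spice docker; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

If a tool is reported as missing, revisit the corresponding prerequisite before continuing.
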

#### Steps

To get started with federated queries using Spice, follow these steps:

**Step 1.** Create a new Spice app called `demo`.

```bash
# Create Spice app "demo"
spice init demo

# Change to demo directory.
cd demo
```

**Step 2.** Start the Spice runtime.

```bash
spice run
```

**Step 3.** Open a new terminal and add the `spiceai/fed-demo` Spicepod.

```bash
spice add spiceai/fed-demo
```

The Spice runtime output shows several datasets being loaded.

**Step 4.** Show available tables and query them, regardless of source.

```bash
# Start the Spice SQL REPL.
spice sql
```

Show the available tables:

```sql
show tables;
```

Execute the queries:

```sql
-- Query S3 (Parquet)
SELECT *
FROM s3_source LIMIT 10;

-- Query S3 (Parquet) accelerated
SELECT *
FROM s3_source_accelerated LIMIT 10;

-- Query PostgreSQL
SELECT *
FROM pg_source LIMIT 10;

-- Query Dremio
SELECT *
FROM dremio_source LIMIT 10;

-- Query Dremio accelerated
SELECT *
FROM dremio_source_accelerated LIMIT 10;
```

**Step 5.** Join tables across remote sources and query them.

```sql
-- Query across S3, PostgreSQL, and Dremio
sql> WITH order_numbers AS (
  SELECT DISTINCT order_number
  FROM s3_source
  WHERE order_number IN (
    SELECT order_number
    FROM pg_source
  )
)
SELECT
  AVG(total_amount),
  passenger_count
FROM dremio_source
WHERE passenger_count IN (
  SELECT DISTINCT order_number % 10 AS num_of_passenger
  FROM order_numbers
)
GROUP BY passenger_count;
+---------------------------------+-----------------+
| AVG(dremio_source.total_amount) | passenger_count |
+---------------------------------+-----------------+
| 17.219515789473693              | 4               |
| 22.401176470588233              | 6               |
| 21.12263157894737               | 5               |
| 17.441359661495103              | 3               |
| 23.2                            | 0               |
| 17.714222499449477              | 2               |
| 15.394881909237105              | 1               |
+---------------------------------+-----------------+

Query took: 3.345525166 seconds. 7/7 rows displayed.
```

**Step 6.** Join tables across locally accelerated sources and query them.

```sql
-- Query across S3 accelerated, PostgreSQL, and Dremio accelerated
sql> WITH order_numbers AS (
  SELECT DISTINCT order_number
  FROM s3_source_accelerated
  WHERE order_number IN (
    SELECT order_number
    FROM pg_source
  )
)
SELECT
  AVG(total_amount),
  passenger_count
FROM dremio_source_accelerated
WHERE passenger_count IN (
  SELECT DISTINCT order_number % 10 AS num_of_passenger
  FROM order_numbers
)
GROUP BY passenger_count;
+---------------------------------------------+-----------------+
| AVG(dremio_source_accelerated.total_amount) | passenger_count |
+---------------------------------------------+-----------------+
| 21.12263157894737                           | 5               |
| 17.219515789473693                          | 4               |
| 22.401176470588233                          | 6               |
| 17.441359661495113                          | 3               |
| 23.2                                        | 0               |
| 17.714222499449434                          | 2               |
| 15.394881909237196                          | 1               |
+---------------------------------------------+-----------------+

Query took: 0.045805958 seconds. 7/7 rows displayed.
```

### Acceleration

The query in step 5 returned results from federated remote data sources, but it took about 3.3 seconds because of data transfer overhead.

Step 6 runs the same query against locally materialized datasets that use [Data Accelerators](/data-accelerators/index.md), and it completes in about 0.05 seconds.
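
To put the difference in concrete terms, the two `Query took` timings reported in steps 5 and 6 can be compared directly. The snippet below is a small illustrative calculation, assuming `awk` is available; the variable names are chosen here for illustration, and the timings come from a single run, so your numbers will differ.

```bash
# Timings copied from the "Query took" lines in steps 5 and 6.
remote_seconds=3.345525166
accelerated_seconds=0.045805958

# awk performs the floating-point division that POSIX shell arithmetic cannot.
speedup=$(awk -v r="$remote_seconds" -v a="$accelerated_seconds" \
  'BEGIN { printf "%.1f", r / a }')

echo "Accelerated query was about ${speedup}x faster"
```

With the timings above, this prints `Accelerated query was about 73.0x faster`.
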

### Limitations

- **Query Optimization:** Filter, join, and aggregation pushdown is not supported, which can lead to suboptimal query plans.
- **Query Performance:** Without acceleration, federated queries are slower than local queries because of network latency and data transfer.
- **Query Capabilities:** Not all SQL features and data types are supported across all data sources. Queries involving more complex data types may not work as expected.
