Query Cancellation and query lifecycle clarification #74

Closed
Mike-Dax opened this issue Dec 17, 2024 · 12 comments

Comments

@Mike-Dax (Contributor) commented Dec 17, 2024

I'm unfamiliar with the specific lifecycle of a query within DuckDB.

Is it possible to cancel an ongoing, streaming query (perhaps by not calling await result.fetchChunk() when there are chunks left, and by letting the result become garbage collected / by explicitly destroying the result)? If so, could that explicit destruction be considered as an addition to #55?

Is 'backpressure' applied by not calling fetchChunk? If, say, many rows are selected by a query, when exactly are they 'materialised' relative to the fetchChunk call? If the database is remote, for example, would a 'cancelled' query prevent later HTTP calls?

@jraymakers (Contributor)

The best way to cancel query execution is to call connection.interrupt().

Each connection can only execute one query at a time. (Executing queries in parallel is one of the main reasons to create multiple connections.) Calling connection.interrupt() will cause that connection's active query to end immediately with an error (with the message "Interrupted!").

If you run another query on a connection with an active query, the old one will be interrupted. That's another way to cancel, though it's not as explicit as calling interrupt() directly.

I believe if you run a query "normally" (i.e. in non-streaming mode, using, say, connection.run()), then DuckDB will materialize the results eagerly. In other words, DuckDB will calculate and materialize the results as fast as it can, regardless of when you "fetch" them. However, if you use a "pending" result, by creating a prepared statement (e.g. with connection.prepare(sql)) and calling prepared.start(), then DuckDB will only execute the pipeline just enough to generate the first chunk or two, and then the next chunk whenever you "fetch" one. So, in this mode, you can effectively pause execution by not fetching chunks.

You can have even more fine-grained control over query execution by using the pending.runTask() method to run one "task" at a time. Each "task" is a bit of work scheduled by DuckDB. You can run tasks until the result is ready (i.e. runTask() returns RESULT_READY), or you can stop (or pause) before then. This lets you stop (or pause) execution even before the first rows are ready.

@jraymakers (Contributor)

(Closing because this is a discussion, not an issue, but I'm happy to continue discussing.)

@Mike-Dax (Contributor, Author)

Ah thank you, that's very helpful. The lazy evaluation of result.fetchChunk combined with connection.interrupt() is exactly what I need.

I'm writing some tests to explore these edges. Are queries like "from range(0, 1_000_000)" an idiomatic way to test the streaming capabilities? Fetching the first 1 million rows of the query "from range(0, 10_000_000)" takes about 100ms for me, but fetching the first 1 million rows of "from range(0, 1_000_000_000)" takes about 2 seconds. The vast majority of the time is spent in await pending.getResult(). Is that behaviour expected?

Additionally, accessing connection.progress seems to return { percentage: -1, rows_processed: 0n, total_rows_to_process: 0n } regardless of the query methods used.

@jraymakers (Contributor) commented Dec 19, 2024

Interesting. It seems either something changed, or I was mistaken: I thought prepared.start(), which calls duckdb_pending_prepared, would create a streaming result. It turns out it creates a pending result, but passes false for allow_stream_result, which makes it a non-streaming result. There's another C API, duckdb_pending_prepared_streaming, which creates a streaming result, but I missed it because it's deprecated; I'm avoiding deprecated methods in Node Neo.

Unfortunately I don't see a non-deprecated way to create a streaming result through the C API. Let me ask around and see what the best path forward is.

@jraymakers (Contributor)

Regarding connection.progress: I've also observed what you report, that it always seems to return -1 and 0. But then I learned that there's a config setting that needs to be enabled for progress reporting to work: enable_progress_bar. I haven't fully tested this yet, but I believe if you set enable_progress_bar=true, then connection.progress will work.

@Mike-Dax (Contributor, Author) commented Dec 19, 2024

> Regarding connection.progress: I've also observed what you report, that it always seems to return -1 and 0. But then I learned that there's a config setting that needs to be enabled for progress reporting to work: enable_progress_bar. I haven't fully tested this yet, but I believe if you set enable_progress_bar=true, then connection.progress will work.

Hmm, I've run await connection.run("SET enable_progress_bar = true;"); and validated it's set via SELECT current_setting('enable_progress_bar') just before running my queries and I get the same behaviour, with both connection.run style queries and the connection.prepare, result.fetchChunk style queries.

> Interesting. It seems either something changed, or I was mistaken: I thought prepared.start(), which calls duckdb_pending_prepared, would create a streaming result. It turns out it creates a pending result, but passes false for allow_stream_result, which makes it a non-streaming result. There's another C API, duckdb_pending_prepared_streaming, which creates a streaming result, but I missed it because it's deprecated; I'm avoiding deprecated methods in Node Neo.
>
> Unfortunately I don't see a non-deprecated way to create a streaming result through the C API. Let me ask around and see what the best path forward is.

Thank you for looking into it. Streaming results are a very desirable feature for our use case.

For some context, I'm working on a low-latency, real-time signal processing and control UI for hardware. It has an incremental compute and rendering engine. I'm considering DuckDB as the primary save file format (to extend the use case beyond just real-time). Appender support gets us into the territory of write performance I'm looking for. Streaming results are a good fit for the incremental compute and rendering engine. Additionally, the ability to offload more traditional aggregations of historical data to DuckDB is a big win.

@jraymakers (Contributor)

I also just experimented with enable_progress_bar and was also unable to get it to work. I suspect it may only report progress in some scenarios. I'll continue experimenting and let you know if I learn anything.

Definitely understand the value of streaming results. I've used them in other contexts, using the C++ API directly, so I know they work. I'm pretty sure it used to work with Node Neo (a while ago), so I think the behavior of the C API may have changed at some point after I wrote the relevant part of the bindings. I'm inquiring as to the best way to make it work again.

@jraymakers (Contributor)

Finally got something to work regarding query progress. The following script prints some meaningful progress numbers:

import { DuckDBInstance, DuckDBPendingResultState } from '@duckdb/node-api';

async function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

const instance = await DuckDBInstance.create();
const connection = await instance.connect();
await connection.run('pragma enable_progress_bar');

await connection.run('create table tbl as select range a, mod(range,10) b from range(10000000)');
await connection.run('create table tbl_2 as select range a from range(10000000)');

const query = 'select count(*) from tbl inner join tbl_2 on (tbl.a = tbl_2.a)';

const prepared = await connection.prepare(query);
const pending = prepared.start();
while (pending.runTask() !== DuckDBPendingResultState.RESULT_READY) {
  console.log('not ready', connection.progress);
  await sleep(10);
}
console.log('ready', connection.progress);
await pending.getResult();
console.log('got result', connection.progress);

I got those queries from a test case in the duckdb repo. It seems the query plan matters for query progress. Probably only some operators report progress.

@jraymakers (Contributor)

Regarding streaming results, it seems that, while the duckdb_pending_prepared_streaming function itself is likely to be removed in the future, the ability to create streaming results is unlikely to be removed. It will likely just be expressed in a different way. That means I can use this function with low risk that the functionality it enables will go away.

@jraymakers (Contributor)

Filed #76 to track implementing streaming results.

@jraymakers (Contributor)

You may have noticed already, but I published a new version (1.1.3-alpha.8) that provides access to streaming results. Details are in the README. Let me know how it works for you!

@Mike-Dax (Contributor, Author)

Thank you! It seems to work really well. I'm able to run queries like from range(1_000_000_000_000) and it materializes lazily and doesn't OOM.

I've written some benchmarks for the various streaming API combinations to explore the state-space, and a benchmark for query parameterisation as well. The results are pretty much as expected.

Happy holidays!
