Query Cancellation and query lifecycle clarification #74

Closed
Mike-Dax opened this issue Dec 17, 2024 · 12 comments

Comments

@Mike-Dax (Contributor) commented Dec 17, 2024

I'm unfamiliar with the specific lifecycle of a query within DuckDB.

Is it possible to cancel an ongoing, streaming query (perhaps by not calling await result.fetchChunk() when there are chunks left, and by letting the result become garbage collected / by explicitly destroying the result)? If so, could that explicit destruction be considered as an addition to #55?

Is 'backpressure' applied by not calling fetchChunk? If, say, many rows are selected by a query, when exactly are they 'materialised' relative to the fetchChunk call? If the database is remote, for example, would a 'cancelled' query prevent later HTTP calls?

@jraymakers (Contributor)

The best way to cancel query execution is to call connection.interrupt().

Each connection can only execute one query at a time. (Executing queries in parallel is one of the main reasons to create multiple connections.) Calling connection.interrupt() will cause that connection's active query to end immediately with an error (with the message "Interrupted!").

If you run another query on a connection with an active query, the old one will be interrupted. That's another way to cancel, though it's not as explicit as calling interrupt() directly.

I believe if you run a query "normally" (i.e. in non-streaming mode, using, say, connection.run()), then DuckDB will materialize the results eagerly. In other words, DuckDB will calculate and materialize the results as fast as it can, regardless of when you "fetch" them. However, if you use a "pending" result, by creating a prepared statement (e.g. with connection.prepare(sql)) and calling prepared.start(), then DuckDB will only execute the pipeline just enough to generate the first chunk or two, and then the next chunk whenever you "fetch" one. So, in this mode, you can effectively pause execution by not fetching chunks.

You can have even more fine-grained control over query execution by using the pending.runTask() method to run one "task" at a time. Each "task" is a bit of work scheduled by DuckDB. You can run tasks until the result is ready (i.e. runTask() returns RESULT_READY), or you can stop (or pause) before then. This lets you stop (or pause) execution even before the first rows are ready.

@jraymakers (Contributor)

(Closing because this is a discussion, not an issue, but I'm happy to continue discussing.)

@Mike-Dax (Contributor, Author)

Ah thank you, that's very helpful. The lazy evaluation of result.fetchChunk combined with connection.interrupt() is exactly what I need.

I'm writing some tests to explore these edges. Are queries like "from range(0, 1_000_000)" an idiomatic way to test the streaming capabilities? Fetching the first 1 million rows of the query "from range(0, 10_000_000)" takes about 100ms for me, but fetching the first 1 million rows of "from range(0, 1_000_000_000)" takes about 2 seconds. The vast majority of the time is spent in await pending.getResult(). Is that behaviour expected?

Additionally, accessing connection.progress seems to return { percentage: -1, rows_processed: 0n, total_rows_to_process: 0n } regardless of the query methods used.

@jraymakers (Contributor) commented Dec 19, 2024

Interesting. It seems either something changed, or I was mistaken: I thought prepared.start(), which calls duckdb_pending_prepared, would create a streaming result. It turns out it creates a pending result, but passes false for allow_stream_result, which makes it a non-streaming result. There's another C API, duckdb_pending_prepared_streaming, which creates a streaming result, but I missed it because it's deprecated; I'm avoiding deprecated methods in Node Neo.

Unfortunately I don't see a non-deprecated way to create a streaming result through the C API. Let me ask around and see what the best path forward is.

@jraymakers (Contributor)

Regarding connection.progress: I've also observed what you report, that it always seems to return -1 and 0. But then I learned that there's a config setting that needs to be enabled for progress reporting to work: enable_progress_bar. I haven't fully tested this yet, but I believe if you set enable_progress_bar=true, then connection.progress will work.

@Mike-Dax (Contributor, Author) commented Dec 19, 2024

> Regarding connection.progress: I've also observed what you report, that it always seems to return -1 and 0. But then I learned that there's a config setting that needs to be enabled for progress reporting to work: enable_progress_bar. I haven't fully tested this yet, but I believe if you set enable_progress_bar=true, then connection.progress will work.

Hmm, I've run await connection.run("SET enable_progress_bar = true;"); and validated it's set via SELECT current_setting('enable_progress_bar') just before running my queries and I get the same behaviour, with both connection.run style queries and the connection.prepare, result.fetchChunk style queries.

> Interesting. It seems either something changed, or I was mistaken: I thought prepared.start(), which calls duckdb_pending_prepared, would create a streaming result. It turns out it creates a pending result, but passes false for allow_stream_result, which makes it a non-streaming result. There's another C API, duckdb_pending_prepared_streaming, which creates a streaming result, but I missed it because it's deprecated; I'm avoiding deprecated methods in Node Neo.
>
> Unfortunately I don't see a non-deprecated way to create a streaming result through the C API. Let me ask around and see what the best path forward is.

Thank you for looking into it. Streaming results are a very desirable feature for our use case.

For some context, I'm working on a low-latency, real-time signal processing and control UI for hardware. It has an incremental compute and rendering engine. I'm considering DuckDB as the primary save file format (to extend the use case beyond just real-time). Appender support gets us into the territory of write performance I'm looking for. Streaming results are a good fit for the incremental compute and rendering engine. Additionally, the ability to offload more traditional aggregations of historical data to DuckDB is a big win.

@jraymakers (Contributor)

I also just experimented with enable_progress_bar and was also unable to get it to work. I suspect it may only report progress in some scenarios. I'll continue experimenting and let you know if I learn anything.

Definitely understand the value of streaming results. I've used them in other contexts, using the C++ API directly, so I know they work. I'm pretty sure it used to work with Node Neo (a while ago), so I think the behavior of the C API may have changed at some point after I wrote the relevant part of the bindings. I'm inquiring as to the best way to make it work again.

@jraymakers (Contributor)

Finally got something to work regarding query progress. The following script prints some meaningful progress numbers:

import { DuckDBInstance, DuckDBPendingResultState } from '@duckdb/node-api';

async function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

const instance = await DuckDBInstance.create();
const connection = await instance.connect();
await connection.run('pragma enable_progress_bar');

await connection.run('create table tbl as select range a, mod(range,10) b from range(10000000)');
await connection.run('create table tbl_2 as select range a from range(10000000)');

const query = 'select count(*) from tbl inner join tbl_2 on (tbl.a = tbl_2.a)';

const prepared = await connection.prepare(query);
const pending = prepared.start();
while (pending.runTask() !== DuckDBPendingResultState.RESULT_READY) {
  console.log('not ready', connection.progress);
  await sleep(10);
}
console.log('ready', connection.progress);
await pending.getResult();
console.log('got result', connection.progress);

I got those queries from a test case in the duckdb repo. It seems the query plan matters for query progress. Probably only some operators report progress.

@jraymakers (Contributor)

Regarding streaming results, it seems that, while the duckdb_pending_prepared_streaming function itself is likely to be removed in the future, the ability to create streaming results is unlikely to be removed. It will likely just be expressed in a different way. That means I can use this function with low risk that the functionality it enables will go away.

@jraymakers (Contributor)

Filed #76 to track implementing streaming results.

@jraymakers (Contributor)

You may have noticed already, but I published a new version (1.1.3-alpha.8) that provides access to streaming results. Details are in the README. Let me know how it works for you!

@Mike-Dax (Contributor, Author)

Thank you! It seems to work really well. I'm able to run queries like from range(1_000_000_000_000) and it materializes lazily and doesn't OOM.

I've written some benchmarks for the various streaming API combinations to explore the state-space, and a benchmark for query parameterisation as well. The results are pretty much as expected.

Happy holidays!
