
Possible misuse of C API causing WAL files to remain intact and make DB inaccessible #83

Closed
freakynit opened this issue Dec 26, 2024 · 12 comments

Comments

@freakynit
Contributor

Here is the context (have a look at the reply on this too):

duckdb/duckdb#10002 (comment)

@jraymakers
Contributor

This was resolved in the linked thread.

@freakynit
Contributor Author

@jraymakers Just wanted to say... the new .stream() API is insanely fast compared to the previous versions' .run() API... I saw around a 10x improvement. My use case was to run a query on the entire data set but only read the first 2048 rows from the result set.

Awesome work! Thanks..

May I suggest adding the usage of the .stream() API to the npm docs too? Currently they don't show its usage, only the call... all other APIs show how to use the result set too. I had to refer to this PR to learn how to use it: https://github.com/duckdb/duckdb-node-neo/pull/81/files

@jraymakers
Contributor

Glad to hear it!

I'll think about how I can make the different ways of running queries and reading results clearer. There are a lot of combinations, so it's tricky to document them all clearly. I did add a bit about streamAndReadUntil, which is my recommended way of doing streaming, but I probably didn't make that clear enough.

@freakynit
Contributor Author

@jraymakers Yep... I think having so many very similar-looking APIs adds a bit of cognitive load when deciding which one to use. Maybe grouping them by use case would help.

@freakynit
Contributor Author

@jraymakers I was trying the new streamAndReadUntil API... getting this crash:

No tables to drop.
Tables dropped. Loading data and getting count...

<--- Last few GCs --->

[5009:0x110008000]    32393 ms: Scavenge (reduce) 2047.7 (2082.4) -> 2047.2 (2082.9) MB, 22.29 / 0.00 ms  (average mu = 0.386, current mu = 0.387) allocation failure; 
[5009:0x110008000]    32568 ms: Mark-Compact (reduce) 2048.0 (2082.9) -> 2047.3 (2083.1) MB, 112.96 / 0.00 ms  (+ 183.2 ms in 39 steps since start of marking, biggest step 6.4 ms, walltime since start of marking 355 ms) (average mu = 0.374, current mu = 0

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0x102a3959c node::Abort() [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 2: 0x102a3979c node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 3: 0x102bbe95c v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 4: 0x102d93050 v8::internal::Heap::GarbageCollectionReasonToString(v8::internal::GarbageCollectionReason) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 5: 0x102d96f04 v8::internal::Heap::CollectGarbageShared(v8::internal::LocalHeap*, v8::internal::GarbageCollectionReason) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 6: 0x102d93968 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::internal::GarbageCollectionReason, char const*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 7: 0x102d916f0 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 8: 0x102d88344 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
 9: 0x102d88ba4 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
10: 0x102d6d800 v8::internal::Factory::AllocateRawWithAllocationSite(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationType, v8::internal::Handle<v8::internal::AllocationSite>) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
11: 0x102d7bd50 v8::internal::Factory::NewJSArrayBufferView(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::FixedArrayBase>, v8::internal::Handle<v8::internal::JSArrayBuffer>, unsigned long, unsigned long) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
12: 0x102d7c214 v8::internal::Factory::NewJSTypedArray(v8::internal::ExternalArrayType, v8::internal::Handle<v8::internal::JSArrayBuffer>, unsigned long, unsigned long, bool) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
13: 0x102be7088 v8::Uint8Array::New(v8::Local<v8::ArrayBuffer>, unsigned long, unsigned long) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
14: 0x102a10814 node::Buffer::Copy(node::Environment*, char const*, unsigned long) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
15: 0x102a1065c node::Buffer::Copy(v8::Isolate*, char const*, unsigned long) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
16: 0x102a070d4 napi_create_buffer_copy [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
17: 0x10861d6b0 DuckDBNodeAddon::get_data_from_pointer(Napi::CallbackInfo const&) [/Users/nitinbansal/dev/gitlab/nodejs/zenquery/node_modules/@duckdb/node-bindings-darwin-arm64/duckdb.node]
18: 0x10860c74c Napi::InstanceWrap<DuckDBNodeAddon>::InstanceMethodCallbackWrapper(napi_env__*, napi_callback_info__*) [/Users/nitinbansal/dev/gitlab/nodejs/zenquery/node_modules/@duckdb/node-bindings-darwin-arm64/duckdb.node]
19: 0x1029f797c v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
20: 0x10342def0 Builtins_CallApiCallback [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
21: 0x1380751f4 
22: 0x138075e74 
23: 0x103463210 Builtins_AsyncFunctionAwaitResolveClosure [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
24: 0x103510fb8 Builtins_PromiseFulfillReactionJob [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
25: 0x103452b94 Builtins_RunMicrotasks [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
26: 0x10342a3f4 Builtins_JSRunMicrotasksEntry [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
27: 0x102d004d0 v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
28: 0x102d009bc v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
29: 0x102d00b98 v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
30: 0x102d27d64 v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
31: 0x102d28500 v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
32: 0x102968c64 node::InternalCallbackScope::Close() [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
33: 0x10296877c node::CallbackScope::~CallbackScope() [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
34: 0x102a0885c (anonymous namespace)::uvimpl::Work::AfterThreadPoolWork(int) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
35: 0x102a08c50 node::ThreadPoolWork::ScheduleWork()::'lambda'(uv_work_s*, int)::operator()(uv_work_s*, int) const [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
36: 0x10340639c uv__work_done [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
37: 0x103409dec uv__async_io [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
38: 0x10341bec4 uv__io_poll [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
39: 0x10340a3b0 uv_run [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
40: 0x102969754 node::SpinEventLoopInternal(node::Environment*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
41: 0x102a79c6c node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
42: 0x102a79a08 node::NodeMainInstance::Run() [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
43: 0x102a03718 node::Start(int, char**) [/Users/nitinbansal/.nvm/versions/node/v20.11.1/bin/node]
44: 0x1831da0e0 start [/usr/lib/dyld]

Test code

async function run7(connection) {
    const dataFilePaths = [
        '/Users/nitinbansal/Downloads/mix data/generated.csv',
        '/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
    ];
    const tableNames = ['table_1', 'table_2'];

    await useConnection(DB_FILE_PATH, async (connection) => {
        await dropTables(connection, true);

        console.log(`Tables dropped. Loading data and getting count...`);

        await loadFile(connection, dataFilePaths[0], tableNames[0], 3);
        await loadFile(connection, dataFilePaths[1], tableNames[1], 3);

        let result = null;
        for(let k = 0; k < 200; k++) {
            console.log(`iteration: ${k+1}`);

            result = await runUsingStreamAndReadUntil(connection, `select * from table_1`, 2048);
            console.log(`selected row count ${result?.rows?.length}`)

            result = await runUsingStreamAndReadUntil(connection, `select * from table_2`, 2048);
            console.log(`selected row count ${result?.rows?.length}`)
        }
    });
}

async function useConnection(dbFilePath, callback) {
    let connection = null;
    try {
        connection = await initConnection(dbFilePath);
        return await callback(connection); // Return the result of the callback
    } catch (error) {
        console.error("Error while using connection:", error);
        throw error;
    } finally {
        if(connection) {
            connection.disconnect();
            // connection = null; // todo: prod: use close() if available
        }
    }
}

async function initConnection(dbFilePath) {
    const instance = await DuckDBInstance.create(dbFilePath);
    return await instance.connect();
}

async function runUsingStreamAndReadUntil(connection, query, maxRowsToFetch = 2048) {
    const result = await connection.streamAndReadUntil(query, maxRowsToFetch);

    const collectedRows = [];

    while(true) {
        const rows = result.getRows();
        if (!rows || rows.length === 0) {
            break;
        }

        console.log(`got ${rows.length} rows`);

        rows.forEach(row => collectedRows.push(row));
    }

    return collectedRows;
}

Source file details

  1. hackernews_10k_dec 2024.csv: 10K rows, 4.1MB CSV
  2. generated.csv: 9M rows, 1.03GB CSV

Also, from both the docs and the PR, it's not clear how to use the result returned from the streamAndReadUntil call. So I tried chunk-based extraction too (in addition to the row-based approach shown above)... the net result is the same for both: a crash.

while (true) {
    const chunk = await result.fetchChunk();
    if (!chunk || chunk.rowCount === 0) {
        break;
    }

    for (const row of chunk.getRows()) {
        collectedRows.push(row);
    }
}

@freakynit
Contributor Author

@jraymakers Continuing from the above...

I know it's a heap-related error, but the same code using the .stream() API works without throwing any error.

@jraymakers
Contributor

It looks like you're running into a 2 GB heap limit in your Node environment. Since one of your files is 1 GB and you're creating multiple DuckDB instances in a loop, it's not very surprising you'd hit this limit.

A few options/recommendations:

  • Don't create multiple DuckDB instances in a loop, especially if you're reading in large files. You'll likely end up with multiple copies of those large files in memory.
  • If you do want to create multiple instances (though I'm not sure why you would), then explicitly DETACH large databases/files when you're done. This, I think, is likely to free up the memory they're using. Note that disconnecting does not perform this cleanup.
  • Increase the memory limit of your Node process, using the --max-old-space-size command-line flag.

I agree the documentation of the various ways to run SQL and get results could be improved, but I suspect that isn't what's causing your problem here. If you think it is, please send the exact differences. The streamAndReadUntil helper is just a wrapper around calls to stream and fetchChunk, so it's not clear why it would be any different.
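
For illustration, a hedged sketch of the last two options (the database alias big_db and the script name repro.js are hypothetical, and an open connection as in the test code is assumed):

// Detach a large attached database once you're done with it, so the instance
// can release the memory it holds (DETACH is standard DuckDB SQL).
await connection.run('DETACH big_db');

// To raise the Node heap limit instead (e.g. to 4 GB), start the process with:
//   node --max-old-space-size=4096 repro.js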

@freakynit
Contributor Author

freakynit commented Dec 29, 2024

@jraymakers Thanks for the info. A few points:

  1. I believe disconnect/close should immediately free used memory (or mark it as freed for the OS to reclaim), since the programmer is making it explicitly clear that they are not going to use it again. This is a pretty standard expectation for connections and their associated resources.
  2. The difference between stream and streamAndReadUntil is indeed there. Since switching over to the stream API yesterday, not a single heap crash has occurred when testing with the same files.
  3. The problem with maintaining a single connection is that it forces global state in an application. I could wrap it in a global function, but another problem with a long-lived connection is that it might have become stale (for whatever reason). This can be handled by reconnecting, but then it becomes a burden on some other part of the code. Another way would be to issue a simple select 1 statement, but that would have to be issued before every lease of the connection. This is where connection pooling helps, since it can issue this test query in the background from time to time. But I can't implement that here either, because of the restriction of having just one connection. So, overall, I'm stuck.

I'm mostly from a Java background, so I'm a bit biased to think that way, but the current behaviour does seem odd, where the API isn't fully doing what it says.

Attaching a GIF recording showing how readUntil crashes whereas stream works fine, with everything else kept as-is.

Attaching it in the next comment.

@freakynit
Contributor Author

Attaching a video recording link instead. This clearly shows the observed difference:

https://drive.google.com/file/d/1HM3J0DKNvY8FhN3EaBiML_DvWZltrtCI/view?usp=sharing

Also, here's the code (I have packed all relevant methods in a single file for easier testing)

import {DuckDBInstance} from "@duckdb/node-api";
import fs from 'node:fs';

async function initConnection(dbFilePath) {
    const instance = await DuckDBInstance.create(dbFilePath);
    return await instance.connect();
}

async function runUsingStreamAndReadUntil(connection, query, maxRowsToFetch = 2048) {
    const result = await connection.streamAndReadUntil(query, maxRowsToFetch);

    const collectedRows = [];

    while(true) {
        const rows = result.getRows();
        if (!rows || rows.length === 0) {
            break;
        }

        for (const row of rows) {
            collectedRows.push(row);
        }
    }

    return collectedRows;
}

async function runUsingStream(connection, query, maxRowsToFetch = 2048) {
    const result = await connection.stream(query);

    const collectedRows = [];
    let rowsFetched = 0;

    while (maxRowsToFetch < 0 || (maxRowsToFetch > -1 && rowsFetched < maxRowsToFetch)) {
        const chunk = await result.fetchChunk();
        if (!chunk || chunk.rowCount === 0) {
            break;
        }

        for (const row of chunk.getRows()) {
            if (maxRowsToFetch > -1 && rowsFetched >= maxRowsToFetch) break;
            collectedRows.push(row);
            rowsFetched++;
        }
    }

    return collectedRows;
}

async function loadFile(connection, filepath, tableName) {
    let query = `CREATE TABLE ${tableName} AS SELECT * FROM read_csv_auto('${filepath}')`;
    await connection.run(query);
}

async function deleteFile(filePath) {
    try {
        await fs.unlinkSync(filePath);
    } catch (err) {
        if (err.code !== 'ENOENT') {
            throw err;
        } else {
            // ignore
        }
    }
}

async function main() {
    const DB_FILE_PATH = './db.bin';
    await deleteFile(DB_FILE_PATH);
    await new Promise(resolve => setTimeout(resolve, 1000));

    const dataFilePaths = [
        '/Users/nitinbansal/Downloads/mix data/generated.csv',
        '/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
    ];
    const tableNames = ['table_1', 'table_2'];

    const connection = await initConnection(DB_FILE_PATH);

    await loadFile(connection, dataFilePaths[0], tableNames[0]);
    await loadFile(connection, dataFilePaths[1], tableNames[1]);

    let rows = null;
    for(let k = 0; k < 200; k++) {
        console.log(`iteration: ${k+1}`);

        // runUsingStreamAndReadUntil
        // runUsingStream
        rows = await runUsingStreamAndReadUntil(connection, `select * from table_1`, 2048);
        console.log(`selected row count ${rows.length}`)

        rows = await runUsingStreamAndReadUntil(connection, `select * from table_2`, 2048);
        console.log(`selected row count ${rows.length}`)
    }
}

main().catch(console.error);

And here are the data file links (please uncompress the larger file):

  1. Hackernews: https://drive.google.com/file/d/1I-f5eDWVIOYonBXmSh-PdqgqJchijOOJ/view?usp=sharing
  2. Generated: https://drive.google.com/file/d/1kGwu62p8YAjKmEdAxCLSsp4HMl9q1Z9o/view?usp=sharing

Let me know if any other details are needed.

Thanks..

@jraymakers
Contributor

Thanks for the detailed reproduction! I think I see what's going on now.

The problem is a misunderstanding about how the getRows method of DuckDBResultReader (returned by streamAndReadUntil) behaves. It returns all the rows that have been read so far. This is different from the fetchChunk method of DuckDBResult (returned by stream), which tries to fetch the next chunk, and will return null or a zero-length chunk if there are none left to fetch.

So, the while(true) loop in runUsingStreamAndReadUntil will never terminate; it will continue to get and append the same set of rows until memory runs out.

(Note that DuckDBResult also has a getRows method, which fully materializes the result and returns all rows, even if it's a streaming result.)
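
To make the distinction concrete, here is a minimal sketch of terminating versions of both approaches (the function names are mine; the behavior is paraphrased from the description above):

// streamAndReadUntil: the reader has already read at least maxRowsToFetch rows
// (when that many are available), so call getRows() once -- looping on it just
// re-reads the same accumulated rows and never terminates.
async function readUsingStreamAndReadUntil(connection, query, maxRowsToFetch) {
    const reader = await connection.streamAndReadUntil(query, maxRowsToFetch);
    // Reading stops on a chunk boundary, so trim to the exact count if needed.
    return reader.getRows().slice(0, maxRowsToFetch);
}

// stream: fetchChunk() advances through the result and returns null or an
// empty chunk once it's exhausted, so a loop with a break is appropriate here.
async function readUsingStream(connection, query, maxRowsToFetch) {
    const result = await connection.stream(query);
    const collectedRows = [];
    while (collectedRows.length < maxRowsToFetch) {
        const chunk = await result.fetchChunk();
        if (!chunk || chunk.rowCount === 0) {
            break; // no more chunks
        }
        collectedRows.push(...chunk.getRows());
    }
    // The last chunk may overshoot the target; trim if an exact count is needed.
    return collectedRows.slice(0, maxRowsToFetch);
}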

I apologize for not making this distinction clearer. Obviously the various methods could use more documentation. I'll move that up in my priority list.

@jraymakers
Contributor

Regarding your other comments about connections:

The way DuckDB connections behave is different from other databases. It may take some getting used to. While the lifetime of some state is tied to a connection, such as temporary objects, most in-memory state has a lifetime tied to the instance, not the connection. An instance can have more than one connection. Closing a connection does not affect the instance-level state.

Specifically, if you create a (non-temporary) table using one connection to an instance, and then close that connection and create a new connection to the same instance, that table will still be around. The way to remove a table from an instance is either to DROP it or DETACH its containing database.

Additionally, connections in DuckDB are not network connections. They're just objects in memory. They cannot go stale. There's no need for, or benefit from, issuing occasional queries (e.g. select 1) on a connection.

In my understanding, these are some recommended patterns of use for DuckDB:

  • Create a single instance per process. (Using multiple instances in the same process might be useful in some advanced scenarios; if this is done, special care must be taken if the same data files are attached to both, or corruption can occur.)
  • Create multiple connections to an instance if you want to run multiple queries in parallel, or if you want different scopes for temporary objects (or other per-connection state such as prepared statements, settings, and variables).
  • Keep connections and instances around as long as you'd like. No special action is needed to keep them usable.

(Also, some context: While I communicate frequently with the DuckDB team as part of my work, both as a member of the MotherDuck team and as an author of Node Neo, I didn't build DuckDB nor do I control its design. The above is what I've learned, not something I have any control over.)
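
For illustration, a minimal sketch of the single-instance, multiple-connections pattern described above (the file path and queries are hypothetical; table names follow the test code):

import { DuckDBInstance } from '@duckdb/node-api';

// One instance per process; instance-level state (attached databases, tables)
// outlives any individual connection.
const instance = await DuckDBInstance.create('./db.bin');

// Multiple connections to the same instance, e.g. to run queries in parallel
// or to get separate scopes for temporary objects.
const connA = await instance.connect();
const connB = await instance.connect();

await Promise.all([
    connA.run('SELECT count(*) FROM table_1'),
    connB.run('SELECT count(*) FROM table_2'),
]);

// No keep-alive queries are needed; the connections stay usable for as long
// as they are kept around.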

@freakynit
Contributor Author

@jraymakers Thanks... verified it works without crashing with this:

const result = await connection.streamAndReadUntil(query, maxRowsToFetch);
return result.getRows().slice(0, maxRowsToFetch);

Regarding a single connection, if I can rely on it not going stale (since it's just like any other variable), that makes things much simpler.

And yes, docs do need an update.

Thanks for your time..
