Possible misuse of C API causing WAL files to remain intact and make DB inaccessible #83
This was resolved in the linked thread.
@jraymakers Just wanted to say: the new […] is awesome work! Thanks. May I suggest adding the usage of […] to the docs?
Glad to hear it! I'll think about how I can make the different ways of running queries and reading results more clear. There are a lot of combinations, so it's tricky to document them all clearly. I did add a bit about […].
@jraymakers Yep, I think the many very similar-looking APIs add a bit of cognitive load when deciding which one to use. Maybe grouping them by use case would help.
@jraymakers Was trying the new […].

Test code:

async function run7(connection) {
const dataFilePaths = [
'/Users/nitinbansal/Downloads/mix data/generated.csv',
'/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
];
const tableNames = ['table_1', 'table_2'];
await useConnection(DB_FILE_PATH, async (connection) => {
await dropTables(connection, true);
console.log(`Tables dropped. Loading data and getting count...`);
await loadFile(connection, dataFilePaths[0], tableNames[0], 3);
await loadFile(connection, dataFilePaths[1], tableNames[1], 3);
let result = null;
for(let k = 0; k < 200; k++) {
console.log(`iteration: ${k+1}`);
result = await runUsingStreamAndReadUntil(connection, `select * from table_1`, 2048);
console.log(`selected row count ${result?.rows?.length}`)
result = await runUsingStreamAndReadUntil(connection, `select * from table_2`, 2048);
console.log(`selected row count ${result?.rows?.length}`)
}
});
}
async function useConnection(dbFilePath, callback) {
let connection = null;
try {
connection = await initConnection(dbFilePath);
return await callback(connection); // Return the result of the callback
} catch (error) {
console.error("Error while using connection:", error);
throw error;
} finally {
if(connection) {
connection.disconnect();
// connection = null; // todo: prod: use close() if available
}
}
}
async function initConnection(dbFilePath) {
const instance = await DuckDBInstance.create(dbFilePath);
return await instance.connect();
}
async function runUsingStreamAndReadUntil(connection, query, maxRowsToFetch = 2048) {
const result = await connection.streamAndReadUntil(query, maxRowsToFetch);
const collectedRows = [];
while(true) {
const rows = result.getRows();
if (!rows || rows.length === 0) {
break;
}
console.log(`got ${rows.length} rows`);
rows.forEach(row => collectedRows.push(row));
}
return collectedRows;
}

Source file details: […]
Also, from both the docs and the PR, it's not clear how to proceed and use […]:

while (true) {
const chunk = await result.fetchChunk();
if (!chunk || chunk.rowCount === 0) {
break;
}
for (const row of chunk.getRows()) {
collectedRows.push(row);
}
}
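For context, that fragment completed into a self-contained helper looks like this (an editor's sketch; the helper name readAllViaChunks is made up here, but stream() and fetchChunk() are the calls shown later in this thread):

// Sketch: drain a streaming result chunk by chunk.
// fetchChunk() yields the next chunk, or an empty/absent chunk when
// the result is exhausted.
async function readAllViaChunks(connection, query) {
  const result = await connection.stream(query);
  const collectedRows = [];
  while (true) {
    const chunk = await result.fetchChunk();
    if (!chunk || chunk.rowCount === 0) {
      break;
    }
    for (const row of chunk.getRows()) {
      collectedRows.push(row);
    }
  }
  return collectedRows;
}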
@jraymakers Continuing with the above... I know it's a heap-related error, but the same code using […]
It looks like you're running into a 2 GB heap limit in your Node environment. Since one of your files is 1 GB and you're creating multiple DuckDB instances in a loop, it's not very surprising you'd hit this limit. A few options/recommendations: […]
I agree the documentation of the various ways to run SQL and get results could be improved, but I suspect that isn't what's causing your problem here. If you think it is, please send the exact differences. The […]
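One pattern consistent with that advice is to create the DuckDBInstance once and reuse it, rather than creating instances inside a loop. A minimal editor's sketch (the file path and query are placeholders; only calls already shown in this thread are used):

import { DuckDBInstance } from '@duckdb/node-api';

// Create the instance and connection once, up front.
const instance = await DuckDBInstance.create('./db.bin'); // placeholder path
const connection = await instance.connect();

for (let k = 0; k < 200; k++) {
  // Reuse the same connection; no per-iteration instance creation.
  const result = await connection.streamAndReadUntil('select * from table_1', 2048);
  console.log(result.getRows().slice(0, 2048).length);
}

connection.disconnect();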
@jraymakers Thanks for the info. A few points:
I'm mostly from a Java background, so I'm a bit biased to think that way, but the current behaviour does seem odd, where the API is not fully doing what it says. Attaching a GIF recording showing how […]
Attaching a video recording link instead; this shows the observed difference beyond doubt:
https://drive.google.com/file/d/1HM3J0DKNvY8FhN3EaBiML_DvWZltrtCI/view?usp=sharing

Also, here's the code (I have packed all relevant methods into a single file for easier testing):

import { DuckDBInstance } from "@duckdb/node-api";
import fs from 'node:fs';
async function initConnection(dbFilePath) {
const instance = await DuckDBInstance.create(dbFilePath);
return await instance.connect();
}
async function runUsingStreamAndReadUntil(connection, query, maxRowsToFetch = 2048) {
const result = await connection.streamAndReadUntil(query, maxRowsToFetch);
const collectedRows = [];
while(true) {
const rows = result.getRows();
if (!rows || rows.length === 0) {
break;
}
for (const row of rows) {
collectedRows.push(row);
}
}
return collectedRows;
}
async function runUsingStream(connection, query, maxRowsToFetch = 2048) {
const result = await connection.stream(query);
const collectedRows = [];
let rowsFetched = 0;
while (maxRowsToFetch < 0 || rowsFetched < maxRowsToFetch) {
const chunk = await result.fetchChunk();
if (!chunk || chunk.rowCount === 0) {
break;
}
for (const row of chunk.getRows()) {
if (maxRowsToFetch > -1 && rowsFetched >= maxRowsToFetch) break;
collectedRows.push(row);
rowsFetched++;
}
}
return collectedRows;
}
async function loadFile(connection, filepath, tableName) {
let query = `CREATE TABLE ${tableName} AS SELECT * FROM read_csv_auto('${filepath}')`;
await connection.run(query);
}
async function deleteFile(filePath) {
try {
fs.unlinkSync(filePath);
} catch (err) {
if (err.code !== 'ENOENT') {
throw err;
} else {
// ignore
}
}
}
async function main() {
const DB_FILE_PATH = './db.bin';
await deleteFile(DB_FILE_PATH);
await new Promise(resolve => setTimeout(resolve, 1000));
const dataFilePaths = [
'/Users/nitinbansal/Downloads/mix data/generated.csv',
'/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
];
const tableNames = ['table_1', 'table_2'];
const connection = await initConnection(DB_FILE_PATH);
await loadFile(connection, dataFilePaths[0], tableNames[0]);
await loadFile(connection, dataFilePaths[1], tableNames[1]);
let rows = null;
for(let k = 0; k < 200; k++) {
console.log(`iteration: ${k+1}`);
// runUsingStreamAndReadUntil
// runUsingStream
rows = await runUsingStreamAndReadUntil(connection, `select * from table_1`, 2048);
console.log(`selected row count ${rows.length}`)
rows = await runUsingStreamAndReadUntil(connection, `select * from table_2`, 2048);
console.log(`selected row count ${rows.length}`)
}
}
main().catch(console.error);

And here are the data file links (please […])
Thanks for the detailed reproduction! I think I see what's going on now. The problem is a misunderstanding about how the […] works. So, the […]. (Note that […].) I apologize for not making this distinction clearer. Obviously the various methods could use more documentation. I'll move that up in my priority list.
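To make the distinction concrete, here is an editor's sketch of the semantics implied by the fix confirmed further down in this thread: getRows() is assumed to return a snapshot of the rows the reader has already buffered, not to consume them, so polling it in a loop never terminates.

// Given an open connection (sketch; the table name is a placeholder):
const result = await connection.streamAndReadUntil('select * from table_1', 2048);
const first = result.getRows();
const second = result.getRows();
console.log(first.length === second.length); // true: nothing was consumed

// Hence the reproduction's loop
//   while (true) { const rows = result.getRows(); ... }
// never sees an empty array, keeps appending the same rows, and
// eventually exhausts the heap.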
Regarding your other comments about connections: the way DuckDB connections behave is different from other databases, and it may take some getting used to. While the lifetime of some state is tied to a connection, such as temporary objects, most in-memory state has a lifetime tied to the instance, not the connection. An instance can have more than one connection, and closing a connection does not affect the instance-level state. Specifically, if you create a (non-temporary) table using one connection to an instance, and then close that connection and create a new connection to the same instance, that table will still be around. The way to remove a table from an instance is either to […].

Additionally, connections in DuckDB are not network connections; they're just objects in memory. They cannot go stale. There's no need for, or benefit from, issuing occasional queries (e.g. […]).

In my understanding, these are some recommended patterns of use for DuckDB: […]
(Also, some context: While I communicate frequently with the DuckDB team as part of my work, both as a member of the MotherDuck team and as an author of Node Neo, I didn't build DuckDB nor do I control its design. The above is what I've learned, not something I have any control over.)
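A small sketch of that instance-level behavior (editor's illustration; the in-memory path and table name are placeholders, and DROP TABLE stands in for the elided removal step):

import { DuckDBInstance } from '@duckdb/node-api';

// One instance, two successive connections.
const instance = await DuckDBInstance.create(':memory:');

const conn1 = await instance.connect();
await conn1.run('CREATE TABLE t AS SELECT 42 AS n'); // non-temporary table
conn1.disconnect(); // closing the connection does not remove the table

const conn2 = await instance.connect();
const reader = await conn2.streamAndReadUntil('SELECT n FROM t', 10);
console.log(reader.getRows().length); // 1: the table survived the first connection

await conn2.run('DROP TABLE t'); // removing it explicitly does work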
@jraymakers Thanks, verified working without crashing with this:

const result = await connection.streamAndReadUntil(query, maxRowsToFetch);
return result.getRows().slice(0, maxRowsToFetch);

Regarding the single connection: if I can rely on it not going stale (since it's just like any other variable), that makes things much simpler. And yes, the docs do need an update. Thanks for your time.
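For reference, folding that fix back into the earlier helper might look like this (editor's sketch; streamAndReadUntil reads until at least maxRowsToFetch rows are buffered, and a chunk boundary can overshoot the target, so the slice trims any extra rows):

async function runUsingStreamAndReadUntil(connection, query, maxRowsToFetch = 2048) {
  // Read chunks until at least maxRowsToFetch rows are buffered
  // (or the result is exhausted), then snapshot and trim once.
  const result = await connection.streamAndReadUntil(query, maxRowsToFetch);
  return result.getRows().slice(0, maxRowsToFetch);
}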
Here is the context (have a look at the reply on this too):
duckdb/duckdb#10002 (comment)