-
Notifications
You must be signed in to change notification settings - Fork 130
Fix: FixedSizeList DuckDB exporter bug
#4610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
danking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to cause a panic you would need to actually run a duckdb query. Given the failing queries in the benchmark, I suspect you need to do something that filters or takes elements of the table.
|
So I asked Claude to research into why I was observing what I did, and I had it go look through the duckdb source code as well. Here is the results of that research. TLDR; I think we definitely need the cc @danking Understanding the
|
6f92313 to
3f23155
Compare
|
I don't think I believe the LLM. The docs seem pretty clear that the child vector lives as long as the parent vector.
Moreover, when we export the elements (the strings), the varbinview exporter explicitly increments assigns shared ownership for those buffers to the (child) vector here. I don't know what "VIEW" means in this context, I don't see that terminology in the duckdb docs at all. |
To be clear, the issue that the LLM came up with was that the child vector does not own the string buffers (what I was referring to as the view) when they get added. Note that it is correct when it says So yeah I think you are right, the issue comes from something related to the Utf8 "views" (or ownership/borrowing of the string buffers). Just so I can copy and paste this later, the I think on slack you said there might be an issue with the dictionaries? Maybe something here is unexpected for us? /* In duckdb-1.3.2/src/include/duckdb/common/types.hpp */
template <class T, typename... ARGS>
buffer_ptr<T> make_buffer(ARGS &&...args) { // NOLINT: mimic std casing
// I'm assuming this is the same as the constructor for `shared_ptr`.
return make_shared_ptr<T>(std::forward<ARGS>(args)...);
}
<-- snip -->
/* In duckdb-1.3.2/src/common/types/vector.cpp */
// Returns a shared pointer to a string buffer (I think).
VectorStringBuffer &StringVector::GetStringBuffer(Vector &vector) {
if (vector.GetType().InternalType() != PhysicalType::VARCHAR) {
throw InternalException("StringVector::GetStringBuffer - vector is not of internal type VARCHAR but of type %s",
vector.GetType());
}
if (!vector.auxiliary) {
vector.auxiliary = make_buffer<VectorStringBuffer>();
}
D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::STRING_BUFFER);
return vector.auxiliary.get()->Cast<VectorStringBuffer>();
}
<-- snip -->
void StringVector::AddBuffer(Vector &vector, buffer_ptr<VectorBuffer> buffer) {
D_ASSERT(buffer.get() != vector.auxiliary.get());
auto &string_buffer = GetStringBuffer(vector);
string_buffer.AddHeapReference(std::move(buffer));
}
void StringVector::AddHeapReference(Vector &vector, Vector &other) {
D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);
D_ASSERT(other.GetType().InternalType() == PhysicalType::VARCHAR);
if (other.GetVectorType() == VectorType::DICTIONARY_VECTOR) {
// For dictionaries, we add a heap reference to the dictionary values.
// Don't really know where the codes are added (no call to `DictionaryVector::SelVector`)...
StringVector::AddHeapReference(vector, DictionaryVector::Child(other));
return;
}
if (!other.auxiliary) {
return;
}
StringVector::AddBuffer(vector, other.auxiliary);
}At the very least this is probably enough to figure out what might be wrong? Though I think I don't know enough about the duckdb part of our codebase to solve this. cc @danking and @0ax1 since I saw you looked at ownership here a few months back |
This does not copy the shared pointer, but hands out a ref to the shared pointer. The retain count of the shared pointer is therefore not +1 after calling this function |
3f23155 to
48774ae
Compare
48774ae to
95e2754
Compare
Also adds some regression tests Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
f022387 to
cad70b1
Compare
|
We gotta follow up here on I'll land the PR to opt into DuckDB debug builds before. https://github.com/vortex-data/vortex/compare/ad/duckdb-debug |
|
@connortsui20 Why didn't you canoncalize the elements vector once in the exporter ctor? |
|
@joseph-isaacs Shouldn't the |
Tracking Issue: #4372
Changes checked out from: #4605
Fixes a bug in the DuckDB exporter where if we do not flatten / materialize the inner child of the fixed-size list before exporting the vector to DuckDB, DuckDB will segfault after running a function like
scan.Also fixes a bug where I didn't correctly append garbage values when the validity is all null (I took that from the list implementation, which doesn't need to append garbage values since null can be an empty list there). I don't actually know if this is an issue to be honest, but it seems wrong?
I was not able to figure out how to write regression tests for both of these. The early return might not actually be an issue, but the flattening at the end definitely is. However, I was not able to create a minimally reproducible test for that, and it only fails in query 7 and 8 in
statpopgen.sql.If anyone has an idea of how to recreate this easily that would be appreciated!
Finally, I simplified the tests I wrote from before since a lot were redundant.