Skip to content

Conversation

@connortsui20
Copy link
Contributor

Tracking Issue: #4372

Changes checked out from: #4605

Fixes a bug in the DuckDB exporter where if we do not flatten / materialize the inner child of the fixed-size list before exporting the vector to DuckDB, DuckDB will segfault after running a function like scan.

Also fixes a bug where I didn't correctly append garbage values when the validity is all null (I took that from the list implementation, which doesn't need to append garbage values since null can be an empty list there). I don't actually know if this is an issue to be honest, but it seems wrong?

I was not able to figure out how to write regression tests for both of these. The early return might not actually be an issue, but the flattening at the end definitely is. However, I was not able to create a minimally reproducible test for that, and it only fails in query 7 and 8 in statpopgen.sql.

-- 7. Read just one variant (this is the sixth one).
SELECT *
  FROM statpopgen
 WHERE "CHROM" == 'chr21'
   AND "POS" == 5030278;
-- 8. Read a 700 base-pair window of variants.
SELECT *
  FROM statpopgen
 WHERE "CHROM" == 'chr21'
   AND "POS" >= 5030300
   AND "POS" <= 5031000;

If anyone has an idea of how to recreate this easily that would be appreciated!

Finally, I simplified the tests I wrote from before since a lot were redundant.

@connortsui20 connortsui20 added the changelog/fix A bug fix label Sep 12, 2025
@codecov
Copy link

codecov bot commented Sep 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.11%. Comparing base (4765a21) to head (cad70b1).
⚠️ Report is 1 commits behind head on develop.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to cause a panic you would need to actually run a duckdb query. Given the failing queries in the benchmark, I suspect you need to do something that filters or takes elements of the table.

@connortsui20
Copy link
Contributor Author

So I asked Claude to research into why I was observing what I did, and I had it go look through the duckdb source code as well. Here is the results of that research. TLDR; I think we definitely need the flatten as there is no equivalent reserve for ARRAY type in DuckDB.

cc @danking

Understanding the flatten() Requirement for FixedSizeList with UTF8 in DuckDB

Executive Summary

The flatten() call on line 77 in vortex-duckdb/src/exporter/fixed_size_list.rs is absolutely critical for preventing segmentation faults when exporting FixedSizeList arrays containing UTF8 strings to DuckDB. Without it, DuckDB will crash when querying the data.

This requirement is due to fundamental architectural differences between how DuckDB's LIST (variable-size) and ARRAY (fixed-size) types manage their child vectors and string buffers.

The Problem

When we export a FixedSizeList<UTF8> to DuckDB without calling flatten():

  • The export appears to succeed
  • But when DuckDB queries the data, it segfaults
  • This ONLY happens with UTF8/string data, not with primitive types like i32
  • Regular List<UTF8> works fine without flatten()

Root Cause: LIST vs ARRAY Architecture in DuckDB

Key Insight

DuckDB's LIST and ARRAY types have fundamentally different memory management models for their child vectors:

  • LIST: Dynamically allocates and owns its child vector
  • ARRAY: Returns a view/reference to an embedded structure

This difference is critical when dealing with string data that requires auxiliary buffers.

Memory Layout Comparison

LIST Memory Structure

┌──────────────────────────────────────────────────────────────┐
│                         LIST Vector                          │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Header:                                                  │ │
│ │  - Type: LIST                                            │ │
│ │  - Count: 3                                              │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Validity Mask: [1, 0, 1]  (bit array)                    │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ List Entries Array:                                      │ │
│ │  ┌─────────────┬─────────────┐                           │ │
│ │  │ offset: 0   │ length: 3   │ <- List 0                 │ │
│ │  ├─────────────┼─────────────┤                           │ │
│ │  │ offset: 3   │ length: 0   │ <- List 1 (NULL)          │ │
│ │  ├─────────────┼─────────────┤                           │ │
│ │  │ offset: 3   │ length: 2   │ <- List 2                 │ │
│ │  └─────────────┴─────────────┘                           │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Child Vector Pointer: 0x7fff8c024100 ──────────────────┐ │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
                                                           |
                                                           v
        ┌─────────────────────────────────────────────────────────────────┐
        │                  Child Vector (HEAP ALLOCATED)                  │
        │ ┌─────────────────────────────────────────────────────────────┐ │
        │ │ Header:                                                     │ │
        │ │  - Type: VARCHAR                                            │ │
        │ │  - Size: 5 (total elements across all lists)                │ │
        │ │  - Capacity: 100 (allocated via list_vector_reserve)        │ │
        │ └─────────────────────────────────────────────────────────────┘ │
        │                                                                 │
        │ ┌─────────────────────────────────────────────────────────────┐ │
        │ │ String Views Array:                                         │ │
        │ │  [0]: {len: 15, ptr: 0x8a001000}  <- "hello world..."       │ │
        │ │  [1]: {len: 20, ptr: 0x8a001100}  <- "another long string"  │ │
        │ │  [2]: {len: 8,  inline: "short"}   <- Inlined (<= 12 bytes) │ │
        │ │  [3]: {len: 25, ptr: 0x8a001200}  <- "more string data..."  │ │
        │ │  [4]: {len: 10, inline: "test data"} <- Inlined             │ │
        │ └─────────────────────────────────────────────────────────────┘ │
        │                                                                 │
        │ ┌─────────────────────────────────────────────────────────────┐ │
        │ │ String Buffer List (OWNED):                                 │ │
        │ │  ┌──────────────────────────────────────────┐               │ │
        │ │  │ Buffer 0: 0x8a001000                     │               │ │
        │ │  │  "hello world..."                        │               │ │
        │ │  ├──────────────────────────────────────────┤               │ │
        │ │  │ Buffer 1: 0x8a001100                     │               │ │
        │ │  │  "another long string"                   │               │ │
        │ │  ├──────────────────────────────────────────┤               │ │
        │ │  │ Buffer 2: 0x8a001200                     │               │ │
        │ │  │  "more string data..."                   │               │ │
        │ │  └──────────────────────────────────────────┘               │ │
        │ └─────────────────────────────────────────────────────────────┘ │
        └─────────────────────────────────────────────────────────────────┘

ARRAY Memory Structure (Without flatten - BROKEN)

┌──────────────────────────────────────────────────────────────┐
│                         ARRAY Vector                         │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Header:                                                  │ │
│ │  - Type: ARRAY                                           │ │
│ │  - Array Size: 3 (fixed elements per array)              │ │
│ │  - Count: 2 (number of arrays)                           │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Validity Mask: [1, 1]  (both arrays valid)               │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ NO Offset/Length Data!                                   │ │
│ │ (Array uses fixed positioning: index * array_size)       │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Child Vector (EMBEDDED IN PARENT STRUCT):                │ │
│ │ ┌────────────────────────────────────────────────────┐   │ │
│ │ │ This is NOT a pointer! It's part of the struct:    │   │ │
│ │ │                                                    │   │ │
│ │ │ struct ArrayVector {                               │   │ │
│ │ │     ValidityMask validity;                         │   │ │
│ │ │     Vector child;  // <- EMBEDDED, not allocated   │   │ │
│ │ │     ...                                            │   │ │
│ │ │ }                                                  │   │ │
│ │ └────────────────────────────────────────────────────┘   │ │
│ │                                                          │ │
│ │ ┌────────────────────────────────────────────────────┐   │ │
│ │ │ Child Vector Data (pre-allocated):                 │   │ │
│ │ │  - Type: VARCHAR                                   │   │ │
│ │ │  - Fixed Size: STANDARD_VECTOR_SIZE * array_size   │   │ │
│ │ │  - This is a VIEW, not owned memory!               │   │ │
│ │ └────────────────────────────────────────────────────┘   │ │
│ │                                                          │ │
│ │ ┌────────────────────────────────────────────────────┐   │ │
│ │ │ String Views (in embedded buffer):                 │   │ │
│ │ │  [0]: {len: 15, ptr: 0x9b002000} <- Points to buff │   │ │
│ │ │  [1]: {len: 20, ptr: 0x9b002100} <- Points to buff │   │ │
│ │ │  [2]: {len: 8,  inline: "short"}                   │   │ │
│ │ │  [3]: {len: 25, ptr: 0x9b002200} <- Points to buff │   │ │
│ │ │  [4]: {len: 10, inline: "test"}                    │   │ │
│ │ │  [5]: {len: 18, ptr: 0x9b002300} <- Points to buff │   │ │
│ │ └────────────────────────────────────────────────────┘   │ │
│ │                                                          │ │
│ │ ┌────────────────────────────────────────────────────┐   │ │
│ │ │ String Buffers (PROBLEM - attached to VIEW!):      │   │ │
│ │ │  These are added via add_string_buffer() but:      │   │ │
│ │ │  - Attached to a view, not owned vector            │   │ │
│ │ │  - Unclear lifetime/ownership                      │   │ │
│ │ │  - May be freed when export function returns!      │   │ │
│ │ │                                                    │   │ │
│ │ │  Buffer 0: 0x9b002000 <- INVALID after return!     │   │ │
│ │ │  Buffer 1: 0x9b002100 <- INVALID after return!     │   │ │
│ │ │  Buffer 2: 0x9b002200 <- INVALID after return!     │   │ │
│ │ │  Buffer 3: 0x9b002300 <- INVALID after return!     │   │ │
│ │ └────────────────────────────────────────────────────┘   │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

[WARNING] DANGER: When export function returns, string buffer pointers become invalid!
         DuckDB will SEGFAULT when trying to access these strings!

ARRAY Memory Structure (With flatten - FIXED)


┌───────────────────────────────────────────────────────────────┐
│                         ARRAY Vector                          │
│ ┌──────────────────────────────────────────────────────────┐  │
│ │ Header:                                                  │  │
│ │  - Type: ARRAY                                           │  │
│ │  - Array Size: 3                                         │  │
│ │  - Count: 2                                              │  │
│ └──────────────────────────────────────────────────────────┘  │
│                                                               │
│ ┌──────────────────────────────────────────────────────────┐  │
│ │ Child Vector (AFTER flatten()):                          │  │
│ │ ┌────────────────────────────────────────────────────┐   │  │
│ │ │ flatten() creates a NEW FLAT VECTOR:               │   │  │
│ │ │  1. Allocates new buffer                           │   │  │
│ │ │  2. Copies all data                                │   │  │
│ │ │  3. Properly attaches string buffers               │   │  │
│ │ │  4. Ensures ownership and persistence              │   │  │
│ │ └────────────────────────────────────────────────────┘   │  │
│ │                                                          │  │
│ │ ┌────────────────────────────────────────────────────┐   │  │
│ │ │ Flattened Vector Data (PROPERLY OWNED):            │   │  │
│ │ │  - Type: VARCHAR                                   │   │  │
│ │ │  - Vector Type: FLAT_VECTOR                        │   │  │
│ │ │  - All data materialized and owned                 │   │  │
│ │ └────────────────────────────────────────────────────┘   │  │
│ │                                                          │  │
│ │ ┌────────────────────────────────────────────────────┐   │  │
│ │ │ String Buffers (PROPERLY OWNED & PERSISTED):       │   │  │
│ │ │  ┌──────────────────────────────────────────┐      │   │  │
│ │ │  │ Buffer 0: "hello world..."               │      │   │  │
│ │ │  │  - Properly attached to flattened vector │      │   │  │
│ │ │  │  - Survives after export returns         │      │   │  │
│ │ │  ├──────────────────────────────────────────┤      │   │  │
│ │ │  │ Buffer 1: "another long string"          │      │   │  │
│ │ │  │  - Reference counted                     │      │   │  │
│ │ │  │  - Owned by the vector                   │      │   │  │
│ │ │  ├──────────────────────────────────────────┤      │   │  │
│ │ │  │ Buffer 2: "more string data..."          │      │   │  │
│ │ │  │  - Valid for entire query lifetime       │      │   │  │
│ │ │  └──────────────────────────────────────────┘      │   │  │
│ │ └────────────────────────────────────────────────────┘   │  │
│ └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

[OK] SUCCESS: String buffers properly materialized and will persist!

Code Comparison

How LIST Manages Child Vectors

// list.rs - Working code that doesn't need flatten()
vector.list_vector_reserve(sum_of_list_lens)?;  // <- ALLOCATES a new buffer
let mut elements_vector = vector.list_vector_get_child();  // <- Returns OWNED vector
vector.list_vector_set_size(sum_of_list_lens)?;
self.elements_exporter.export(..., &mut elements_vector)?;
// No flatten() needed - child vector is properly owned!

How ARRAY Manages Child Vectors

// fixed_size_list.rs - Requires flatten() for strings
let mut elements_vector = vector.array_vector_get_child();  // <- Returns a VIEW
self.elements_exporter.export(..., &mut elements_vector)?;
elements_vector.flatten(element_count as u64);  // <- CRITICAL for strings!

Detailed Export Flow Diagrams

LIST with UTF8 Export Flow (Success)


┌─────────────────────────────────────────────────────────────────────┐
│ Step 1: list_vector_reserve(100)                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  LIST Vector                   Heap Memory                          │
│  ┌──────────┐                 ┌─────────────────────┐               │
│  │ child: ──┼────────────────>│ NEW Vector          │               │
│  └──────────┘                 │ capacity: 100       │               │
│                               │ buffer: allocated   │               │
│                               └─────────────────────┘               │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 2: VarBinView exports strings                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Child Vector                  String Buffers                       │
│  ┌──────────────┐             ┌─────────────────┐                   │
│  │ views: [...] │             │ Buffer 1        │                   │
│  │ buffers: ────┼────────────>│ "hello world"   │                   │
│  └──────────────┘             ├─────────────────┤                   │
│                               │ Buffer 2        │                   │
│                               │ "test string"   │                   │
│                               └─────────────────┘                   │
│                               (OWNED by child vector)               │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 3: Export function returns                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  [OK] Child vector still exists (owned by LIST)                     │
│  [OK] String buffers still attached and valid                       │
│  [OK] All references remain valid                                   │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 4: DuckDB queries the data                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  DuckDB: "Give me string at position 0"                             │
│  Vector: views[0] -> Buffer 1 -> "hello world" [OK]                 │
│                                                                     │
│  SUCCESS: All data accessible!                                      │
└─────────────────────────────────────────────────────────────────────┘

ARRAY with UTF8 Export Flow WITHOUT flatten (Failure)

┌─────────────────────────────────────────────────────────────────────┐
│ Step 1: array_vector_get_child()                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ARRAY Vector                                                       │
│  ┌────────────────────────┐                                         │
│  │ struct ArrayVector {   │                                         │
│  │   validity: [...]      │                                         │
│  │   child: Vector {      │ <─ Returns pointer to this              │
│  │     data: [...]        │   (embedded struct member)              │
│  │   }                    │                                         │
│  │ }                      │                                         │
│  └────────────────────────┘                                         │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 2: VarBinView exports strings to child VIEW                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Child View                    Temporary Buffers?                   │
│  ┌──────────────┐             ┌─────────────────┐                   │
│  │ views: [...] │             │ Buffer 1        │                   │
│  │ buffers: ────┼────???────> │ "hello world"   │                   │
│  └──────────────┘             └─────────────────┘                   │
│         ^                                                           │
│         │                     [WARNING] Attached to a VIEW!         │
│    Just a view!               Ownership unclear!                    │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 3: Export function returns                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  [X] Child "vector" was just a view - not independently owned       │
│  [X] String buffer references may be freed                          │
│  [X] Pointers now point to invalid memory!                          │
│                                                                     │
│  Stack frame destroyed:                                             │
│  ┌──────────────┐             ┌─────────────────┐                   │
│  │ views: [...] │             │ FREED MEMORY    │                   │
│  │ buffers: ────┼────────────>│ ????????????    │                   │
│  └──────────────┘             └─────────────────┘                   │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 4: DuckDB queries the data                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  DuckDB: "Give me string at position 0"                             │
│  Vector: views[0] -> 0x9b002000 -> ??????                           │
│                         ^                                           │
│                         │                                           │
│                    INVALID POINTER!                                 │
│                                                                     │
│  [CRASH] SEGMENTATION FAULT!                                        │
└─────────────────────────────────────────────────────────────────────┘

ARRAY with UTF8 Export Flow WITH flatten (Success)

┌─────────────────────────────────────────────────────────────────────┐
│ Step 1 & 2: Get child and export (same as above)                    │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 3: flatten(element_count)                                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  flatten() operation:                                               │
│                                                                     │
│  1. Create new FLAT_VECTOR:                                         │
│     ┌────────────────────┐                                          │
│     │ type: FLAT_VECTOR  │                                          │
│     │ buffer: NEW ALLOC  │                                          │
│     └────────────────────┘                                          │
│                                                                     │
│  2. Copy all data from view -> new buffer                           │
│     [view data] ──────copy──────> [new buffer]                      │
│                                                                     │
│  3. Re-attach string buffers to new vector:                         │
│     ┌────────────────────┐      ┌─────────────────┐                 │
│     │ new vector         │      │ String Buffer 1 │                 │
│     │ buffers: ──────────┼────> │ "hello world"   │                 │
│     └────────────────────┘      └─────────────────┘                 │
│                                  (Now properly owned!)              │
│                                                                     │
│  4. Update parent's child to point to flattened vector              │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 4: Export function returns                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  [OK] Flattened data persists in parent's structure                 │
│  [OK] String buffers properly attached and owned                    │
│  [OK] All references remain valid                                   │
└─────────────────────────────────────────────────────────────────────┘
                                        │
                                        v
┌─────────────────────────────────────────────────────────────────────┐
│ Step 5: DuckDB queries the data                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  DuckDB: "Give me string at position 0"                             │
│  Vector: views[0] -> Buffer 1 -> "hello world" [OK]                 │
│                                                                     │
│  SUCCESS: All data accessible!                                      │
└─────────────────────────────────────────────────────────────────────┘

Why Only UTF8 Strings Are Affected

Primitive Types (Work Fine)


ARRAY Vector with i32 elements:
┌──────────────────────────────┐
│ Child (embedded):            │
│  ┌────────────────────────┐  │
│  │ i32 data: [1,2,3,4,5]  │  │ <- Data stored INLINE
│  └────────────────────────┘  │   No external buffers needed!
└──────────────────────────────┘   View model works fine

UTF8/String Types (Require flatten())

ARRAY Vector with UTF8 elements:
┌──────────────────────────────┐
│ Child (embedded):            │
│  ┌────────────────────────┐  │
│  │ string_t views:        │  │
│  │  [0]: ptr -> ???       │  │ <- Needs external buffer!
│  │  [1]: ptr -> ???       │  │ <- Buffer ownership unclear
│  └────────────────────────┘  │
└──────────────────────────────┘
         [WARNING] Without flatten(), buffer pointers invalid!

What flatten() Does

The flatten() operation performs these critical steps:

  1. Checks vector type: Determines if flattening is needed
  2. Allocates new buffer: Creates properly-sized flat vector
  3. Copies data: Transfers all data to new buffer
  4. Handles string buffers:
    • For VARCHAR types, ensures all string buffers are attached
    • For FSST compressed strings, decompresses them
    • Maintains reference counting for buffer ownership
  5. Updates vector type: Marks as FLAT_VECTOR
  6. Ensures persistence: Data now owned and will survive

Implementation Difference (Conceptual C++)

LIST's Child Vector Management

// Simplified conceptual implementation
class ListVector {
    Vector* child = nullptr;  // Heap-allocated child

    void Reserve(size_t size) {
        // ALLOCATES a new child vector
        this->child = new Vector(VARCHAR, size);
        this->child->buffer = allocate_buffer(size * sizeof(string_t));
    }

    Vector* GetChild() {
        return this->child;  // Returns OWNED pointer
    }

    ~ListVector() {
        delete this->child;  // Properly cleans up
    }
};

ARRAY's Child Vector Management

// Simplified conceptual implementation
class ArrayVector {
    Vector child;  // EMBEDDED member, not a pointer!
    size_t array_size;

    ArrayVector(size_t array_size, size_t count) {
        this->array_size = array_size;
        // Child is pre-initialized as part of struct
        this->child.data = this->internal_buffer;
        this->child.size = STANDARD_VECTOR_SIZE * array_size;
    }

    Vector* GetChild() {
        return &this->child;  // Returns ADDRESS of member
    }
    // No explicit cleanup needed - child is part of struct
};

Evidence from Testing

The failing tests demonstrate this issue clearly:

Tests That FAIL Without flatten()

  1. test_vortex_scan_nested_fixed_size_list_utf8:

    Structure: FSL[FSL[UTF8]]
    Result: SEGFAULT without flatten()
    Reason: Nested string buffers not properly materialized
    
  2. test_vortex_scan_ultra_deep_nesting:

    Structure: FSL[List[FSL[List[FSL[UTF8]]]]]
    Result: SEGFAULT without flatten()
    Reason: Deep nesting with strings at leaf level
    

Tests That WORK Without flatten()

  1. Nested integers:

    Structure: FSL[FSL[i32]]
    Result: Works fine
    Reason: No external buffers needed for i32
    
  2. Variable lists with strings:

    Structure: List[UTF8]
    Result: Works fine
    Reason: LIST allocates owned child vector
    
  3. Any depth with primitives:

    Structure: FSL[List[FSL[List[FSL[i64]]]]]
    Result: Works fine
    Reason: Primitive types stored inline
    

Why DuckDB Made This Design Choice

DuckDB's design reflects different optimization priorities:

LIST (Variable-Size) Design Goals

  • Flexibility: Different numbers of elements per row
  • Dynamic allocation: Allocates exactly what's needed
  • General purpose: Works for all scenarios
  • Trade-off: Slightly more overhead for management

ARRAY (Fixed-Size) Design Goals

  • Performance: No offset/length tracking needed
  • Cache efficiency: Predictable memory access patterns
  • SIMD optimization: Fixed strides enable vectorization
  • Direct indexing: element = base + (index * array_size)
  • Trade-off: Requires explicit materialization for complex types

Performance Characteristics

LIST Access Pattern

To access element at list[i][j]:
1. Read list_entry[i] -> {offset: 10, length: 5}
2. Calculate: child_index = offset + j = 10 + j
3. Access: child_vector[child_index]
(Requires offset lookup + addition)

ARRAY Access Pattern

To access element at array[i][j]:
1. Calculate: child_index = i * array_size + j
2. Access: child_vector[child_index]
(Direct calculation - no lookup needed!)

Why It's Required

  • ARRAY's child vector is an embedded view, not an owned allocation
  • String data requires external buffers that must be properly owned
  • Without flatten(), string buffer references become invalid after export
  • DuckDB crashes with segfault when accessing invalid string pointers

Why It's Not Needed for LIST

  • LIST allocates and owns its child vector via list_vector_reserve()
  • String buffers attached to owned vectors persist properly
  • The ownership chain is clear and maintained

Why It's Not Needed for Primitives

  • Primitive types store data inline in the vector's buffer
  • No external buffers or pointers involved
  • The view/embedded model works fine for inline data

@connortsui20 connortsui20 force-pushed the ct/duckdb-fsl-export-fix branch 2 times, most recently from 6f92313 to 3f23155 Compare September 12, 2025 20:17
@danking
Copy link
Contributor

danking commented Sep 12, 2025

I don't think I believe the LLM. The docs seem pretty clear that the child vector lives as long as the parent vector.

duckdb_array_vector_get_child

Retrieves the child vector of an array vector.

The resulting vector is valid as long as the parent vector is valid. The resulting vector has the size of the parent vector multiplied by the array size.

Moreover, when we export the elements (the strings), the varbinview exporter explicitly increments assigns shared ownership for those buffers to the (child) vector here.

I don't know what "VIEW" means in this context, I don't see that terminology in the duckdb docs at all.

@connortsui20
Copy link
Contributor Author

connortsui20 commented Sep 13, 2025

The docs seem pretty clear that the child vector lives as long as the parent vector.

To be clear, the issue that the LLM came up with was that the child vector does not own the string buffers (what I was referring to as the view) when they get added. Note that it is correct when it says duckdb_array_vector_get_child does not allocate (see the duckdb code here).

So yeah I think you are right, the issue comes from something related to the Utf8 "views" (or ownership/borrowing of the string buffers). Just so I can copy and paste this later, the add_string_buffer function call goes down to StringVector::AddBuffer.

I think on slack you said there might be an issue with the dictionaries? Maybe something here is unexpected for us?

/* In duckdb-1.3.2/src/include/duckdb/common/types.hpp */

template <class T, typename... ARGS>
buffer_ptr<T> make_buffer(ARGS &&...args) { // NOLINT: mimic std casing
    // I'm assuming this is the same as the constructor for `shared_ptr`.
	return make_shared_ptr<T>(std::forward<ARGS>(args)...);
}

<-- snip -->

/* In duckdb-1.3.2/src/common/types/vector.cpp */

// Returns a shared pointer to a string buffer (I think).
VectorStringBuffer &StringVector::GetStringBuffer(Vector &vector) {
	if (vector.GetType().InternalType() != PhysicalType::VARCHAR) {
		throw InternalException("StringVector::GetStringBuffer - vector is not of internal type VARCHAR but of type %s",
		                        vector.GetType());
	}
	if (!vector.auxiliary) {
		vector.auxiliary = make_buffer<VectorStringBuffer>();
	}
	D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::STRING_BUFFER);
	return vector.auxiliary.get()->Cast<VectorStringBuffer>();
}

<-- snip -->

void StringVector::AddBuffer(Vector &vector, buffer_ptr<VectorBuffer> buffer) {
	D_ASSERT(buffer.get() != vector.auxiliary.get());
	auto &string_buffer = GetStringBuffer(vector);
	string_buffer.AddHeapReference(std::move(buffer));
}

void StringVector::AddHeapReference(Vector &vector, Vector &other) {
	D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);
	D_ASSERT(other.GetType().InternalType() == PhysicalType::VARCHAR);

	if (other.GetVectorType() == VectorType::DICTIONARY_VECTOR) {
        // For dictionaries, we add a heap reference to the dictionary values.
        // Don't really know where the codes are added (no call to `DictionaryVector::SelVector`)...
		StringVector::AddHeapReference(vector, DictionaryVector::Child(other));
		return;
	}
	if (!other.auxiliary) {
		return;
	}
	StringVector::AddBuffer(vector, other.auxiliary);
}

At the very least this is probably enough to figure out what might be wrong? Though I think I don't know enough about the duckdb part of our codebase to solve this.

cc @danking and @0ax1 since I saw you looked at ownership here a few months back

@0ax1
Copy link
Contributor

0ax1 commented Sep 16, 2025

// Returns a shared pointer to a string buffer (I think).
VectorStringBuffer &StringVector::GetStringBuffer(Vector &vector) {

This does not copy the shared pointer, but hands out a ref to the shared pointer. The retain count of the shared pointer is therefore not +1 after calling this function

@connortsui20 connortsui20 force-pushed the ct/duckdb-fsl-export-fix branch from 3f23155 to 48774ae Compare September 17, 2025 15:09
@connortsui20 connortsui20 force-pushed the ct/duckdb-fsl-export-fix branch from 48774ae to 95e2754 Compare September 18, 2025 18:55
Also adds some regression tests

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20 connortsui20 force-pushed the ct/duckdb-fsl-export-fix branch from f022387 to cad70b1 Compare September 18, 2025 18:59
@connortsui20 connortsui20 enabled auto-merge (squash) September 18, 2025 19:00
@connortsui20 connortsui20 merged commit 20dd173 into develop Sep 18, 2025
37 checks passed
@connortsui20 connortsui20 deleted the ct/duckdb-fsl-export-fix branch September 18, 2025 19:09
@0ax1
Copy link
Contributor

0ax1 commented Sep 18, 2025

We gotta follow up here on test_vortex_scan_ultra_deep_nesting which fires an assert in DuckDB when built in debug: https://github.com/duckdb/duckdb/blob/464ae747bcf33971679ae80bcce481adf34902c3/src/common/types/vector.cpp#L1802

I'll land the PR to opt into DuckDB debug builds before. https://github.com/vortex-data/vortex/compare/ad/duckdb-debug

@joseph-isaacs
Copy link
Contributor

@connortsui20 Why didn't you canoncalize the elements vector once in the exporter ctor?

@connortsui20
Copy link
Contributor Author

@joseph-isaacs Shouldn't the new_array_exporter function take care of that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants