Blob Encoding not working with Lance Java SDK #5167

@rahil-c

Issue Summary

Hi all, I have recently been working on adding Lance file format support to Apache Hudi (apache/hudi#14127) and wanted to raise a potential issue (hopefully my understanding of this feature is correct).

During the integration, I wanted to test storing binary content with the blob encoding feature that Lance documents here: https://lancedb.github.io/lance/guide/blob/?h=blob, so that reads return a struct of position and size instead of materializing the blob bytes on every read.
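
To make the expectation concrete: for a blob-encoded column I expected the reader to return descriptors rather than the bytes themselves. A minimal sketch of the Arrow schema I was expecting back (the position/size field names are my reading of the docs; imports are the same as in the test below):

Field position = new Field("position",
        FieldType.notNullable(new ArrowType.Int(64, /* isSigned */ false)), null);
Field size = new Field("size",
        FieldType.notNullable(new ArrowType.Int(64, /* isSigned */ false)), null);
// i.e., blob_data read back as struct<position: uint64, size: uint64>
Field expectedDescriptor = new Field("blob_data",
        FieldType.nullable(ArrowType.Struct.INSTANCE),
        Arrays.asList(position, size));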

However, during my testing with LanceFileReader via the Java SDK, I found that the blob contents are always materialized, despite setting the field metadata "lance-encoding:blob" = "true" during the initial write with LanceFileWriter.

To isolate the issue to Lance alone, the following unit test reproduces the behavior using only LanceFileWriter and LanceFileReader. It should be easy to copy into an IDE and run; I have attached the results as well.
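
In case it helps anyone trying to reproduce: I believe the Java SDK is published to Maven Central as com.lancedb:lance-core (plus the matching Apache Arrow Java artifacts for the arrow-* imports), so a dependency along these lines should be enough to run the test:

<dependency>
    <groupId>com.lancedb</groupId>
    <artifactId>lance-core</artifactId>
    <!-- version elided; use whatever Lance version you are on -->
    <version>...</version>
</dependency>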

Reproduction

package org.apache.hudi.io.storage;

import com.lancedb.lance.file.LanceFileReader;
import com.lancedb.lance.file.LanceFileWriter;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.LargeVarBinaryVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

import java.nio.file.Path;
import java.util.*;

import static org.junit.jupiter.api.Assertions.*;

public class BlobEncodingTest {

    @Test
    public void testBlobEncodingReturnsDescriptors(@TempDir Path tempDir) throws Exception {
        String filePath = tempDir.resolve("test_blob.lance").toString();
        BufferAllocator allocator = new RootAllocator();

        // Step 1: Write blob-encoded data
        Map<String, String> blobMetadata = new HashMap<>();
        blobMetadata.put("lance-encoding:blob", "true");

        Field blobField = new Field(
                "blob_data",
                new FieldType(true, ArrowType.LargeBinary.INSTANCE, null, blobMetadata),
                Collections.emptyList()
        );

        Schema schema = new Schema(Collections.singletonList(blobField), null);

        try (LanceFileWriter writer = LanceFileWriter.open(
                filePath,
                allocator,
                null,
                Collections.emptyMap()
        )) {
            try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
                root.allocateNew();

                LargeVarBinaryVector blobVector =
                        (LargeVarBinaryVector) root.getVector("blob_data");

                // Write 5 blobs
                for (int i = 0; i < 5; i++) {
                    byte[] data = new byte[100 * (i + 1)]; // Different sizes
                    Arrays.fill(data, (byte) i);
                    blobVector.setSafe(i, data);
                }

                root.setRowCount(5);
                writer.write(root);
            }
        }

        // Step 2: Read back and verify
        try (LanceFileReader reader = LanceFileReader.open(filePath, allocator)) {

            // Check schema
            Schema readSchema = reader.schema();
            Field readField = readSchema.getFields().get(0);

            System.out.println("=== SCHEMA VERIFICATION ===");
            System.out.println("Field name: " + readField.getName());
            System.out.println("Field type: " + readField.getType());
            System.out.println("Field metadata: " + readField.getMetadata());

            // Check if blob metadata is preserved
            assertTrue(readField.getMetadata().containsKey("lance-encoding:blob"),
                    "Blob metadata should be preserved in schema");

            // Read the data back (batch size 10 covers all 5 rows in one batch)
            try (ArrowReader batch = reader.readAll(null, null, 10)) {
                batch.loadNextBatch();  // Actually load the data
                VectorSchemaRoot root = batch.getVectorSchemaRoot();

                System.out.println("\n=== READ BATCH VERIFICATION ===");
                System.out.println("Batch schema: " + root.getSchema());
                System.out.println("Row count: " + root.getRowCount());

                // Get the blob column
                FieldVector column = root.getVector("blob_data");
                System.out.println("Column type: " + column.getField().getType());

                // Check whether the column came back as a struct of position and size (i.e., blob descriptors)
                if (column.getField().getType() instanceof ArrowType.Struct) {
                    System.out.println("SUCCESS: Returned DESCRIPTORS (struct type)");
                    System.out.println("Struct has fields: " + column.getField().getChildren());

                    // The struct should have 'position' and 'size' fields
                    assertEquals(2, column.getField().getChildren().size(),
                            "Struct should have 2 fields (position and size)");

                } else if (column.getField().getType() instanceof ArrowType.LargeBinary) {
                    System.out.println("ISSUE: Returned MATERIALIZED BYTES (binary type)");
                    System.out.println("This means blobs were fetched from external buffers");

                    // This is what currently happens: the Java reader materializes the bytes
                    LargeVarBinaryVector binaryVector = (LargeVarBinaryVector) column;

                    System.out.println("\n=== ACTUAL DATA (Materialized) ===");
                    for (int i = 0; i < Math.min(5, root.getRowCount()); i++) {
                        byte[] data = binaryVector.get(i);
                        System.out.println("Row " + i + ": " + data.length + " bytes");
                    }

                    // Fail the test to demonstrate the issue
                    fail("Java LanceFileReader materializes blobs instead of returning descriptors. " +
                            "Expected struct<position: uint64, size: uint64> but got " +
                            column.getField().getType());

                } else {
                    fail("Unexpected type: " + column.getField().getType());
                }
            }
        }
        allocator.close();
    }
}
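
To run just this test class in a standard Maven setup (using Surefire's -Dtest filter; adjust accordingly for Gradle):

mvn test -Dtest=BlobEncodingTest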

Results of Repro

(Screenshot attached: the run hits the failure branch above, with the blob column read back as LargeBinary, i.e. materialized bytes.)

Questions

  • Is blob encoding only supported at the Lance table format level, or should it ideally also work at the Lance file format level? (See the snippet after this list for how the writer is being signaled.)
  • If the file format does support blob encoding, does this same issue occur with the Rust or Python SDKs?
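
For context on the first question: the only blob-related signal the file writer ever receives in my setup is the field-level metadata below (identical to the repro above), so if blob encoding is purely a table-format-level feature, the file writer is presumably ignoring it silently:

Map<String, String> blobMetadata = new HashMap<>();
blobMetadata.put("lance-encoding:blob", "true");

// The metadata is attached to the Arrow field itself, not passed as a writer option
Field blobField = new Field("blob_data",
        new FieldType(true, ArrowType.LargeBinary.INSTANCE, null, blobMetadata),
        Collections.emptyList());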
