Issue Summary
Hi all, I have recently been working on adding Lance file format support within Apache Hudi (apache/hudi#14127) and wanted to raise a potential issue (hopefully my understanding of this feature is correct).
During the integration, I wanted to test storing binary content and leverage the blob encoding feature that Lance documents here: https://lancedb.github.io/lance/guide/blob/?h=blob, to avoid materializing these blobs on each read and instead get back a struct of the position and size.
However, during my testing with the LanceFileReader via the Java SDK, I found that the blob contents are always materialized, despite passing the property "lance-encoding:blob": "true" during the initial write with the LanceFileWriter.
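For reference, my reading of the docs is that a blob column should surface on read as a descriptor struct rather than the raw bytes. In Arrow Java terms I was expecting something like the following field on the read side (imports as in the repro below; the child names "position" and "size" are my assumption from the documentation, not confirmed against the Java SDK):

// Expected read-side type: struct<position: uint64, size: uint64>
Field position = new Field("position",
    FieldType.notNullable(new ArrowType.Int(64, false)), null);
Field size = new Field("size",
    FieldType.notNullable(new ArrowType.Int(64, false)), null);
Field expectedBlobField = new Field("blob_data",
    FieldType.nullable(ArrowType.Struct.INSTANCE),
    Arrays.asList(position, size));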
To isolate the issue to Lance alone, I wrote the following unit test, which uses only the LanceFileWriter and LanceFileReader to reproduce the behavior. It should be easy to copy into an IDE and run; I have attached the results as well.
Reproduction
package org.apache.hudi.io.storage;

import com.lancedb.lance.file.LanceFileReader;
import com.lancedb.lance.file.LanceFileWriter;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.LargeVarBinaryVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

import java.nio.file.Path;
import java.util.*;

import static org.junit.jupiter.api.Assertions.*;

public class BlobEncodingTest {

  @Test
  public void testBlobEncodingReturnsDescriptors(@TempDir Path tempDir) throws Exception {
    String filePath = tempDir.resolve("test_blob.lance").toString();
    BufferAllocator allocator = new RootAllocator();

    // Step 1: Write blob-encoded data
    Map<String, String> blobMetadata = new HashMap<>();
    blobMetadata.put("lance-encoding:blob", "true");
    Field blobField = new Field(
        "blob_data",
        new FieldType(true, ArrowType.LargeBinary.INSTANCE, null, blobMetadata),
        Collections.emptyList()
    );
    Schema schema = new Schema(Collections.singletonList(blobField), null);

    try (LanceFileWriter writer = LanceFileWriter.open(
        filePath,
        allocator,
        null,
        Collections.emptyMap()
    )) {
      try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
        root.allocateNew();
        LargeVarBinaryVector blobVector =
            (LargeVarBinaryVector) root.getVector("blob_data");
        // Write 5 blobs
        for (int i = 0; i < 5; i++) {
          byte[] data = new byte[100 * (i + 1)]; // Different sizes
          Arrays.fill(data, (byte) i);
          blobVector.setSafe(i, data);
        }
        root.setRowCount(5);
        writer.write(root);
      }
    }

    // Step 2: Read back and verify
    try (LanceFileReader reader = LanceFileReader.open(filePath, allocator)) {
      // Check schema
      Schema readSchema = reader.schema();
      Field readField = readSchema.getFields().get(0);
      System.out.println("=== SCHEMA VERIFICATION ===");
      System.out.println("Field name: " + readField.getName());
      System.out.println("Field type: " + readField.getType());
      System.out.println("Field metadata: " + readField.getMetadata());

      // Check if blob metadata is preserved
      assertTrue(readField.getMetadata().containsKey("lance-encoding:blob"),
          "Blob metadata should be preserved in schema");

      // Read batch
      try (ArrowReader batch = reader.readAll(null, null, 10)) {
        batch.loadNextBatch(); // Actually load the data
        VectorSchemaRoot root = batch.getVectorSchemaRoot();
        System.out.println("\n=== READ BATCH VERIFICATION ===");
        System.out.println("Batch schema: " + root.getSchema());
        System.out.println("Row count: " + root.getRowCount());

        // Get the blob column
        org.apache.arrow.vector.FieldVector column = root.getVector("blob_data");
        System.out.println("Column type: " + column.getField().getType());

        // Check if it's a struct with position and size (means the blob encoding happened)
        if (column.getField().getType() instanceof ArrowType.Struct) {
          System.out.println("SUCCESS: Returned DESCRIPTORS (struct type)");
          System.out.println("Struct has fields: " + column.getField().getChildren());
          // The struct should have 'position' and 'size' fields
          assertEquals(2, column.getField().getChildren().size(),
              "Struct should have 2 fields (position and size)");
        } else if (column.getField().getType() instanceof ArrowType.LargeBinary) {
          System.out.println("ISSUE: Returned MATERIALIZED BYTES (binary type)");
          System.out.println("This means blobs were fetched from external buffers");

          // This is what currently happens - Java materializes
          LargeVarBinaryVector binaryVector = (LargeVarBinaryVector) column;
          System.out.println("\n=== ACTUAL DATA (Materialized) ===");
          for (int i = 0; i < Math.min(5, root.getRowCount()); i++) {
            byte[] data = binaryVector.get(i);
            System.out.println("Row " + i + ": " + data.length + " bytes");
          }

          // Fail the test to demonstrate the issue
          fail("Java LanceFileReader materializes blobs instead of returning descriptors. " +
              "Expected struct<position: uint64, size: uint64> but got " +
              column.getField().getType());
        } else {
          fail("Unexpected type: " + column.getField().getType());
        }
      }
    }
    allocator.close();
  }
}
Results of Repro
With the current Java SDK the test takes the LargeBinary branch above: the column comes back fully materialized and the test fails with the "materializes blobs instead of returning descriptors" message (see attached output).
Questions
- Is blob encoding a feature that is only supported at the Lance table format level, or should it ideally work with the Lance file format as well?
- If the file format does support blob encoding, does this issue also occur with the Rust or Python SDKs?
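For context, if the Java reader did hand back descriptors, the Hudi integration would consume them roughly like this, recording byte ranges instead of materializing the payload. This is only a sketch: it assumes the struct children are named "position" and "size" and come back as uint64, which maps to UInt8Vector in Arrow Java.

import org.apache.arrow.vector.UInt8Vector;
import org.apache.arrow.vector.complex.StructVector;

// 'root' is the VectorSchemaRoot from the read batch in the repro above,
// assuming the column came back as struct<position: uint64, size: uint64>
StructVector descriptors = (StructVector) root.getVector("blob_data");
UInt8Vector positions = (UInt8Vector) descriptors.getChild("position");
UInt8Vector sizes = (UInt8Vector) descriptors.getChild("size");
for (int i = 0; i < root.getRowCount(); i++) {
  long position = positions.get(i); // byte offset of the blob within the file
  long size = sizes.get(i);         // blob length in bytes
  System.out.println("Row " + i + ": blob at offset " + position + ", " + size + " bytes");
}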