Add columnar data access for memory-efficient row processing #975
    }

    /** Interface for accessing column values by index without materializing the entire column. */
    private interface ColumnAccessor {
Use separate files for the interface and the implementation.
    if (column.isSetStringVal()) return column.getStringVal().getValuesSize();

    throw new DatabricksSQLException(
        "Unsupported column type: " + column, DatabricksDriverErrorCode.UNSUPPORTED_OPERATION);
What about complex data types? Will they also be covered by the primitive types above?
We only support these:
databricks-jdbc/src/main/java/com/databricks/jdbc/common/util/DatabricksThriftUtil.java, line 230 in 970c4c8:

    private static List<?> getColumnValues(TColumn column) throws DatabricksSQLException {
     * out of bounds
     */
    @Override
    public Object getObject(int columnIndex) throws DatabricksSQLException {
Will this work out of the box? ColumnAccessor returns primitive types, but here we can have complex types as well. Will the conversion happen implicitly?
There is a binary type as well. I added more details in this comment: #975 (comment).
Pull Request Overview
This PR introduces a memory-efficient columnar data access mechanism for JDBC result processing. Instead of materializing entire result sets into List<List<Object>> structures, it provides direct access to columnar data through a new ColumnarRowView class, resulting in significant memory reduction (up to 91% in testing) and improved CPU performance.
- Introduces `ColumnarRowView` class for memory-efficient row-by-row data access
- Updates `LazyThriftResult` to use columnar views instead of materialized row lists
- Adds utility method in `DatabricksThriftUtil` for creating columnar views
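To make the idea concrete, here is a minimal sketch of a row view over column-major data. It is not the PR's `ColumnarRowView`: the class name `ColumnarViewSketch`, its constructor, and the use of plain `List` and `BitSet` in place of Thrift column objects are all assumptions made for illustration. Only the `getValue(row, col)` shape comes from the PR text.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch: a read-only row view over column-major data. */
final class ColumnarViewSketch {
  private final List<List<?>> columns;  // one value list per column
  private final List<BitSet> nullMasks; // per-column null bits; an entry may be null
  private final int rowCount;

  ColumnarViewSketch(List<List<?>> columns, List<BitSet> nullMasks) {
    if (columns.size() != nullMasks.size()) {
      throw new IllegalArgumentException("one null mask slot is required per column");
    }
    this.columns = columns;
    this.nullMasks = nullMasks;
    // Assumption: every column holds the same number of values.
    this.rowCount = columns.isEmpty() ? 0 : columns.get(0).size();
  }

  int getRowCount() {
    return rowCount;
  }

  int getColumnCount() {
    return columns.size();
  }

  /** Reads a single cell directly from its column; no row object is created. */
  Object getValue(int row, int col) {
    if (row < 0 || row >= rowCount) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    BitSet mask = nullMasks.get(col); // throws if col is out of range
    if (mask != null && mask.get(row)) {
      return null;
    }
    return columns.get(col).get(row);
  }

  public static void main(String[] args) {
    BitSet nameNulls = new BitSet();
    nameNulls.set(1); // second row of the second column is NULL
    List<List<?>> columns = List.of(List.of(1L, 2L, 3L), List.of("a", "", "c"));
    ColumnarViewSketch view = new ColumnarViewSketch(columns, Arrays.asList(null, nameNulls));
    for (int r = 0; r < view.getRowCount(); r++) {
      System.out.println(view.getValue(r, 0) + ", " + view.getValue(r, 1));
    }
  }
}
```

The view holds references to the existing column lists, so the only per-batch allocation is the view object itself.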
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ColumnarRowView.java | New class providing direct columnar access with getValue(row, col) method and null handling |
| LazyThriftResult.java | Refactored to use ColumnarRowView instead of List<List<Object>> for batch processing |
| DatabricksThriftUtil.java | Added createColumnarView() utility method as memory-efficient alternative |
| ColumnarRowViewTest.java | Comprehensive test coverage for all column types, null handling, and boundary conditions |
    private final BitSet nullBits;

    public TypedColumnAccessor(List<T> values, byte[] nulls) {
      this.values = values;
      this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
    }

    @Override
    public Object getValue(int rowIndex) {
      if (nullBits != null && nullBits.get(rowIndex)) {
        return null;
Copilot AI, Sep 11, 2025
Creating a new BitSet for every column accessor could be memory-intensive for large datasets. Consider lazy initialization or caching the BitSet creation to improve memory efficiency.
Suggested change:

    - private final BitSet nullBits;
    - public TypedColumnAccessor(List<T> values, byte[] nulls) {
    -   this.values = values;
    -   this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
    - }
    - @Override
    - public Object getValue(int rowIndex) {
    -   if (nullBits != null && nullBits.get(rowIndex)) {
    -     return null;
    + private final byte[] nulls;
    + private BitSet nullBits;
    + public TypedColumnAccessor(List<T> values, byte[] nulls) {
    +   this.values = values;
    +   this.nulls = nulls;
    +   this.nullBits = null; // Lazy initialization
    + }
    + @Override
    + public Object getValue(int rowIndex) {
    +   if (nulls != null) {
    +     if (nullBits == null) {
    +       nullBits = BitSet.valueOf(nulls);
    +     }
    +     if (nullBits.get(rowIndex)) {
    +       return null;
    +     }
Nice suggestion. Thanks. Will implement in subsequent PR.
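For reference, a self-contained sketch of the lazy null-bitmap idea discussed above might look like the following. The class name `LazyNullAccessorSketch` and the static `isNull` helper are hypothetical and not part of the PR; the bit layout relies on the documented behaviour of `BitSet.valueOf(byte[])`, where bit `n` lives in `bytes[n / 8]` at position `n % 8`. The lazy field is not synchronized, so this sketch assumes single-threaded access.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of a column accessor that builds its null BitSet on first use. */
final class LazyNullAccessorSketch<T> {
  private final List<T> values;
  private final byte[] nulls; // raw null bitmap; may be null or empty
  private BitSet nullBits;    // created lazily; not thread-safe

  LazyNullAccessorSketch(List<T> values, byte[] nulls) {
    this.values = values;
    this.nulls = nulls;
  }

  T getValue(int rowIndex) {
    if (nulls != null && nulls.length > 0) {
      if (nullBits == null) {
        nullBits = BitSet.valueOf(nulls); // built once, on the first read
      }
      if (nullBits.get(rowIndex)) {
        return null;
      }
    }
    return values.get(rowIndex);
  }

  /**
   * Alternative that never allocates a BitSet: tests the bit in the raw bitmap.
   * Uses the same little-endian bit order as BitSet.valueOf(byte[]).
   */
  static boolean isNull(byte[] nulls, int rowIndex) {
    if (nulls == null || rowIndex < 0) {
      return false;
    }
    int byteIndex = rowIndex / 8;
    return byteIndex < nulls.length && (nulls[byteIndex] & (1 << (rowIndex % 8))) != 0;
  }
}
```

Whether lazy construction pays off depends on how often a column is actually read; the direct bit test avoids the extra object entirely at the cost of a small amount of arithmetic per access.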
Description
This PR also contains the changes from PR #966.
Introduce ColumnarRowView to provide direct access to columnar data without
materialising entire result sets into row objects. This reduces memory
allocations by allowing individual cell access via `getValue(row, col)`
instead of creating `List<List<Object>>` structures.

Key changes:
This optimization maintains API compatibility while significantly reducing
memory overhead for large result sets.
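The difference between the two approaches can be sketched as follows. This is an illustration only: `BatchCursorSketch`, its zero-based `getObject(int)`, and `materializeRows` are hypothetical names, and plain lists stand in for the driver's Thrift columns. The cursor keeps a single row index and reads cells on demand, while `materializeRows` shows the kind of per-row copying that a `List<List<Object>>` requires.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: iterating a columnar batch without building row objects. */
final class BatchCursorSketch {
  private final List<List<?>> columns;
  private final int rowCount;
  private int currentRow = -1; // positioned before the first row

  BatchCursorSketch(List<List<?>> columns) {
    this.columns = columns;
    this.rowCount = columns.isEmpty() ? 0 : columns.get(0).size();
  }

  /** Advances to the next row; returns false once the batch is exhausted. */
  boolean next() {
    if (currentRow + 1 >= rowCount) {
      return false;
    }
    currentRow++;
    return true;
  }

  /** Reads one cell of the current row (zero-based column index in this sketch). */
  Object getObject(int columnIndex) {
    if (currentRow < 0) {
      throw new IllegalStateException("call next() before reading");
    }
    if (columnIndex < 0 || columnIndex >= columns.size()) {
      throw new IndexOutOfBoundsException("column " + columnIndex);
    }
    return columns.get(columnIndex).get(currentRow);
  }

  /** For contrast: copies every cell into a new list per row. */
  static List<List<Object>> materializeRows(List<List<?>> columns) {
    int rows = columns.isEmpty() ? 0 : columns.get(0).size();
    List<List<Object>> out = new ArrayList<>(rows);
    for (int r = 0; r < rows; r++) {
      List<Object> row = new ArrayList<>(columns.size());
      for (List<?> column : columns) {
        row.add(column.get(r));
      }
      out.add(row);
    }
    return out;
  }
}
```

With the cursor, the number of objects allocated per batch stays constant regardless of row count, whereas materialization allocates one list per row plus its backing array.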
Following the changes introduced in PR #966, the following improvements were
observed during a test that executes a SQL query retrieving 5 million rows:
Current heap usage over time:

Improved heap usage over time:

Testing
Additional Notes to the Reviewer