Skip to content

Conversation

@jayantsing-db
Copy link
Collaborator

Description

This PR contains changes from the PR #966 as well.

Introduce ColumnarRowView to provide direct access to columnar data without
materialising entire result sets into row objects. This reduces memory
allocations by allowing individual cell access via getValue(row, col)
instead of creating List<List<Object>> structures.

Key changes:

  • New ColumnarRowView class with direct columnar access methods.
  • Updated LazyThriftResult to use columnar view instead of materialized rows.
  • Added utility method in DatabricksThriftUtil for creating columnar views.
  • Comprehensive test coverage for all column types and null handling.

This optimization maintains API compatibility while significantly reducing
memory overhead for large result sets.

Following the changes introduced in PR #966, the following improvements were
observed during a test that executes a SQL query retrieving 5 million rows:

Current heap usage over time:
image

Improved heap usage over time:
image

  • The first image shows multiple significant CPU usage spikes reaching around 15%, while the second image shows consistently flat CPU usage at approximately 3%.
  • The erratic CPU behavior in the "before" state has been completely smoothed out.
  • CPU usage is now consistent and predictable rather than volatile.
  • Memory usage dropped from a peak of 8440 MB down to just 745 MB - that's roughly an 91% decrease.
  • The large memory spike and sustained high usage in the first image has been completely resolved.
  • Memory usage is now consistently low and stable throughout the monitoring period.
  • There is no negative effect on execution time; in fact, it appears to improve, likely due to more active GC pauses in the previous state.

Testing

  • Unit tests
  • e2e tests
  • FakeService tests

Additional Notes to the Reviewer

}

/** Interface for accessing column values by index without materializing the entire column. */
private interface ColumnAccessor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use separate files for interface and impl

if (column.isSetStringVal()) return column.getStringVal().getValuesSize();

throw new DatabricksSQLException(
"Unsupported column type: " + column, DatabricksDriverErrorCode.UNSUPPORTED_OPERATION);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about complex datatypes? Will they also be covered in above primitive types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only support these

private static List<?> getColumnValues(TColumn column) throws DatabricksSQLException {
columns. There is nothing new added or removed in these changes. If complex types come as binary (which i think is the case), complex types are supported. Otherwise, not and this is the current behaviour too.

* out of bounds
*/
@Override
public Object getObject(int columnIndex) throws DatabricksSQLException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this work out of box? You return primitive types from ColumnAccessor, and here we can have complex types as well. Will the conversion happen implicitly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a binary type as well. Added more details #975 (comment) in this comment.

@jayantsing-db jayantsing-db enabled auto-merge (squash) September 10, 2025 12:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a memory-efficient columnar data access mechanism for JDBC result processing. Instead of materializing entire result sets into List<List<Object>> structures, it provides direct access to columnar data through a new ColumnarRowView class, resulting in significant memory reduction (up to 91% in testing) and improved CPU performance.

  • Introduces ColumnarRowView class for memory-efficient row-by-row data access
  • Updates LazyThriftResult to use columnar views instead of materialized row lists
  • Adds utility method in DatabricksThriftUtil for creating columnar views

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
ColumnarRowView.java New class providing direct columnar access with getValue(row, col) method and null handling
LazyThriftResult.java Refactored to use ColumnarRowView instead of List<List<Object>> for batch processing
DatabricksThriftUtil.java Added createColumnarView() utility method as memory-efficient alternative
ColumnarRowViewTest.java Comprehensive test coverage for all column types, null handling, and boundary conditions

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +144 to +154
private final BitSet nullBits;

public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
}

@Override
public Object getValue(int rowIndex) {
if (nullBits != null && nullBits.get(rowIndex)) {
return null;
Copy link

Copilot AI Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new BitSet for every column accessor could be memory-intensive for large datasets. Consider lazy initialization or caching the BitSet creation to improve memory efficiency.

Suggested change
private final BitSet nullBits;
public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
}
@Override
public Object getValue(int rowIndex) {
if (nullBits != null && nullBits.get(rowIndex)) {
return null;
private final byte[] nulls;
private BitSet nullBits;
public TypedColumnAccessor(List<T> values, byte[] nulls) {
this.values = values;
this.nulls = nulls;
this.nullBits = null; // Lazy initialization
}
@Override
public Object getValue(int rowIndex) {
if (nulls != null) {
if (nullBits == null) {
nullBits = BitSet.valueOf(nulls);
}
if (nullBits.get(rowIndex)) {
return null;
}

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion. Thanks. Will implement in subsequent PR.

@jayantsing-db jayantsing-db merged commit 43ac32b into databricks:main Sep 11, 2025
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants