Skip to content

Conversation

@atuldambalkar
Copy link
Contributor

This code enhancement is for converting JDBC ResultSet Relational objects to Arrow columnar data Vector objects. Code is under director "java/adapter/jdbc/src/main".

The API has following static methods in the

class org.apache.arrow.adapter.jdbc.JdbcToArrow -

public static VectorSchemaRoot sqlToArrow(Connection connection, String query)
public static ArrowDataFetcher jdbcArrowDataFetcher(Connection connection, String tableName)

Utility uses following data mapping to convert JDBC/SQL data types to Arrow data types -
CHAR --> ArrowType.Utf8
NCHAR --> ArrowType.Utf8
VARCHAR --> ArrowType.Utf8
NVARCHAR --> ArrowType.Utf8
LONGVARCHAR --> ArrowType.Utf8
LONGNVARCHAR --> ArrowType.Utf8
NUMERIC --> ArrowType.Decimal(precision, scale)
DECIMAL --> ArrowType.Decimal(precision, scale)
BIT --> ArrowType.Bool
TINYINT --> ArrowType.Int(8, signed)
SMALLINT --> ArrowType.Int(16, signed)
INTEGER --> ArrowType.Int(32, signed)
BIGINT --> ArrowType.Int(64, signed)
REAL --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
FLOAT --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
DOUBLE --> ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
BINARY --> ArrowType.Binary
VARBINARY --> ArrowType.Binary
LONGVARBINARY --> ArrowType.Binary
DATE --> ArrowType.Date(DateUnit.MILLISECOND)
TIME --> ArrowType.Time(TimeUnit.MILLISECOND, 32)
TIMESTAMP --> ArrowType.Timestamp(TimeUnit.MILLISECOND, timezone=null)
CLOB --> ArrowType.Utf8
BLOB --> ArrowType.Binary

JUnit test cases under java/adapter/jdbc/src/test. Test cases uses H2 in-memory database.

I am still working on adding and automating additional test cases.

Atul Dambalkar and others added 30 commits February 15, 2018 06:22
Used to YAML for the test data.
Fixed issues in the vector creation for CLOB.
Fixed code to handle only one column in select query.
Used to YAML for the test data.
Fixed issues in the vector creation for CLOB.
Fixed code to handle only one column in select query.
Conflicts:
	java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrow.java
	java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java
	java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/JdbcToArrowTest.java
Copy link
Contributor

@laurentgo laurentgo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to checkstyle results: it looks like the way the code is indented is pretty different from the rest of the arrow code base...

public static VectorSchemaRoot sqlToArrow(ResultSet resultSet) throws SQLException, IOException {
Preconditions.checkNotNull(resultSet, "JDBC ResultSet object can not be null");

return sqlToArrow(resultSet, Calendar.getInstance());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timezone/Locale should always be specified (UTC, Locale.ROOT)?


RootAllocator rootAllocator = new RootAllocator(Integer.MAX_VALUE);
VectorSchemaRoot root = sqlToArrow(resultSet, rootAllocator, calendar);
rootAllocator.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the allocator is closed, I guess it means data is invalidated? You might prefer not to provide this method...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (rootAllocator.close()) was already removed and probably is referring to earlier code. So, we should be okay now.

case Types.VARBINARY:
case Types.LONGVARBINARY:
updateVector((VarBinaryVector)root.getVector(columnName),
// rs.getBytes(i), !rs.wasNull(), rowCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be removed?

}

} catch (Exception e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should probably left as-is if you want the test framework to fail properly?


} catch (Exception e) {
e.printStackTrace();
} finally {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about using try(with-resources) pattern?


</dependencies>

<build>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check/fix the indentation?

}

private static void updateVector(BitVector bitVector, boolean value, boolean isNonNull, int rowCount) {
NullableBitHolder holder = new NullableBitHolder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to use the holder vs calling directly bitVector.setSafe(rowCount, isNonNull ? 1 : 0, value ? 1: 0) (cc @siddharthteotia )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can continue to use holder so as to be consistent with other parts of the code. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fine to use any APIs exposed by vectors

if (read == -1) {
break;
}
arrowBuf.setBytes(total, new ByteArrayInputStream(bytes, 0, read), read);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to wrap to a stream and that there's a method accepting an byte[] argument directly

assertEquals(rowCount, varBinaryVector.getValueCount());

for(int j = 0; j < varBinaryVector.getValueCount(); j++){
assertEquals(Arrays.hashCode(values[j]), Arrays.hashCode(varBinaryVector.get(j)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not checking for array equality using assertArrayEquals?

binary_field12 BINARY(100), varchar_field13 VARCHAR(256), blob_field14 BLOB, clob_field15 CLOB, char_field16 CHAR(16), bit_field17 BIT);'

data:
- 'INSERT INTO table1 VALUES (101, 1, 45, 12000, 92233720, 17345667789.23, 56478356785.345, 56478356785.345, PARSEDATETIME(''12:45:35 GMT'', ''HH:mm:ss z''),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that all the rows are basically the same for all the tests? is there any specific reason for it? (compared to only have one row for example...)

Copy link
Contributor Author

@atuldambalkar atuldambalkar Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no particular reason for that. It just gives you more number of rows.

@wesm
Copy link
Member

wesm commented Jun 15, 2018

@atuldambalkar @YashpalThakur @yashpal is this no longer WIP? If so, can you update the PR title? I think this needs a last look from @siddharthteotia before merging

@atuldambalkar
Copy link
Contributor Author

@wesm We are pretty much done with respect to code review comments changes and now waiting for last comments or PR merge from @laurentgo and @siddharthteotia

@atuldambalkar atuldambalkar changed the title ARROW-1780 - [WIP] JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects ARROW-1780 - JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects Jun 15, 2018
@atuldambalkar
Copy link
Contributor Author

Removed the [WIP] from the title.

Double[] arr = new Double[values.length];
int i = 0;
for (String str: values) {
arr[i++] = Double.parseDouble(str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this indentation is correct and consistent with what is followed in rest of the codebase

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please do checkstyle validation on all files?

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Jun 15, 2018

LGTM. But I think we need to fix indentation that has been followed in this patch. Seems like it is different from what is there is rest of the Java code base in Arrow.

@atuldambalkar , can you please do checkstyle validation? I was under the impression that it was a part of travis-ci build

@atuldambalkar
Copy link
Contributor Author

Thanks, @siddharthteotia for your comments. I have fixed the indentation for all the JDBC adapter code files based on the checkstyle warnings. Please take a look.

@siddharthteotia
Copy link
Contributor

@wesm , @xhochy, what is the general practice when squashing commits and merging? This PR has 148 commits and do we want all those individual commit messages to be part of commit message in master? Or should the PR owner squash and push again with a proper commit message for the overall feature?

@siddharthteotia siddharthteotia merged commit e17f95d into apache:master Jun 19, 2018
@wesm
Copy link
Member

wesm commented Jun 21, 2018

@siddharthteotia it may have been better to use the merge tool for this patch since there were multiple authors (at least 2) -- we can do some squashing ourselves to try to preserve at least the authorship (though in this case it would be a lot of work). GitHub doesn't handle multiple authors in the merge UI as far as I can tell.

I think this is an exceptional case; if something like this happens again we should either clean up the commits before merging, or use the GitHub UI and write the author's names in the commit message

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
… Data Format Vector Objects (apache#1759)

This patch adds JDBC adapter support for Arrow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants