ARROW-1780 - JDBC Adapter to convert Relational Data objects to Arrow Data Format Vector Objects #1759
Conversation
…or objects creation.
Used YAML for the test data. Fixed issues in the vector creation for CLOB.
Fixed the code to handle a single-column select query.
Added a new test file.
Conflicts:
  java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrow.java
  java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java
  java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/JdbcToArrowTest.java
…ush in forked branch
Pull Request created for merging Code Coverage related changes
You might want to check the checkstyle results: it looks like the way the code is indented is quite different from the rest of the Arrow code base...
public static VectorSchemaRoot sqlToArrow(ResultSet resultSet) throws SQLException, IOException {
  Preconditions.checkNotNull(resultSet, "JDBC ResultSet object can not be null");

  return sqlToArrow(resultSet, Calendar.getInstance());
Timezone/Locale should always be specified (UTC, Locale.ROOT)?
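A minimal sketch of what that might look like, assuming the sqlToArrow overload that takes a Calendar (shown in the diff above); resultSet is whatever ResultSet the caller already has:

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

// Build the Calendar with an explicit timezone and locale rather than relying
// on the JVM defaults picked up by Calendar.getInstance().
Calendar utcCalendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT);
VectorSchemaRoot root = JdbcToArrow.sqlToArrow(resultSet, utcCalendar);
```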
RootAllocator rootAllocator = new RootAllocator(Integer.MAX_VALUE);
VectorSchemaRoot root = sqlToArrow(resultSet, rootAllocator, calendar);
rootAllocator.close();
If the allocator is closed, I guess it means the data is invalidated? You might prefer not to provide this method...
This (rootAllocator.close()) was already removed; the comment probably refers to earlier code. So we should be okay now.
case Types.VARBINARY:
case Types.LONGVARBINARY:
  updateVector((VarBinaryVector) root.getVector(columnName),
  //    rs.getBytes(i), !rs.wasNull(), rowCount);
Is this commented-out line to be removed?
  }

} catch (Exception e) {
  e.printStackTrace();
This exception should probably not be caught (just let it propagate) if you want the test framework to fail properly?
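A possible sketch of that alternative (the test and variable names are illustrative, not the actual test code): declare the checked exceptions on the test method so JUnit reports a failure instead of the stack trace being printed and swallowed.

```java
// Hypothetical test method: no catch block, so any SQLException/IOException
// surfaces as a test failure.
@Test
public void testJdbcToArrowValues() throws SQLException, IOException {
  VectorSchemaRoot root = JdbcToArrow.sqlToArrow(resultSet);
  // ... assertions on root ...
}
```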
} catch (Exception e) {
  e.printStackTrace();
} finally {
What about using the try-with-resources pattern?
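A minimal sketch of the try-with-resources suggestion; the H2 URL and query are illustrative only, not taken from the patch.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// All three resources are closed automatically in reverse order, even when an
// exception is thrown, so no explicit finally block is needed.
try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT * FROM table1")) {
  VectorSchemaRoot root = JdbcToArrow.sqlToArrow(rs);
  // ... work with root ...
}
```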
java/adapter/jdbc/pom.xml
  </dependencies>

  <build>
Can you check/fix the indentation?
}

private static void updateVector(BitVector bitVector, boolean value, boolean isNonNull, int rowCount) {
  NullableBitHolder holder = new NullableBitHolder();
Is it better to use the holder, or to call bitVector.setSafe(rowCount, isNonNull ? 1 : 0, value ? 1 : 0) directly? (cc @siddharthteotia)
I think we can continue to use the holder so as to be consistent with other parts of the code. What do you think?
I think it's fine to use any of the APIs exposed by the vectors.
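For reference, a small sketch contrasting the two forms discussed above, using the parameter names from the updateVector signature shown in the diff:

```java
// Holder-based form, as in the patch:
NullableBitHolder holder = new NullableBitHolder();
holder.isSet = isNonNull ? 1 : 0;
holder.value = value ? 1 : 0;
bitVector.setSafe(rowCount, holder);

// Direct form suggested in the review (no intermediate holder object):
bitVector.setSafe(rowCount, isNonNull ? 1 : 0, value ? 1 : 0);
```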
if (read == -1) {
  break;
}
arrowBuf.setBytes(total, new ByteArrayInputStream(bytes, 0, read), read);
I think you don't need to wrap this in a stream; there's a method accepting a byte[] argument directly.
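A sketch of that suggestion, assuming the ArrowBuf overload that accepts a byte[] with an offset and length:

```java
// Copy the bytes read so far directly from the array, no stream wrapper needed.
arrowBuf.setBytes(total, bytes, 0, read);
```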
assertEquals(rowCount, varBinaryVector.getValueCount());

for (int j = 0; j < varBinaryVector.getValueCount(); j++) {
  assertEquals(Arrays.hashCode(values[j]), Arrays.hashCode(varBinaryVector.get(j)));
Why not check for array equality using assertArrayEquals?
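A sketch of the suggested assertion, using the same loop as in the diff:

```java
import static org.junit.Assert.assertArrayEquals;

// Compare the byte[] contents element by element instead of comparing hash
// codes, which can collide for different arrays.
for (int j = 0; j < varBinaryVector.getValueCount(); j++) {
  assertArrayEquals(values[j], varBinaryVector.get(j));
}
```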
  binary_field12 BINARY(100), varchar_field13 VARCHAR(256), blob_field14 BLOB, clob_field15 CLOB, char_field16 CHAR(16), bit_field17 BIT);'

data:
  - 'INSERT INTO table1 VALUES (101, 1, 45, 12000, 92233720, 17345667789.23, 56478356785.345, 56478356785.345, PARSEDATETIME(''12:45:35 GMT'', ''HH:mm:ss z''),
I noticed that all the rows are basically the same for all the tests. Is there any specific reason for that (compared to having only one row, for example)?
There is no particular reason for that; it just gives you a larger number of rows.
Files committed to merge the changes made to address the review comments.
@atuldambalkar @YashpalThakur @yashpal is this no longer WIP? If so, can you update the PR title? I think this needs a last look from @siddharthteotia before merging.
@wesm We are pretty much done with the code review changes and are now waiting for final comments or a PR merge from @laurentgo and @siddharthteotia.
Removed the [WIP] from the title.
Double[] arr = new Double[values.length];
int i = 0;
for (String str : values) {
  arr[i++] = Double.parseDouble(str);
I don't think this indentation is correct or consistent with what is followed in the rest of the codebase.
Can you please do checkstyle validation on all files?
LGTM. But I think we need to fix the indentation that has been followed in this patch; it seems different from the rest of the Java code base in Arrow. @atuldambalkar, can you please run checkstyle validation? I was under the impression that it was part of the Travis CI build.
Thanks, @siddharthteotia, for your comments. I have fixed the indentation for all the JDBC adapter code files based on the checkstyle warnings. Please take a look.
@siddharthteotia it may have been better to use the merge tool for this patch since there were multiple authors (at least 2) -- we can do some squashing ourselves to try to preserve at least the authorship (though in this case it would be a lot of work). GitHub doesn't handle multiple authors in the merge UI as far as I can tell. I think this is an exceptional case; if something like this happens again we should either clean up the commits before merging, or use the GitHub UI and write the authors' names in the commit message.
… Data Format Vector Objects (apache#1759)

This patch adds JDBC adapter support for Arrow.
This code enhancement converts JDBC ResultSet relational objects to Arrow columnar Vector objects. The code is under the directory "java/adapter/jdbc/src/main".
The API has the following static methods in the class org.apache.arrow.adapter.jdbc.JdbcToArrow (a minimal usage sketch follows the signatures):
public static VectorSchemaRoot sqlToArrow(Connection connection, String query)
public static ArrowDataFetcher jdbcArrowDataFetcher(Connection connection, String tableName)
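A minimal usage sketch of these two entry points; the H2 connection URL, query, and table name are illustrative only and not part of the adapter.

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.arrow.adapter.jdbc.ArrowDataFetcher;
import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.vector.VectorSchemaRoot;

public class JdbcToArrowExample {
  public static void main(String[] args) throws Exception {
    try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test")) {
      // One-shot conversion of a query result into Arrow vectors
      VectorSchemaRoot root = JdbcToArrow.sqlToArrow(connection, "SELECT * FROM table1");

      // Handle for fetching a table's data incrementally
      ArrowDataFetcher fetcher = JdbcToArrow.jdbcArrowDataFetcher(connection, "table1");
    }
  }
}
```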
The utility uses the following data mapping to convert JDBC/SQL data types to Arrow data types (an illustrative sketch of the corresponding Arrow Field definitions follows the list):
CHAR --> ArrowType.Utf8
NCHAR --> ArrowType.Utf8
VARCHAR --> ArrowType.Utf8
NVARCHAR --> ArrowType.Utf8
LONGVARCHAR --> ArrowType.Utf8
LONGNVARCHAR --> ArrowType.Utf8
NUMERIC --> ArrowType.Decimal(precision, scale)
DECIMAL --> ArrowType.Decimal(precision, scale)
BIT --> ArrowType.Bool
TINYINT --> ArrowType.Int(8, signed)
SMALLINT --> ArrowType.Int(16, signed)
INTEGER --> ArrowType.Int(32, signed)
BIGINT --> ArrowType.Int(64, signed)
REAL --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
FLOAT --> ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
DOUBLE --> ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
BINARY --> ArrowType.Binary
VARBINARY --> ArrowType.Binary
LONGVARBINARY --> ArrowType.Binary
DATE --> ArrowType.Date(DateUnit.MILLISECOND)
TIME --> ArrowType.Time(TimeUnit.MILLISECOND, 32)
TIMESTAMP --> ArrowType.Timestamp(TimeUnit.MILLISECOND, timezone=null)
CLOB --> ArrowType.Utf8
BLOB --> ArrowType.Binary
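As an illustration of the mapping, here is a hedged sketch of a few of the resulting Arrow Field definitions (the field names are hypothetical, not taken from the adapter):

```java
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// INTEGER  --> 32-bit signed int
Field intField = new Field("int_field", FieldType.nullable(new ArrowType.Int(32, true)), null);
// DOUBLE   --> double-precision floating point
Field doubleField = new Field("double_field",
    FieldType.nullable(new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)), null);
// VARCHAR  --> UTF-8 string
Field varcharField = new Field("varchar_field", FieldType.nullable(new ArrowType.Utf8()), null);
```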
JUnit test cases are under java/adapter/jdbc/src/test. The test cases use an H2 in-memory database.
I am still working on adding and automating additional test cases.