[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils #26301
Conversation
…e write/read tables from dataframe with cols as shorttype
...integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
… JDBCWriterSuite and E2E test case in dockers integration suite
So this change includes all the fixes we have been talking about in related PRs? It fixes two general problems in JDBCUtils, and one or two other mapping issues in MS SQL support? that could be OK. I think we can remove WIP from the title as it looks like a good proposal.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
Test build #4913 has finished for PR 26301 at commit
@@ -202,4 +202,46 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite {
     df2.write.jdbc(jdbcUrl, "datescopy", new Properties)
     df3.write.jdbc(jdbcUrl, "stringscopy", new Properties)
   }

+  test("Write tables with ShortType") {
I think this is a bug, so can you append the JIRA ID in the prefix?
@maropu is the JIRA ID required in all the places where I have added new test cases?
Yea, plz follow `test("SPARK-29644: test title")`
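The requested convention can be sketched as below. The `test` helper here is a stand-in stub, not ScalaTest's, only to keep the sketch self-contained, and the title-checking regex is illustrative:

```scala
// Stand-in for ScalaTest's `test`, just to make this sketch runnable on its own.
// The point is the title convention: "<JIRA-ID>: <description>".
def test(name: String)(body: => Unit): Unit = {
  require(name.matches("""SPARK-\d+: .+"""), s"test title should carry the JIRA ID: $name")
  body
}

test("SPARK-29644: Write tables with ShortType") {
  // the real body would round-trip a ShortType column through JDBC
}
```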
Yes, with this PR we are fixing the core issue in JDBCUtils, so changes in the MSSQL dialect are not required. (On WIP) I still have some work pending to fix some tests. Once that's done, I will move this from WIP to ready.
I just didn't check this because of WIP; personally, I think we could merge this once all the test failures are fixed.
…getByte and getShort instead of getInt. Added PR number to test title as suggested
Test build #4925 has finished for PR 26301 at commit
Test build #4924 has finished for PR 26301 at commit
Test build #4928 has finished for PR 26301 at commit
Still WIP?
No, we are good. Tests passed. I am moving this to ready now.
...integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
...integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala
Force-pushed from 2ac5829 to 9cdad2b
Retest this please.
Thank you for pinging me. I also agree with you guys about the follow-up for test case refactoring.
Thank you @maropu @srowen @dongjoon-hyun. I'll also create a request to port this change to 2.4.
Test build #113737 has finished for PR 26301 at commit
Sure. Thanks!
## What changes were proposed in this pull request?

This is a port of SPARK-26499 to 2.4. Modified JdbcUtils.makeGetter to handle ByteType.

## How was this patch tested?

Added a new test to JDBCSuite that maps `TINYINT` to `ByteType`.

Closes #23400 from twdsilva/tiny_int_support.

Authored-by: Thomas D'Silva <tdsilva@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### Why are the changes needed?

Changes are required to port PR #26301 (SPARK-29644) to 2.4.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Yes, tested on 2.4 using the docker integration tests for MySQL, MsSQLServer, and Postgres:

```sh
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MySQLIntegrationSuite
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite
```

Closes #26531 from shivsood/pr_26499_2.4_port.

Lead-authored-by: shivsood <shivsood@microsoft.com>
Co-authored-by: Thomas D'Silva <tdsilva@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
I have four major comments about this PR:
- The changes made in this PR introduce external behavior changes. I believe we should document them in the migration guide.
- Also, is TINYINT => IntegerType really wrong? Mapping to ByteType looks strange.
- More importantly, introducing such a behavior change in our maintenance releases should be avoided.
- Any such change should check the boundary values of each data type for each dialect; otherwise, it is easy to introduce a regression. Obviously, the test case coverage is not good now.
@@ -59,7 +59,7 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite {
       """
         |INSERT INTO numbers VALUES (
         |0,
-        |255, 32767, 2147483647, 9223372036854775807,
+        |127, 32767, 2147483647, 9223372036854775807,
Ditto - #26549 (comment). Can anyone explain why it was possible, and how do we handle unsigned cases?
Continuing from #26549 (comment): TINYINT looks like it's a single byte alright, so using ByteType is reasonable. However, it looks like it's treated as signed by some but not all DBMSes. Is it unsigned in SQL Server?
Just checking: types like TINYINT and SMALLINT are not standard types, although widely supported, right? Should these types be used by default for all JDBC sources?
Yeah, I have some more doubts now that the TINYINT issue was pointed out. @shivsood
Got it. @gatorsmile , @HyukjinKwon , @srowen . I overlooked that mismatch between TINYINT vs Byte in this PR.
SMALLINT is widely supported because it's SQL-92. For TINYINT, I agree that we need to revert partially for that type.
Thanks all for pointing out these issues. I had overlooked the handling of unsigned cases and the fact that each database may define these ranges on its own. I think the problem exists for both SMALLINT and TINYINT.
- TINYINT can range anywhere from -128 to +127 signed, and 0 to 255 unsigned, in both SQL Server and MySQL.
- SMALLINT can take an unsigned value of 0 to 65535 per MySQL.
My understanding is that there are no unsigned types in Spark, c.f. https://spark.apache.org/docs/latest/sql-reference.html. Is that assertion right?
Do we have a test for an integer with the value 4294967295? Per the MySQL documentation, an unsigned integer can have that value.
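As a standalone sanity check on the ranges discussed above (plain Scala, not Spark code; the constants are the unsigned maxima mentioned in this thread):

```scala
// Unsigned DBMS maxima vs. signed JVM primitive ranges; plain Scala, not Spark code.
val unsignedTinyIntMax  = 255           // TINYINT UNSIGNED max
val unsignedSmallIntMax = 65535         // SMALLINT UNSIGNED max
val unsignedIntMax      = 4294967295L   // INT UNSIGNED max (MySQL)

// Each unsigned max overflows the same-width signed type...
assert(unsignedTinyIntMax  > Byte.MaxValue)
assert(unsignedSmallIntMax > Short.MaxValue)
assert(unsignedIntMax      > Int.MaxValue)

// ...but fits in the next wider signed type.
assert(unsignedTinyIntMax  <= Short.MaxValue)
assert(unsignedSmallIntMax <= Int.MaxValue)
assert(unsignedIntMax      <= Long.MaxValue)
```

This is why a widening conversion on read (TINYINT to ShortType, SMALLINT to IntegerType) covers both the signed and unsigned cases.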
The behavior would be as follows.
Overwrite scenario
- (ShortType -> Int) If a Spark df has a ShortType column, on overwrite a DBMS table with type Int will get created. Because Spark's ShortType is -32768 to +32767, only these values can be written.
- (ByteType -> SmallInt) If a Spark df has a ByteType column, on overwrite a DBMS table with type SmallInt will get created. Because Spark's ByteType is -128 to +127, only these values can be written.
Read scenario
- If an existing table in the DBMS has type TinyInt, a read in Spark would result in a ShortType. Because the ShortType range in Spark is -32768 to +32767, both the DBMS signed range of -128 to +127 and the unsigned range of 0 to 255 will be handled.
- If an existing table in the DBMS has type SmallInt, a read would result in a Spark dataframe column of type Integer. Because the Integer range in Spark is -2147483648 to +2147483647, both the DBMS signed range of -32768 to +32767 and the unsigned range of 0 to 65535 will be handled.
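The widening rules above can be modeled in a few lines. This is illustrative only: the real logic lives in JdbcUtils.scala and the per-dialect type mappers, and the type names here are plain strings standing in for Catalyst types:

```scala
// Illustrative model of the widening conversions described above; not the actual
// JdbcUtils API. Writes pick a wider DBMS column type, reads pick a wider Spark type.
object WideningModel {
  // Write/overwrite path: Spark type -> DBMS column type that can hold every value.
  def dbmsColumnFor(sparkType: String): String = sparkType match {
    case "ByteType"  => "SMALLINT" // -128..127 always fits in SMALLINT
    case "ShortType" => "INT"      // -32768..32767 always fits in INT
    case other       => sys.error(s"not modeled here: $other")
  }

  // Read path: DBMS column type -> Spark type covering signed AND unsigned ranges.
  def sparkTypeFor(dbmsType: String): String = dbmsType match {
    case "TINYINT"  => "ShortType"   // covers -128..127 and unsigned 0..255
    case "SMALLINT" => "IntegerType" // covers -32768..32767 and unsigned 0..65535
    case other      => sys.error(s"not modeled here: $other")
  }
}

println(WideningModel.dbmsColumnFor("ByteType"))  // SMALLINT
println(WideningModel.sparkTypeFor("TINYINT"))    // ShortType
```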
So we have to use a widening conversion both ways. That's the safest thing to do, I guess, and less of a change from the current behavior, where bytes go all the way to ints.
Ur, I see. right.
Anyway, thanks for the explanation, @shivsood
Raised PR #27172 with the proposed fix. ShortType is unchanged; only the ByteType fix is modified to map to ShortType on the read path, to enable support for the 0 to 255 range.
@maropu @srowen @HyukjinKwon FYI. Thanks.
What changes were proposed in this pull request?
Corrected the ShortType and ByteType mappings to SmallInt and TinyInt, and corrected the setter methods to set ShortType and ByteType via setShort() and setByte(). Changes are in JDBCUtils.scala.
Fixed unit test cases where applicable and added new E2E test cases to test table read/write using ShortType and ByteType.
Problems
case java.sql.Types.TINYINT => IntegerType
At line 172, TINYINT was wrongly interpreted as IntegerType on read.
case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))
Here ShortType was wrongly written out as INTEGER.
Throughout the tests, ShortType and ByteType were being interpreted as IntegerType.
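The shape of the corrected mapping can be sketched as follows. This is a hedged sketch using `java.sql.Types` constants: the actual change is in JdbcUtils' getter/setter and type-mapping code, which works on Catalyst DataType objects rather than the strings used here:

```scala
import java.sql.Types

// Sketch only: after the fix, ShortType and ByteType keep their own JDBC types
// instead of collapsing to INTEGER.
def jdbcTypeFor(catalystType: String): (String, Int) = catalystType match {
  case "ShortType" => ("SMALLINT", Types.SMALLINT) // was ("INTEGER", Types.SMALLINT)
  case "ByteType"  => ("TINYINT",  Types.TINYINT)  // was read back as IntegerType
  case other       => sys.error(s"not modeled here: $other")
}

println(jdbcTypeFor("ShortType")) // (SMALLINT,5)
println(jdbcTypeFor("ByteType"))  // (TINYINT,-6)
```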
Why are the changes needed?
A given type should be set using the right type.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Corrected unit test cases where applicable; validated in CI/CD.
Added test cases in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala, and MySQLIntegrationSuite.scala to write/read tables from a dataframe with columns of ShortType and ByteType. Validated manually as follows.