Skip to content

[SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils #26301

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

shivsood
Copy link
Contributor

@shivsood shivsood commented Oct 30, 2019

What changes were proposed in this pull request?

Corrected ShortType and ByteType mapping to SmallInt and TinyInt, corrected setter methods to set ShortType and ByteType as setShort() and setByte(). Changes in JDBCUtils.scala
Fixed Unit test cases to where applicable and added new E2E test cases in to test table read/write using ShortType and ByteType.

Problems

  • In master in JDBCUtils.scala line number 547 and 551 have a problem where ShortType and ByteType are set as Integers rather than set as Short and Byte respectively.
case ShortType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getShort(pos))
The issue was pointed out by @maropu 

case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
 stmt.setInt(pos + 1, row.getByte(pos))
  • Also at line JDBCUtils.scala 247 TinyInt is interpreted wrongly as IntergetType in getCatalystType()

case java.sql.Types.TINYINT => IntegerType

  • At line 172 ShortType was wrongly interpreted as IntegerType
    case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))

  • All thru out tests, ShortType and ByteType were being interpreted as IntegerTypes.

Why are the changes needed?

A given type should be set using the right type.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Corrected Unit test cases where applicable. Validated in CI/CD
Added a test case in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala , MySQLIntegrationSuite.scala to write/read tables from dataframe with cols as shorttype and bytetype. Validated by manual as follows.

./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12

…e write/read tables from dataframe with colms as shorttype
@shivsood shivsood changed the title [SPARK-29644] [SQL] Corrected ShortType wrongly set as Int in JDBCUtils.scala. [WIP] [SPARK-29644] [SQL] Corrected ShortType wrongly set as Int in JDBCUtils.scala. Oct 30, 2019
… JDBCWriterSuite and E2E test case in dockers integration suite
@shivsood shivsood changed the title [WIP] [SPARK-29644] [SQL] Corrected ShortType wrongly set as Int in JDBCUtils.scala. [WIP] [SPARK-29644] [SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils Nov 1, 2019
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this change includes all the fixes we have been talking about in related PRs? It fixes two general problems in JDBCUtils, and one or two other mapping issues in MS SQL support? that could be OK. I think we can remove WIP from the title as it looks like a good proposal.

@SparkQA
Copy link

SparkQA commented Nov 2, 2019

Test build #4913 has finished for PR 26301 at commit 66dfd4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • assert(colType(0) == \"class java.lang.Short\")
  • assert(colType(0) == \"class java.lang.Byte\")

@@ -202,4 +202,46 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite {
df2.write.jdbc(jdbcUrl, "datescopy", new Properties)
df3.write.jdbc(jdbcUrl, "stringscopy", new Properties)
}

test("Write tables with ShortType") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bug, so can you append the JIRA ID in the prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu is JIRA addition required at all places i have added a new test cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, plz follow test("SPARK-29644: test title")

@shivsood
Copy link
Contributor Author

shivsood commented Nov 5, 2019

So this change includes all the fixes we have been talking about in related PRs? It fixes two general problems in JDBCUtils, and one or two other mapping issues in MS SQL support? that could be OK. I think we can remove WIP from the title as it looks like a good proposal.

Yes, with this PR we are fixing the core issue in JDBCUtils. So changes in MSSQL dialect are not required.

(on WIP) I still have some work pending to fix some test. Once that's done , will move this from WIP to ready.

@srowen
Copy link
Member

srowen commented Nov 8, 2019

@shivsood @maropu I'd love to get this into Spark 3 as it seems like a good fix. Is this good to go other than the title and test descriptions? sounds like it will let us close or unblock some other changes.

@shivsood
Copy link
Contributor Author

shivsood commented Nov 8, 2019

@shivsood @maropu I'd love to get this into Spark 3 as it seems like a good fix. Is this good to go other than the title and test descriptions? sounds like it will let us close or unblock some other changes.

@srowen I have a failing test to fix. Will try to get that out by this weekend.

@maropu
Copy link
Member

maropu commented Nov 8, 2019

I just didn't check this because of WIP though, I personally think we could merge this if all the test failures fixed.

…getByte and getShort instead of getInt. Added PR number to test title as suggested
@SparkQA
Copy link

SparkQA commented Nov 11, 2019

Test build #4925 has finished for PR 26301 at commit 9cdad2b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2019

Test build #4924 has finished for PR 26301 at commit 9cdad2b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2019

Test build #4928 has finished for PR 26301 at commit 9cdad2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Nov 11, 2019

still WIP?

@shivsood
Copy link
Contributor Author

still WIP?

No, we are good. Test passed. I am moving this to ready now.

@shivsood shivsood changed the title [WIP] [SPARK-29644] [SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils [SPARK-29644] [SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils Nov 12, 2019
@shivsood shivsood force-pushed the shorttype_fix_maropu branch from 2ac5829 to 9cdad2b Compare November 13, 2019 22:04
Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29644] [SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils [SPARK-29644][SQL] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils Nov 14, 2019
@dongjoon-hyun
Copy link
Member

Retest this please.

@dongjoon-hyun
Copy link
Member

Thank you for pinging me. I also agree with you guys about the follow-up for test case refactoring.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.

I also verified this locally. This also passed with MsSqlServerIntegrationSuite, OracleIntegrationSuite, PostgresIntegrationSuite, MySQLIntegrationSuite.

Thank you so much, @shivsood , @maropu , @srowen .

@shivsood
Copy link
Contributor Author

Thank you @maropu @srowen @dongjoon-hyun. 'll also create a request to port this change to 2.4

@SparkQA
Copy link

SparkQA commented Nov 14, 2019

Test build #113737 has finished for PR 26301 at commit 0600894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Sure. Thanks!

dongjoon-hyun pushed a commit that referenced this pull request Nov 15, 2019
## What changes were proposed in this pull request?
This is a port of SPARK-26499 to 2.4
Modifed JdbcUtils.makeGetter to handle ByteType.

## How was this patch tested?
Added a new test to JDBCSuite that maps ```TINYINT``` to ```ByteType```.
Closes #23400 from twdsilva/tiny_int_support.
Authored-by: Thomas D'Silva <tdsilvaapache.org>
Signed-off-by: Hyukjin Kwon <gurwls223apache.org>

### Why are the changes needed?
Changes are required to port pr #26301 (SPARK-29644) to 2.4

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Yes, tested on 2.4 with using docker integration test for MySQL, MsSQLServer, Postgres
``sh
/build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MySQLIntegrationSuite
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite
``

Closes #26531 from shivsood/pr_26499_2.4_port.

Lead-authored-by: shivsood <shivsood@microsoft.com>
Co-authored-by: Thomas D'Silva <tdsilva@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have four major comments about this PR:

  • The changes made in this PR introduced the external changes. I believe we should document it in the migration guide.
  • Also, TINYINT => IntegerType is wrong? Mapping to ByteType looks strange.
  • More importantly, introducing such a behavior change in our maintenance releases should be avoided.
  • Any such a change should check the boundary values for each data type for each dialect; otherwise, it is easy to introduce the regression. Obviously, the test case coverage is not good now.

cc @shivsood @dongjoon-hyun @maropu @srowen

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have four major comments about this PR:

  • The changes made in this PR introduced the external changes. I believe we should document it in the migration guide.
  • Also, TINYINT => IntegerType is wrong? Mapping to ByteType looks strange.
  • More importantly, introducing such a behavior change in our maintenance releases should be avoided.
  • Any such a change should check the boundary values for each data type for each dialect; otherwise, it is easy to introduce the regression. Obviously, the test case coverage is not good now.

cc @shivsood @dongjoon-hyun @maropu @srowen

@@ -59,7 +59,7 @@ class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite {
"""
|INSERT INTO numbers VALUES (
|0,
|255, 32767, 2147483647, 9223372036854775807,
|127, 32767, 2147483647, 9223372036854775807,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto - #26549 (comment). Can anyone explain why it was possible, and how do we handle unsigned cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing from #26549 (comment) : TINYINT looks like it's a single byte alright, so using ByteType is reasonable. However it looks like it's treated as signed by some but not all DBMSes. Is it unsigned in SQL Server?

Just checking: these types like TINYINT and SMALLINT are not standard types, although widely supported, right? should these types be used by default for all JDBC sources?

Yeah I have some more doubts now that the TINYINT issue was pointed out. @shivsood

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. @gatorsmile , @HyukjinKwon , @srowen . I overlooked that mismatch between TINYINT vs Byte in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SMALLINT is widely supported because it's SQL-92. For TINYINT, I agree that we need to revert partially for that type.

Copy link
Contributor Author

@shivsood shivsood Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all for pointing out these issues. I had overlooked handling of unsigned cases and the fact that each database may define on its own. I think the problem exists for both SMALLINT and TINYINT.

  • TINYINT is very clear that it can be range any where from -127 to +127 and unsigned as 0 to 255. Both SQLServer and MySQL
  • SMALLINT can take an unsigned value 0 to 65535 per MySQL

My understanding is that there are no unsigned type in Spark. c.f. https://spark.apache.org/docs/latest/sql-reference.html. Is that assertion right?

Do we have a test for an integer where integer value is 4294967295? Per MySQL documentation that's possible that an unsigned integer will have that value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior would be as follows.

Overwrite scenario

  1. (ShortType -> Int) If a spark df has a ShortType, on overwrite a DBMSS table with type Int will get created. Because Spark ShortyType is -32768 to +32767 only these values can be written.
  2. (ByteType -> SmallInt) In a spark df has a ByteType, on overwrite DBMSS table with type Short will get created. Because Spark ShortyType is -128 to +127 only these values can be written.

Read scenario

  1. If an existing table in DBMSS has type TinyInt, a read in Spark would results in a ShortType. Because ShortType range in Spark in -32786 to +32768, DBMSS signed value -127 to +127 and unsigned range of 0 to 255 will be handled.

  2. If an existing table in DBMSS has type SmallInt, a read would result in Spark dataframe having a column type as Integer. Because Integer range in Spark in -2147483648 to +2147483648, DBMSS signed value --32768 to +32768 as well as unsigned range of 0 to 65535 will be handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have to use a widening conversion both ways. That's the safest thing to do, I guess, and less of a change from the current behavior, where bytes go all the way to ints.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, I see. right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, thanks for the explanation, @shivsood

Copy link
Contributor Author

@shivsood shivsood Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised PR #27172 with the proposed fix. ShortType is unchanged, only ByteType fix is modified to map to ShortType on the read path so enable support for 0 to 255 range.

@maropu @srowen @HyukjinKwon as FYI. Thanks

dongjoon-hyun pushed a commit that referenced this pull request Nov 19, 2019
…o SmallInt and TinyInt in JDBCUtils

This reverts commit f7e53865 i.e PR #26301 from master

Closes #26583 from shivsood/revert_29644_master.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants