Skip to content

[WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect #25344

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

shivsood
Copy link
Contributor

@shivsood shivsood commented Aug 3, 2019

What changes were proposed in this pull request?

Change mapping of BYTETYPE from BYTE to TinyINT.

Why are the changes needed?

Writing dataframe with column type BYTETYPE fails when using JDBC connector for SQL Server. The problem is due to

  • (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped to TINYINT
    case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from the point of view of upcasting, but will lead to 4 byte allocation rather than 1 byte for BYTETYPE.

  • (read path) Read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in RDD row. The value is set per the data type. Here there is no mapping for BYTETYPE and thus results will result in an error when getCatalystType() is fixed.

Note : These issues were found when reading/writing with SQLServer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Path was tested with integration test by adding test case to MsSqlServerIntegrationSuite.scala. Test case passed on local tests.

…erverDialect.scala.

The fix fails with the following error when df.write is done with a dataframe what contains a ByteType.

19/08/02 18:25:44 INFO Executor: Finished task 0.0 in stage 7.0 (TID 7). 1197 bytes result sent to driver
19/08/02 18:25:44 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 7) in 43 ms on localhost (executor driver) (1/2)
19/08/02 18:25:44 INFO CodeGenerator: Code generated in 14.586963 ms
19/08/02 18:25:44 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 8)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of tinyint
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, serialNum), ByteType) AS serialNum#231
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:344)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$1(SparkSession.scala:367)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:662)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:845)
@shivsood shivsood changed the title [WIP][SPARK-2815][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Aug 3, 2019
@shivsood
Copy link
Contributor Author

shivsood commented Aug 3, 2019

@dongjoon-hyun @srowen This is PR for ByteType fix. I am seeing a catalyst exception post my fix. I will need your help/guidance on resolving this. I see that i fails in ValidateExternalType() with mismatched type , but i am not able to get to exact flow of events here.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this supersede one of your earlier PRs?

@dongjoon-hyun
Copy link
Member

Hi, @shivsood . So, do you mean this PR fails at UT in your local environment?

@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Aug 5, 2019

Test build #108642 has finished for PR 25344 at commit 1abf3fc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 5, 2019

Hi, @shivsood . Please fix the scala style issues first which @srowen already commented.

@shivsood
Copy link
Contributor Author

shivsood commented Aug 5, 2019

Hi, @shivsood . So, do you mean this PR fails at UT in your local environment?

@dongjoon-hyun No. I have added a new E2E test case where i create a table with BYTETYPE. That fails. I was aware of that failure and that why the fix. Post fix its failing at the next level ( refer to How was patch tested section of the PR).

@shivsood
Copy link
Contributor Author

shivsood commented Aug 5, 2019

Hi, @shivsood . Please fix the scala style issues first which @srowen already commented.

@dongjoon-hyun Sorry about these. I would fix. Have to check why these are not coming in my build. For me the source code build gives scala style warning, but test code does not seem to give me warnings.

…yte and not java.lang.interger. Fixed styling issues noted in the review comments
@shivsood
Copy link
Contributor Author

shivsood commented Aug 5, 2019

es this supersede one of your earlier PRs?

Yes.

Copy link
Contributor Author

@shivsood shivsood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the scala style issues first which @srowen already com

Done. I realize dev/stylecheck should be run manually for test cases. I presumed its get run automatically when i build. Apparently not for test code.

@SparkQA
Copy link

SparkQA commented Aug 5, 2019

Test build #108681 has finished for PR 25344 at commit 3a5a621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • assert(types(1).equals(\"class java.lang.Byte\"))

@shivsood shivsood changed the title [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Oct 15, 2019
@dongjoon-hyun
Copy link
Member

Thank you for updating, @shivsood ! 😄

@SparkQA
Copy link

SparkQA commented Oct 15, 2019

Test build #112079 has finished for PR 25344 at commit 7658bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • assert(colType(0) == \"class java.lang.Byte\")

@@ -21,6 +21,8 @@ import java.math.BigDecimal
import java.sql.{Connection, Date, Timestamp}
import java.util.Properties

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this. I'll give the updated short test code.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the test code according to the comment.
(And, I revised the PR description according to the latest template).

@maropu
Copy link
Member

maropu commented Oct 23, 2019

@shivsood ping, are you there?

@shivsood
Copy link
Contributor Author

@shivsood ping, are you there?

@maropu Yes, Would handle this this week. Please note that ShortType issue is fixed SPARK-28152. Or are you pointing to a different issue?

@maropu
Copy link
Member

maropu commented Oct 24, 2019

That should be like this? https://github.com/apache/spark/pull/25344/files#diff-391379a5ec51082e2ae1209db15c02b3R549

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setInt(pos + 1, row.getShort(pos))

=>

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setShort(pos + 1, row.getShort(pos))

?

@srowen
Copy link
Member

srowen commented Oct 29, 2019

Ping @shivsood

@shivsood
Copy link
Contributor Author

shivsood commented Oct 30, 2019

That should be like this? https://github.com/apache/spark/pull/25344/files#diff-391379a5ec51082e2ae1209db15c02b3R549

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setInt(pos + 1, row.getShort(pos))

=>

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setShort(pos + 1, row.getShort(pos))

?
@maropu Thanks for pointing out. From the code this looks like a problem. I dont know how to repro a failure here. Will send out a separate PR for this in some time. Do you suspect that this code should lead some corruption that's hard to repro. IMO that a short(2 bytes) will get allocated here, but this code will end up writing as a integer.

@maropu Created JIRA https://issues.apache.org/jira/browse/SPARK-29644 for this issue. Will send a PR with the resolution and test.

@maropu The PR for the fix is #26301

@SparkQA
Copy link

SparkQA commented Oct 30, 2019

Test build #112872 has finished for PR 25344 at commit ec5315d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Oct 31, 2019

@shivsood @maropu so that I can keep track of this -- it seems like this is a combination of SPARK-29644 and SPARK-28152, but for bytes, right?

The question there was testing, whether the test should be more general, because the fix is general. Same here, how much of this is specific to SQL Server? the setByte part isn't, it seems.

And the question there too was whether a deeper fix is needed? does that need to be reflected here? I'm pretty OK to fix this isolated issue right now if it by itself solves a problem.

@maropu
Copy link
Member

maropu commented Oct 31, 2019

@shivsood Can you solve the "SQL server-specific" issue of MsSqlServerDialect in this pr, first? Then, you could address more general issues of JdbcUtils in following prs.

@shivsood
Copy link
Contributor Author

shivsood commented Nov 1, 2019

@shivsood Can you solve the "SQL server-specific" issue of MsSqlServerDialect in this pr, first? Then, you could address more general issues of JdbcUtils in following prs.
@maropu the fix will not work for "SQL Server" if change in JDBCUtils is not done. As far as i remember, it gives an error in the read path. Let me check this again and get back. Ideally i want to keep the "SQL server" fix and generic fix separate.

I am putting this PR back into WIP for now.

@shivsood shivsood changed the title [SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Nov 1, 2019
@maropu
Copy link
Member

maropu commented Nov 1, 2019

@maropu the fix will not work for "SQL Server" if change in JDBCUtils is not done. As far as i remember, it gives an error in the read path. Let me check this again and get back. Ideally i want to keep the "SQL server" fix and generic fix separate.

Ah, I see. If so, how about fixing the more general issue in JDBCUtils, first? Then, fixing the "SQL server" issue.

@shivsood
Copy link
Contributor Author

shivsood commented Nov 1, 2019

Ah, I see. If so, how about fixing the more general issue in JDBCUtils, first? Then, fixing the "SQL server" issue

That would be right approach. It would be quite a test effort. I have to get familiar with unit test and E2E test for all other servers. Will start some work on that soon.

@maropu
Copy link
Member

maropu commented Nov 1, 2019

Thanks, @shivsood !

@shivsood
Copy link
Contributor Author

shivsood commented Nov 1, 2019

Thanks, @shivsood !

Thanks a lot for your comments every one.

@srowen
Copy link
Member

srowen commented Nov 2, 2019

Is this superseded by #26301 ?

@shivsood
Copy link
Contributor Author

shivsood commented Nov 5, 2019

Is this superseded by #26301 ?

Yes, #26301 supersedes this fix. Will remove this PR after #26301 is accepted.

@shivsood
Copy link
Contributor Author

Closing as this is fixed by PR #26301 which is now accepted to master.

@shivsood shivsood closed this Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants