
[SPARK-3121] Wrong implementation of implicit bytesWritableConverter #2712


Closed
wants to merge 5 commits

Conversation

james64
Contributor

@james64 james64 commented Oct 8, 2014

val path = ... // path to a seq file with BytesWritable as the type of both key and value
val file = sc.sequenceFile[Array[Byte], Array[Byte]](path)
file.take(1)(0)._1

This prints incorrect content of the byte array: the actual content starts correctly, but "random" bytes and zeros are appended. BytesWritable has two methods:

getBytes() - returns the whole internal array, which is often longer than the actual value stored. It usually contains the remainder of previous, longer values.

copyBytes() - returns just the beginning of the internal array, as determined by the internal length property.

It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes.
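The difference can be sketched without Hadoop on the classpath; here a plain byte array stands in for BytesWritable's recycled internal buffer, and all names are illustrative only:

```scala
import java.util.Arrays

object StaleBytesSketch extends App {
  // A previous, longer value left stale bytes after the current one.
  val internalBuffer = Array[Byte](1, 2, 3, 9, 9, 0, 0, 0)
  val validLength = 3 // what BytesWritable#getLength would report

  // getBytes-style access: the whole buffer, stale tail included.
  val viaGetBytes = internalBuffer

  // copyBytes-style access: only the first validLength bytes.
  val viaCopy = Arrays.copyOfRange(internalBuffer, 0, validLength)

  println(viaGetBytes.mkString(",")) // 1,2,3,9,9,0,0,0
  println(viaCopy.mkString(","))     // 1,2,3
}
```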

@dbtsai

@AmplabJenkins

Can one of the admins verify this patch?

@dbtsai
Member

dbtsai commented Oct 8, 2014

Jenkins, please start the test.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2712 at commit 480f9cd.

  • This patch merges cleanly.

@sryza
Contributor

sryza commented Oct 8, 2014

Great catch.

A concern is that calling Array#take requires an implicit conversion, which has some performance impact that might be unacceptable for a method that can get called in a tight loop.

http://villane.wordpress.com/2008/02/02/learning-scala-performance-impact-of-implicit-conversions/
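For comparison, the two ways of trimming the array look like this; `take` goes through the implicit `Array => ArrayOps` wrapper on every call, while `Arrays.copyOfRange` is a plain static call (a sketch for illustration, not code from the patch):

```scala
import java.util.Arrays

object TrimComparison extends App {
  val buf = Array[Byte](1, 2, 3, 9, 9)
  val n = 3

  // Scala collection method: triggers the implicit ArrayOps conversion.
  val viaTake = buf.take(n)

  // Java static call: no implicit conversion involved.
  val viaCopyOfRange = Arrays.copyOfRange(buf, 0, n)

  assert(viaTake.sameElements(viaCopyOfRange))
}
```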

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2712 at commit 480f9cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@james64
Contributor Author

james64 commented Oct 8, 2014

Originally I wanted to just replace the getBytes method with copyBytes. It is available in newer versions of the API, but I found that an older version is imported in Spark. I am not very familiar yet with which Hadoop API Spark uses. So do you suggest implementing it without the take method?

@sryza
Contributor

sryza commented Oct 9, 2014

Hmm, yeah, copyBytes is no good if it doesn't appear in Hadoop 1.

My suggestion would be to use copyOfRange from java.util.Arrays.
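Applied to the converter in question, the suggestion would look roughly like the following (a sketch against Hadoop's BytesWritable, with an illustrative helper name, not the final patch):

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable

// Only the first getLength bytes of the recycled internal buffer are
// valid, so copy that prefix out rather than exposing getBytes directly.
def bytesWritableToArray(bw: BytesWritable): Array[Byte] =
  Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
```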

@james64
Contributor Author

james64 commented Oct 9, 2014

I pushed the new version. I guess the Jenkins test will kick off automatically, right?

@sryza
Contributor

sryza commented Oct 9, 2014

Jenkins, test this please.

@AmplabJenkins

Can one of the admins verify this patch?

@dbtsai
Member

dbtsai commented Oct 10, 2014

Jenkins, test this please.

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have started for PR 2712 at commit f92ffa6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2712 at commit f92ffa6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21602/

@@ -1409,7 +1411,7 @@ object SparkContext extends Logging {
     simpleWritableConverter[Boolean, BooleanWritable](_.get)

   implicit def bytesWritableConverter(): WritableConverter[Array[Byte]] = {
-    simpleWritableConverter[Array[Byte], BytesWritable](_.getBytes)
+    simpleWritableConverter[Array[Byte], BytesWritable](bw => Arrays.copyOfRange(bw.getBytes, 0, bw.getLength))
Contributor


It looks like this goes past 100 characters.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have started for PR 2712 at commit 406e26c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2712 at commit 406e26c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21603/

@dbtsai
Member

dbtsai commented Oct 11, 2014

It's failing at FlumeStreamSuite.scala:109, which seems to be unrelated to this patch.

@sryza
Contributor

sryza commented Oct 11, 2014

One more nit: the added java import should go with the other java imports.

@sryza
Contributor

sryza commented Oct 11, 2014

Otherwise, LGTM

@james64
Contributor Author

james64 commented Oct 11, 2014

Can it be that the Flume test failed due to upstream changes? It is passing for me locally now.

@SparkQA

SparkQA commented Oct 11, 2014

QA tests have started for PR 2712 at commit 1b20d51.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 11, 2014

QA tests have finished for PR 2712 at commit 1b20d51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21641/

@JoshRosen
Contributor

That particular Flume test is known to be flaky; I think that TD is working on a rewrite / fix for that test suite.

import org.apache.hadoop.io.BytesWritable

class SparkContextSuite extends FunSuite {
  test("test of writing spark scala test") {
Contributor


This test could use a better name. I'd also add a comment, like // Regression test for SPARK-3121 to help readers link this back to the JIRA.
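A version along those lines could exercise the buffer reuse directly; this is a sketch (the suite name, test name, and values are illustrative, not the merged test):

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable
import org.scalatest.FunSuite

class BytesWritableConversionSuite extends FunSuite {
  // Regression test for SPARK-3121
  test("BytesWritable conversion copies only the valid prefix") {
    val bw = new BytesWritable()
    bw.set(Array[Byte](1, 2, 3, 4, 5, 6), 0, 6) // grows the internal buffer
    bw.set(Array[Byte](7, 8, 9), 0, 3)          // shorter value, buffer is reused

    val converted = Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
    assert(converted.sameElements(Array[Byte](7, 8, 9)))
  }
}
```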

@JoshRosen
Contributor

Actually, ignore my earlier (deleted) comments; this looks like a valid issue (see HADOOP-6298: "BytesWritable#getBytes is a bad name that leads to programming mistakes").

@@ -1409,7 +1410,9 @@ object SparkContext extends Logging {
     simpleWritableConverter[Boolean, BooleanWritable](_.get)

   implicit def bytesWritableConverter(): WritableConverter[Array[Byte]] = {
-    simpleWritableConverter[Array[Byte], BytesWritable](_.getBytes)
+    simpleWritableConverter[Array[Byte], BytesWritable](bw =>
+      Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
Contributor


Could you add a one-line comment here that explains why we need to make this copy?

@JoshRosen
Contributor

This looks good to me; sorry for my earlier confusion. If you add a comment and change the name of the test, I'll merge this and cherry-pick it back into branch-1.1 and branch-1.0.

@james64
Contributor Author

james64 commented Oct 11, 2014

Sorry about the test name. Now it should all be fine, including the comments.

@SparkQA

SparkQA commented Oct 11, 2014

QA tests have started for PR 2712 at commit f85d24c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 12, 2014

QA tests have finished for PR 2712 at commit f85d24c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21649/

@JoshRosen
Contributor

This looks great; thanks for adding the comments. I'm going to merge this into master and backport it to branch-1.0 and branch-1.1.

@asfgit asfgit closed this in fc616d5 Oct 13, 2014
asfgit pushed a commit that referenced this pull request Oct 13, 2014
val path = ... // path to a seq file with BytesWritable as the type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1

This prints incorrect content of the byte array: the actual content starts correctly, but "random" bytes and zeros are appended. BytesWritable has two methods:

getBytes() - returns the whole internal array, which is often longer than the actual value stored. It usually contains the remainder of previous, longer values.

copyBytes() - returns just the beginning of the internal array, as determined by the internal length property.

It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes.

dbtsai

Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>

Closes #2712 from james64/3121-bugfix and squashes the following commits:

f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed

(cherry picked from commit fc616d5)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
asfgit pushed a commit that referenced this pull request Oct 13, 2014
@james64 james64 deleted the 3121-bugfix branch October 13, 2014 11:30