SPARK-1583: Fix a bug that using java.util.HashMap by mistake by zsxwing · Pull Request #500 · apache/spark

zsxwing · 2014-04-23T06:36:52Z

JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why java.util.HashMap is used here rather than mutable.HashMap? Some methods of java.util.HashMap are not generic, so the compiler cannot help us find similar problems.
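To illustrate the point about non-generic methods, here is a minimal sketch. It is not the code changed by this patch: the `StageId` key type and the object name are hypothetical. `java.util.HashMap.get` is declared as `get(Object)`, so a key of the wrong type compiles and the lookup just misses, whereas `mutable.HashMap.get` takes the map's key type.

```scala
import java.util.{HashMap => JHashMap}
import scala.collection.mutable

object HashMapKeyTypeDemo {
  // Hypothetical key type, used only for illustration.
  final case class StageId(id: Int)

  def lookupInJavaMap(): Option[String] = {
    val m = new JHashMap[StageId, String]()
    m.put(StageId(1), "stage-1")
    // get(Object) accepts any argument, so a bare Int compiles.
    // The key is never found and get returns null.
    Option(m.get(1))
  }

  def lookupInScalaMap(): Option[String] = {
    val m = mutable.HashMap[StageId, String]()
    m.put(StageId(1), "stage-1")
    // m.get(1) would be rejected here with a type mismatch,
    // because get expects a StageId.
    m.get(StageId(1))
  }
}
```

With the Java map, the mistake only shows up at runtime as a missing entry; with the Scala map, the same mistake is a compile error.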

AmplabJenkins · 2014-04-23T06:37:55Z

Can one of the admins verify this patch?

rxin · 2014-04-23T06:38:00Z

Jenkins, test this please.

rxin · 2014-04-23T06:38:25Z

Mostly because java's HashMap is faster than Scala's ...

AmplabJenkins · 2014-04-23T06:42:55Z

Merged build triggered.

AmplabJenkins · 2014-04-23T06:43:02Z

Merged build started.

AmplabJenkins · 2014-04-23T07:18:49Z

Merged build finished.

AmplabJenkins · 2014-04-23T07:18:50Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14360/

zsxwing · 2014-04-23T08:20:30Z

If I understand correctly, SaveStageAndTaskInfo in SparkListenerSuite lacks appropriate synchronization to guarantee memory visibility. That's why this test fails sometimes.
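As a rough sketch of the kind of synchronization meant here (an assumption, since the actual `SaveStageAndTaskInfo` code is not shown in this thread): a listener that is written by one thread and read by another can guard both paths with the same lock. `RecordingListener` and `VisibilityDemo` are hypothetical names.

```scala
import scala.collection.mutable

object VisibilityDemo {
  // Hypothetical stand-in for a test listener; not the real SaveStageAndTaskInfo.
  class RecordingListener {
    private val stageIds = mutable.ArrayBuffer[Int]()

    // Intended to be called from the thread that delivers events.
    def onStageCompleted(stageId: Int): Unit = synchronized {
      stageIds += stageId
    }

    // Intended to be called from the test thread. Locking the same monitor
    // as the writer makes the writer's updates visible to this reader.
    def recorded: Seq[Int] = synchronized { stageIds.toList }
  }

  def demo(): Seq[Int] = {
    val listener = new RecordingListener
    val writer = new Thread(new Runnable {
      override def run(): Unit = (1 to 3).foreach(listener.onStageCompleted)
    })
    writer.start()
    // join() also establishes visibility in this small demo; the synchronized
    // blocks matter when the reader cannot join the writer thread.
    writer.join()
    listener.recorded
  }
}
```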

I'm back with another less trivial suggestion for ALS:

In ALS for implicit feedback, input values are treated as weights on squared-errors in a loss function (or rather, the weight is a simple function of the input r, like c = 1 + alpha*r). The paper on which it's based assumes that the input is positive. Indeed, if the input is negative, it will create a negative weight on squared-errors, which causes things to go haywire. The optimization will try to make the error in a cell as large as possible, and the result is silently bogus.

There is a good use case for negative input values though. Implicit feedback is usually collected from signals of positive interaction like a view or like or buy, but equally, can come from "not interested" signals. The natural representation is negative values.

The algorithm can be extended quite simply to provide a sound interpretation of these values: negative values should encourage the factorization to come up with 0 for cells with large negative input values, just as much as positive values encourage it to come up with 1.

The implications for the algorithm are simple:

* the confidence function value must not be negative, and so can become 1 + alpha*|r|
* the matrix P should have a value 1 where the input R is _positive_, not merely where it is non-zero. Actually, that's what the paper already says, it's just that we can't assume P = 1 when a cell in R is specified anymore, since it may be negative

This in turn entails just a few lines of code change in `ALS.scala`:

* `rs(i)` becomes `abs(rs(i))`
* When constructing `userXy(us(i))`, it's implicitly only adding where P is 1. That had been true for any us(i) that is iterated over, before, since these are exactly the ones for which P is 1. But now P is zero where rs(i) <= 0, and should not be added

I think it's a safe change because:

* It doesn't change any existing behavior (unless you're using negative values, in which case results are already borked)
* It's the simplest direct extension of the paper's algorithm
* (I've used it to good effect in production FWIW)

Tests included. I tweaked minor things en route:

* `ALS.scala` javadoc writes "R = Xt*Y" when the paper and rest of code defines it as "R = X*Yt"
* RMSE in the ALS tests uses a confidence-weighted mean, but the denominator is not actually sum of weights

Excuse my Scala style; I'm sure it needs tweaks.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#500 from srowen/ALSNegativeImplicitInput and squashes the following commits:

cf902a9 [Sean Owen] Support negative implicit input in ALS
953be1c [Sean Owen] Make weighted RMSE in ALS test actually weighted; adjust comment about R = X*Yt
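The two rules described in that commit message can be sketched as below. This is only an illustration of the formulas quoted there, not the code in `ALS.scala`; the object and function names are hypothetical.

```scala
object ImplicitFeedbackDemo {
  // Confidence weight c = 1 + alpha * |r|. Taking the absolute value keeps
  // the weight non-negative even when the input r is negative.
  def confidence(r: Double, alpha: Double): Double =
    1.0 + alpha * math.abs(r)

  // Preference p is 1 only where the input is strictly positive;
  // zero and negative inputs both map to 0.
  def preference(r: Double): Double =
    if (r > 0) 1.0 else 0.0
}
```

Under these definitions, a strongly negative input yields a large confidence and a preference of 0, which is the "encourage the factorization to come up with 0" behavior described above.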

rxin · 2014-04-23T18:16:39Z

Jenkins, retest this please.

rxin · 2014-04-23T18:17:00Z

Thanks @zsxwing. I've restarted the test. Do you have time to fix that flaky test?

AmplabJenkins · 2014-04-23T18:17:57Z

Merged build triggered.

AmplabJenkins · 2014-04-23T18:18:06Z

Merged build started.

AmplabJenkins · 2014-04-23T19:43:48Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-23T19:43:48Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14382/

rxin · 2014-04-23T21:12:20Z

Thanks. I've merged this.

JIRA: https://issues.apache.org/jira/browse/SPARK-1583

Does anyone know why `java.util.HashMap` is used here rather than `mutable.HashMap`? Some methods of `java.util.HashMap` are not generic, so the compiler cannot help us find similar problems.

Author: zsxwing <zsxwing@gmail.com>

Closes #500 from zsxwing/SPARK-1583 and squashes the following commits:

7bfd74d [zsxwing] SPARK-1583: Fix a bug that using java.util.HashMap by mistake

(cherry picked from commit a664606)
Signed-off-by: Reynold Xin <rxin@apache.org>

zsxwing · 2014-04-24T03:15:18Z

> Do you have time to fix that flaky test?

Sure. I need some time to confirm my guess.

rxin · 2014-04-24T04:59:04Z

Actually it's probably fixed here already: https://github.com/apache/spark/pull/516/files

zsxwing · 2014-04-24T05:20:09Z

Looks great.

This PR reverts back to using Scala 2.11

* Revert "Fix distribution publish to scala 2.12 apache#478"
* Revert "[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0"

* Refactor for logs and results arch directory

Multiple custom log paths currently exist across different OpenLab jobs. This patch tries to build a consistent mechanism and usage, to avoid confusing end users and developers:

Add the $LOGS_PATH and $RESULTS_PATH global environment variables. Prepare {{ ansible_user_dir }}/workspace/logs and {{ ansible_user_dir }}/workspace/test_results. All log files (such as debug logs) should be stored in $LOGS_PATH, and the final test results (such as binaries and artifacts) should be stored in $RESULTS_PATH.

Close: theopenlab/openlab#238

SPARK-1583: Fix a bug that using java.util.HashMap by mistake

7bfd74d

asfgit closed this in a664606 Apr 23, 2014

zsxwing deleted the SPARK-1583 branch May 18, 2014 09:50

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

SPARK-539 Workaround for absent MapRDBJsonSplit class (apache#500)

19d6183
