
[SPARK-24834][CORE] use java comparison for float and double #21794


Closed


@bavardage commented Jul 17, 2018

The methods `nanSafeCompareDoubles` and `nanSafeCompareFloats` in `Utils` have
the same semantics as Java's `java.lang.Double.compare` and `java.lang.Float.compare`,
respectively.

What changes were proposed in this pull request?

Remove the two functions from `Utils` and replace the call sites (only two for each function) with the corresponding Java comparison method.

How was this patch tested?

Existing Spark tests should cover this.
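
As a quick illustration (a REPL sketch, not from the PR itself), the JDK comparison already orders NaN the way the `Utils` helpers do: NaN compares equal to itself and greater than everything else.

java.lang.Double.compare(Double.NaN, Double.NaN)              // 0: NaN equals NaN for ordering
java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) // 1: NaN sorts above +Infinity
java.lang.Double.compare(1.0, Double.NaN)                     // -1: everything sorts below NaN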

@bavardage bavardage changed the title [SPARK-24834] use java comparison for float and double [SPARK-24834][spark-core] use java comparison for float and double Jul 17, 2018
@bavardage bavardage changed the title [SPARK-24834][spark-core] use java comparison for float and double [SPARK-24834][CORE] use java comparison for float and double Jul 17, 2018
@AmplabJenkins

Can one of the admins verify this patch?

@bavardage (Author)

cc @JoshRosen, who introduced this code originally, I think.

@bavardage (Author)

There is one way in which the `nanSafeCompare*` methods do differ from the Java built-ins, but it wasn't captured in the test cases (or in the suggestive naming of the methods): the handling of -0.0 vs 0.0.

java.lang.Double.compare(-0.0d, 0.0d) == -1      // the JDK orders -0.0 strictly below 0.0
Utils.nanSafeCompareDoubles(-0.0d, 0.0d) == 0    // the Utils helper treats the two zeros as equal

If that behaviour is indeed required, I could abandon this PR and instead fix the test cases to capture this.

@kiszk (Member) commented Jul 17, 2018

I think this behaviour is required, since Spark SQL does not distinguish 0.0 from -0.0. Am I correct?
cc @gatorsmile @maropu

@kiszk (Member) commented Jul 17, 2018

It would be good to add test cases for this, since it is not covered now.

@srowen (Member) commented Jul 17, 2018

BTW, if it becomes necessary not to change the semantics, I think the methods could at least be streamlined a bit:

// A streamlined equivalent of Utils.nanSafeCompareDoubles that preserves
// the existing semantics: -0.0 and 0.0 compare equal, and NaN sorts
// greater than everything, including itself.
def nanSafeCompareDoubles(x: Double, y: Double): Int = {
  if (x < y) {
    -1
  } else if (x > y) {
    1
  } else if (x == y) {
    // Covers -0.0 vs 0.0, since IEEE-754 == treats the two zeros as equal.
    0
  } else {
    // All three comparisons were false, so at least one operand is NaN.
    if (java.lang.Double.isNaN(x)) {
      if (java.lang.Double.isNaN(y)) {
        0
      } else {
        1
      }
    } else {
      -1
    }
  }
}

More tests can't hurt, either.
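
A quick sanity sketch (illustrative, not from the PR) exercising the tricky inputs against the streamlined method above:

assert(nanSafeCompareDoubles(-0.0d, 0.0d) == 0)        // the two zeros compare equal
assert(nanSafeCompareDoubles(Double.NaN, Double.NaN) == 0)
assert(nanSafeCompareDoubles(Double.NaN, 1.0d) == 1)   // NaN sorts greatest
assert(nanSafeCompareDoubles(1.0d, Double.NaN) == -1)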

@bavardage (Author)

It does seem that Spark currently does distinguish -0.0 and 0.0, at least as far as groupBys go:

scala> case class Thing(x : Float)
defined class Thing

scala> val df = Seq(Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f),Thing(0.0f), Thing(-0.0f)).toDF
2018-07-17 13:17:08 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [x: float]

scala> df.groupBy($"x").count
res0: org.apache.spark.sql.DataFrame = [x: float, count: bigint]

scala> res0.collect
res1: Array[org.apache.spark.sql.Row] = Array([-0.0,4], [0.0,4])

Doubles are hashed via `doubleToLongBits` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L338), which gives different bitwise representations for positive and negative zero.
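
The underlying fact, as a REPL sketch (not from the PR): `doubleToLongBits` preserves the IEEE-754 sign bit, so the two zeros produce different hash inputs even though `==` treats them as equal.

java.lang.Double.doubleToLongBits(0.0d)    // 0L
java.lang.Double.doubleToLongBits(-0.0d)   // -9223372036854775808L (only the sign bit set)
-0.0d == 0.0d                              // true: ordinary comparison conflates the zeros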

@srowen (Member) commented Jul 18, 2018

spark-sql suggests that -0.0 and 0.0 are considered the same, though: `SELECT -0.0 == 0.0;` returns true. It's probably essential not to change behavior here, but if performance is the issue, I think the method can be optimized.
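
For reference, a spark-sql session sketch of the query above (output formatting approximate):

spark-sql> SELECT -0.0 == 0.0;
true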

@srowen (Member) commented Aug 17, 2018

I think we'd have to close this due to the behavior change, but would merge an optimization of the existing behavior.

@srowen srowen mentioned this pull request Aug 20, 2018
@asfgit asfgit closed this in b8788b3 Aug 21, 2018
@bavardage (Author)

Yep, fair. The intent, I think, was clarity rather than performance: it's misleading to have a method named 'nan safe' that has no special handling of NaNs. I'll look at opening a different PR that could increase clarity and may have a minor perf benefit.
