Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rounding functions to the Column type #6817

Merged
merged 31 commits into from
Jun 1, 2023
Merged

Conversation

GregoryTravis
Copy link
Contributor

@GregoryTravis GregoryTravis commented May 23, 2023

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@GregoryTravis GregoryTravis changed the title Wip/gmt/6805 col round Add rounding functions to the Column type #6805 May 23, 2023
@GregoryTravis GregoryTravis linked an issue May 23, 2023 that may be closed by this pull request
@GregoryTravis GregoryTravis changed the title Add rounding functions to the Column type #6805 Add rounding functions to the Column type May 24, 2023
@GregoryTravis GregoryTravis marked this pull request as ready for review May 24, 2023 19:23
Comment on lines +749 to +752
ceil : Column ! Invalid_Value_Type
ceil self = Value_Type.expect_numeric self <|
fun = _.ceil
Column_Ops.map_over_storage self fun make_long_builder skip_nothing=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure if these operations should be implemented in Enso.

Currently most of the 'primitive' operations, especially on numeric columns are implemented in Java, for better efficiency.

Do we want to move towards doing these in Enso? I think we should measure what is the difference in performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue.

Currently I see that the ceil method may return an EnsoBigInteger (refer to the decimal/CeilNode.java).

However, currently the Table library storage does not support Big Integer storage - only LongStorage. So this will likely fail in some way.

I think we need to have tests checking what happens in the scenario where the double value magnitude exceeds the long range, to verify the behaviour.

IMO currently it could be valid to return Nothing in such a case and attach some kind of Arithmetic_Overflow exception. Extending Table to handle Big Integers is a separate issue (currently unscheduled, @jdunkerley do we have plans to support this?)

Copy link
Member

@radeusgd radeusgd May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear about the first issue. The map_over_storage is OK. Its usage is warranted when the operation we perform on each row must run some Enso logic (some complicated Enso function that is not worth replicating into Java, or has so expensive logic anyway the benefit of inlining it to Java would be negligible).

But I think we shouldn't use Enso operations for primitive operations and instead implement them directly in Java. That is the current architecture of the primitive-value storages in the Table library. We can discuss if it should be changed, but to do so I think we need to do some performance measurements to take informed decisions.

(For example one of the reasons we do the primitive operations in Java is that on long/double types it allows to completely avoid boxing the values, which I don't think is avoidable when talking between Enso and Java).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think worth making this a vectorised operation on the NumberStorage.

We can use the Core_Utils to share the common implementation. Lets put it as a follow up ticket though please.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the Core_Utils to share the common implementation. Lets put it as a follow up ticket though please.

We could make the round implementation in Core_Utils, but then it would no longer be able to be written in Enso...

As for all the others - ceil, floor and I think truncate - we seem to be doing just the basic Java operations Math.ceil etc. in the 'primitive' path. The only additional handling is to handle promotion to BigInteger if the result exceeds the range of long. But BigInteger is currently not supported by Table anyway, so I don't think sharing the implementation of these gives us any value at this moment - because the pure Enso side needs the additional BigInteger handling that has some Truffle magic, and the Table side needs to just report a warning/failure in case of an overflow - the common part is just delegating to Math.ceil and co. which IMO isn't enough to make sharing worth it.

(But I'd make these vectorized nonetheless, I'm only unsure about round as that would require us moving the whole implementation into Java - it could make it more efficient for tables though, so may as well be worth it, but it's quite a bit more work).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - not need to put in Core_Utils then.

Worth us seeing the relative performance of rounding 1,000,000 values as a map vs a vectorised op.

Copy link
Member

@radeusgd radeusgd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API and tests look ok.

  1. I think we need to have tests checking what is the behaviour when the double value exceeds the long range - since the Table currently does not support Big Integer storage, I'm not sure what will happen and I'd like to verify it.
  2. Ideally, I don't think using map_over_storage for the primitive numeric operations like ceil and floor, possibly also truncate is the right choice. I think to be consistent with the architecture of the Table library, these should be implemented as UnaryMapOperation.

I think it is fine to use map_over_storage for round since it includes pretty complex Enso logic which we don't want to replicate into Java. But the other operations, if I'm reading correctly, just delegate to Java methods, so I think they should be 'vectorized'.

@radeusgd
Copy link
Member

I reckon the 'vectorization' of the operations into Java will happen as a separate task.

  1. Still I think we need tests for the case where the double is too big to store in our LongStorage after rounding:
polyglot java import java.lang.Long

... =
	max_long = Java_Long.MAX_VALUE
	too_big_double = (max_long + 1.0) * 100.0
	table = Table.new [["X", [1.0, 2.9, too_big_double, 12.1]]]
    col = table.at "X"
    c1 = col.ceil
	c1 . to_vector . should_equal ?
	# One of these:
	Problems.assume_no_problems c1
	Problems.expect_warning Arithmetic_Error c1

	c2 = col.floor
	c3 = col.truncate
	c4 = col.round
	...
  1. I assume that DB operations are not in scope of this PR. However, to keep the APIs consistent (we really should add a test verifying the consistency at some point...) please add method stubs that throw Unsupported_Database_Operation.Error "Operation not yet implemented in the Database backend".

Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a follow up to look at vectorising the operations and the relative performance.

Otherwise LGTM.

Comment on lines +749 to +752
ceil : Column ! Invalid_Value_Type
ceil self = Value_Type.expect_numeric self <|
fun = _.ceil
Column_Ops.map_over_storage self fun make_long_builder skip_nothing=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - not need to put in Core_Utils then.

Worth us seeing the relative performance of rounding 1,000,000 values as a map vs a vectorised op.

@@ -585,7 +585,7 @@ spec =
231.2 . round . should_be_a Integer
231.2 . round -1 . should_be_a Integer

Test.specify "Edge cases" <|
Test.specify "Edge cases" pending="Re-enable this if the 15-digit restriction is removed" <|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to ever remove this restriction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think we should just remove these tests.

Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @radeusgd small comments - worth clearing those up before merging.

@@ -585,7 +585,7 @@ spec =
231.2 . round . should_be_a Integer
231.2 . round -1 . should_be_a Integer

Test.specify "Edge cases" <|
Test.specify "Edge cases" pending="Re-enable this if the 15-digit restriction is removed" <|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope

@radeusgd radeusgd mentioned this pull request May 30, 2023
5 tasks
@GregoryTravis GregoryTravis added the CI: Ready to merge This PR is eligible for automatic merge label May 31, 2023
@mergify mergify bot merged commit 0337180 into develop Jun 1, 2023
@mergify mergify bot deleted the wip/gmt/6805-col-round branch June 1, 2023 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add rounding functions to the Column type
3 participants