Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Is_In and Not_In to Filter_Condition #3790

Merged
merged 28 commits into from
Oct 15, 2022

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Oct 11, 2022

Pull Request Description

Implements https://www.pivotaltracker.com/story/show/183389945

Important Notes

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the
    Scala,
    Java,
    and
    Rust
    style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH
      ./run ide build and ./run ide watch.

@radeusgd radeusgd self-assigned this Oct 11, 2022
@radeusgd radeusgd force-pushed the wip/radeusgd/filter-condition-is-in-183389945 branch from 06ecf9f to 0c10218 Compare October 12, 2022 10:33
@radeusgd radeusgd marked this pull request as ready for review October 12, 2022 12:52
@@ -1190,7 +1190,6 @@ lazy val parser = (project in file("lib/scala/parser"))
s"-Djava.library.path=$root/target/rust/debug"
},
libraryDependencies ++= Seq(
"com.storm-enroute" %% "scalameter" % scalameterVersion % "bench",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this parser project is used at all, I think it was a prototype of the Rust integration, superseded by the new Rust parser. Can we just remove the whole project? cc: @hubertp @jdunkerley

I removed this line in particular as bench config is not available for this project and cleaning it didn't work because of that. But if possible maybe we should remove it altogether if it is indeed unused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this can be removed. But maybe @JaroslavTulach can confirm since he is tinkering with the integration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it builds after removing the line feel free to remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line yep, I was meaning about removing the whole subproject - it seems to me it's some old Rust parser integration effort (it's not the old parser still in use today, that's syntax) - which was superseded by yours effort and I don't think it is used anywhere. So I think we should ditch it altogether to avoid the maintenance.

Comment on lines +944 to +1029
expected_max_time_ms = init.first.total_milliseconds * 2
check_timing name ~action =
res = Duration.time_execution action
runtime_ms = res.first.total_milliseconds
if runtime_ms > expected_max_time_ms then
Test.fail "Expected `Is_In` on "+name+" to be efficient, but it took "+runtime_ms.to_text+"ms while initialization itself took just "+expected_max_time_ms.to_text+"ms."

check_timing "integers" <|
t.filter "X" (Filter_Condition.Is_In vec) . at "X" . to_vector . should_equal expected_vector

check_timing "booleans" <|
t.filter (t.at "X" % 2 == 0) (Filter_Condition.Is_In bool_vec) . at "X" . to_vector . should_equal expected_vector_2

check_timing "dates" <|
t.filter date_col (Filter_Condition.Is_In dates_vec) . at "X" . to_vector . should_equal expected_vector
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a timing check just to have a decent comparison and:

init = 501
integers = 31
booleans = 28
dates = 35

so init*2 seems to still have a decent buffer.

For example if I disable the vectorized variant for booleans it gives:

    - [FAILED] should perform `Is_In` efficiently for builtin types [14626ms]
        Reason: Expected `Is_In` on booleans to be efficient, but it took 13852ms while initialization itself took just 1472ms.

So we can see that 2*init is a decent buffer - for good implementation it gives enough time, but the wrong implementation exceeds that greatly anyway. Let's see how it behaves on the CI though.

@radeusgd radeusgd force-pushed the wip/radeusgd/filter-condition-is-in-183389945 branch from f260e78 to c319934 Compare October 13, 2022 07:45
Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major a few comments to improve but looks good.

* <p>It indicates whether the vector contained a null value and contains a hashmap of the vector
* elements for faster contains checks.
*/
public record CompactRepresentation(HashSet<Object> coercedValues, boolean hasNulls) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do a generic version? I don't think will make any real difference but wondered if nicer to have HashSet<String> etc for example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried. It was a bit more complex than we'd like, but maybe the changes could open venues for writing more type-aware operations in the future.

I first made the SpecializedIsInOp generic as asked ( 2430d46 ), to do so I needed to introduce some type markers to make stuff compile.

Then in 5eb9c3f I have removed the temporarily added TypedStorage<T> marker interface and just made Storage know its type Storage<T>. This required some more work though.

Please tell me if you think it's worth keeping both of these changes or only one of them or none - I'll amend as necessary and then get the PR merged finally.

Copy link
Contributor

@hubertp hubertp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 re sbt change

@radeusgd radeusgd force-pushed the wip/radeusgd/filter-condition-is-in-183389945 branch 2 times, most recently from 4c3bdee to 5eb9c3f Compare October 14, 2022 11:59
@radeusgd radeusgd force-pushed the wip/radeusgd/filter-condition-is-in-183389945 branch from ef188f2 to bc5b520 Compare October 14, 2022 15:43
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Oct 15, 2022
@radeusgd radeusgd force-pushed the wip/radeusgd/filter-condition-is-in-183389945 branch from 5913848 to ae84e89 Compare October 15, 2022 10:44
@mergify mergify bot merged commit 82de8f8 into develop Oct 15, 2022
@mergify mergify bot deleted the wip/radeusgd/filter-condition-is-in-183389945 branch October 15, 2022 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants