Implement VectorUdf and use it in Queries 1 and 8 of TPCH benchmarks. #127

Merged: 10 commits, Jun 7, 2019

eerhardt (Member) commented May 31, 2019

As requested in #45, this brings Vector UDF support to .NET for Apache Spark.

See https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html for the equivalent Python capabilities. This first round of implementation supports PandasUDFType.SCALAR type of UDFs, but grouped UDFs can be implemented in the future.

On an Azure DS5 Ubuntu 18 box, I'm seeing the following results in these two TPCH queries using the 1 GB dataset and a single core --master local:

| Query   | One-at-a-time | Vectorized |
|---------|---------------|------------|
| Query 1 | 10.3 s        | 8.3 s      |
| Query 8 | 30.7 s        | 29.1 s     |

Notes

  1. I created a new "Experimental" project because exposing VectorUdf using the Arrow types directly is not the long-term plan. The hope is that there will be a fully featured DataFrame type in .NET that can be created from an Arrow RecordBatch, with usable APIs. VectorUdf will be considered an "experimental" API until we have a stable long-term plan for how to pass columnar data to a .NET method.
  2. For other TPCH queries that use the same logic to compute the discount, VectorUdf gave me worse performance numbers. My assumption is that those queries don't process as much data, so the cost of converting to a columnar format (Arrow) outweighs the benefits of vectorization.
  3. I could have used the new C# hardware intrinsics APIs in .NET Core 3.0 to perform the vectorization, but that would either force everyone to use a preview 3.0 SDK, or I would have to perform unnatural build gymnastics to allow building without the 3.0 SDK. So instead, I chose to use Vector<T>, which still uses vectorization under the covers and works on the current targeted frameworks.
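
To make the "one-at-a-time" versus "vectorized" distinction in the table and notes above concrete, here is a minimal sketch in plain Python. It is not the .NET API added by this PR: the function names, the batch size, and the `price * (1 - discount)` formula are assumptions chosen for illustration, and plain lists stand in for Arrow columns. The point is only the calling convention: a row-wise UDF is invoked once per row, while a scalar vector UDF is invoked once per batch and returns a column of the same length.

```python
from typing import List, Sequence


def disc_price(price: float, discount: float) -> float:
    """Row-wise UDF (hypothetical): called once for every row."""
    return price * (1.0 - discount)


def disc_price_batch(prices: Sequence[float],
                     discounts: Sequence[float]) -> List[float]:
    """Scalar vector UDF (hypothetical): called once per batch of rows.

    Takes whole columns and returns a column of the same length.
    """
    return [p * (1.0 - d) for p, d in zip(prices, discounts)]


def run_one_at_a_time(prices: Sequence[float],
                      discounts: Sequence[float]) -> List[float]:
    # One UDF invocation per row.
    return [disc_price(p, d) for p, d in zip(prices, discounts)]


def run_vectorized(prices: Sequence[float],
                   discounts: Sequence[float],
                   batch_size: int = 4) -> List[float]:
    # One UDF invocation per batch; slicing stands in for the
    # row-to-columnar conversion step, which has its own cost.
    out: List[float] = []
    for start in range(0, len(prices), batch_size):
        end = start + batch_size
        out.extend(disc_price_batch(prices[start:end], discounts[start:end]))
    return out


if __name__ == "__main__":
    prices = [100.0, 200.0, 50.0, 80.0, 10.0]
    discounts = [0.0, 0.5, 0.1, 0.25, 0.0]
    # Both paths compute the same values; only the call pattern differs.
    print(run_one_at_a_time(prices, discounts) == run_vectorized(prices, discounts))
```

Both paths produce the same column. Whether the batched path is faster depends on how the per-call overhead it saves compares with the cost of building the batches, which is consistent with the mixed results described in note 2.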

stephentoub (Member) commented

Nice.

eerhardt added 4 commits June 3, 2019 12:53
- Remove Vector<T> usage, instead have simple price calculation methods.
- Formatting fix ups
- Change generic type names A1 => T1
eerhardt (Member, Author) commented Jun 4, 2019

I've responded to all feedback and updated the PR. Please take a look.

rapoth added this to the June 2019 milestone on Jun 5, 2019

suhsteve (Member) commented Jun 6, 2019

LGTM

imback82 (Contributor) left a comment

LGTM. Thanks @eerhardt!

imback82 merged commit c3a65a2 into dotnet:master on Jun 7, 2019
eerhardt deleted the VectorUdf branch on Jun 7, 2019 12:25