Implement VectorUdf and use it in Queries 1 and 8 of TPCH benchmarks. #127

eerhardt · 2019-05-31T23:24:02Z

As requested in #45, this brings Vector UDF support to .NET for Apache Spark.

See https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html for the equivalent Python capabilities. This first round of implementation supports only scalar UDFs (the equivalent of PandasUDFType.SCALAR); grouped UDFs can be implemented in the future.

On an Azure DS5 Ubuntu 18 box, I'm seeing the following results for these two TPCH queries, using the 1 GB dataset and a single core (--master local):

Query	One-at-a-time	Vectorized
Query 1	10.3 s	8.3 s
Query 8	30.7 s	29.1 s

Notes

- I created a new "Experimental" project because exposing VectorUdf using the Arrow types directly is not the long-term plan. The hope is that there will be a fully featured DataFrame type in .NET that can be created from an Arrow RecordBatch, with usable APIs. VectorUdf will be considered an "experimental" API until we have a stable long-term plan for how to pass columnar data to a .NET method.
- For other TPCH queries that use the same logic to compute the discount, using VectorUdf gave me worse performance numbers. My assumption is that the other queries aren't dealing with as much data, so the cost of converting to a columnar format (Arrow) outweighs the benefits of vectorization.
- I could have used the new C# hardware intrinsics APIs in .NET Core 3.0 to perform the vectorization, but that would either force everyone to use a preview 3.0 SDK, or I would have to perform unnatural build gymnastics to allow building without the 3.0 SDK. So instead, I chose to use Vector<T>, which still uses vectorization under the covers and works on the currently targeted frameworks.
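
For illustration only, here is a minimal sketch of how a discount calculation of the form price * (1 - discount) could be vectorized with System.Numerics.Vector<T>. The class name DiscountSketch, the method name ComputeDiscountPrice, and the use of plain double[] columns are assumptions made for this example; it is not the code in this PR, which operates on Arrow arrays.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration; not part of Microsoft.Spark.
public static class DiscountSketch
{
    // Computes price[i] * (1 - discount[i]) for every element.
    public static double[] ComputeDiscountPrice(double[] price, double[] discount)
    {
        if (price.Length != discount.Length)
        {
            throw new ArgumentException("Input columns must have the same length.");
        }

        var result = new double[price.Length];
        int width = Vector<double>.Count; // number of doubles per hardware vector
        Vector<double> one = Vector<double>.One;

        // Process full vector-width chunks.
        int i = 0;
        for (; i <= price.Length - width; i += width)
        {
            var p = new Vector<double>(price, i);
            var d = new Vector<double>(discount, i);
            (p * (one - d)).CopyTo(result, i);
        }

        // Handle the remaining elements one at a time.
        for (; i < price.Length; i++)
        {
            result[i] = price[i] * (1.0 - discount[i]);
        }

        return result;
    }
}
```

Vector<double>.Count depends on the hardware, so the loop is written to work for any width, with a scalar tail for leftover elements. Whether this beats a plain loop depends on the batch size and on the cost of getting the data into columnar form in the first place.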

benchmark/csharp/Tpch/VectorizedFunctions.cs

src/csharp/Microsoft.Spark.Experimental/Sql/ExperimentalFunctions.cs

src/csharp/Microsoft.Spark/Sql/ArrowArrayHelpers.cs

stephentoub · 2019-06-03T14:53:03Z

Nice.

src/csharp/Microsoft.Spark.E2ETest/IpcTests/Sql/DataFrameTests.cs

benchmark/csharp/Tpch/VectorizedFunctions.cs

src/csharp/Microsoft.Spark.UnitTest/UdfWrapperTests.cs

src/csharp/Microsoft.Spark/Utils/UdfUtils.cs

src/csharp/Microsoft.Spark/Sql/UDFRegistration.cs

- Remove Vector<T> usage, instead have simple price calculation methods.
- Formatting fix ups
- Change generic type names A1 => T1

eerhardt · 2019-06-04T15:56:38Z

I've responded to all feedback and updated the PR. Please take a look.

benchmark/csharp/Tpch/VectorFunctions.cs

benchmark/csharp/Tpch/TpchFunctionalQueries.cs

src/csharp/Microsoft.Spark.Experimental/Sql/ExperimentalFunctions.cs

src/csharp/Microsoft.Spark/Utils/UdfUtils.cs

suhsteve · 2019-06-06T16:19:57Z

LGTM

imback82

LGTM. Thanks @eerhardt!

Implement VectorUdf and use it in Queries 1 and 8 of TPCH benchmarks.

77416d5

eerhardt requested review from suhsteve, stephentoub, rapoth and imback82 May 31, 2019 23:24