DataFrame performance comparison: Scala vs. Python #215

lintool · 2018-05-02T11:56:33Z

There are three different ways we can run DataFrames:

1. Scala Spark: Scala DataFrames with Scala UDFs.
2. PySpark: Scala DataFrames accessed in Python, with Scala UDFs.
3. PySpark: Scala DataFrames accessed in Python, with Python UDFs.

In theory, (2) should be negligibly slower than (1) due to a bit of Python overhead. However, (3) is expected to be significantly slower. There's also a variant of (3) that uses vectorized Python UDFs, which we should also investigate.

Helpful links:

TitusAn · 2018-05-04T23:16:03Z

I made a very simple test script to actually see the difference between the vectorized and the non-vectorized versions of Python UDFs.

python-udf-vec-vs-non-vec.py

It turns out that the vectorized Python UDF is much faster than the non-vectorized version. To keep cached queries from skewing the result, the test is repeated 10 times, with the vectorized version running first (so any caching advantage goes to the non-vectorized version).

Result (no vectorization): 318. Time: 4.68561577797 (sec)
Result (vectorization): 318. Time: 11.0421266556 (sec)
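
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a repeat-and-time harness, assuming only the Python standard library. It is not the attached script: `time_repeated`, `vectorized_query`, and `plain_query` are hypothetical names, and the two query functions are placeholders standing in for the real Spark jobs.

```python
import time
from statistics import mean


def time_repeated(candidates, repeats=10):
    """Time each (name, callable) pair `repeats` times.

    Candidates run in the order given within every round, so the first
    candidate never benefits from work cached by a later one in that round.
    """
    timings = {name: [] for name, _ in candidates}
    for _ in range(repeats):
        for name, fn in candidates:
            start = time.perf_counter()
            fn()
            timings[name].append(time.perf_counter() - start)
    return timings


# Hypothetical stand-ins for the two UDF queries (assumption: the real
# versions would trigger a Spark action such as count()).
def vectorized_query():
    return sum(range(100_000))


def plain_query():
    total = 0
    for i in range(100_000):
        total += i
    return total


timings = time_repeated(
    [("vectorized", vectorized_query), ("non-vectorized", plain_query)]
)
for name, samples in timings.items():
    print(f"{name}: mean {mean(samples):.6f} s over {len(samples)} runs")
```

Keeping every sample, rather than only a running total, makes it possible to report a spread alongside the mean later on.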

lintool · 2018-05-06T10:46:24Z

@TitusAn so (3) actually breaks down into:

(3a) normal Python UDFs
(3b) vectorized Python UDFs

I'd like a fair comparison to (1) and (2), so as a next step, can you please port the Python ExtractDomain UDF back to Scala so that we can benchmark cases (1) and (2)?

Note that there is already an ExtractDomain UDF in the df package, but it's a wrapper around an rdd UDF, which is a different impl. I'd like to make sure we get a fair apples-to-apples benchmark, so I want to make sure the UDF is doing exactly the same thing, just Scala vs. Python.
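
To make "exactly the same thing" concrete, here is a rough sketch of what the Python side of a domain-extraction function could look like, assuming it only needs to pull the host out of a URL. This is an illustration, not the actual ExtractDomain from the thread: the name `extract_domain` and the empty-string fallback are assumptions.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the host part of `url`, or "" if no host can be parsed.

    Assumption: the real UDF may handle more cases (for example a fallback
    source URL); this sketch covers only the basic host lookup.
    """
    if not url:
        return ""
    try:
        host = urlparse(url).hostname
    except ValueError:
        # urlparse raises ValueError on some malformed inputs,
        # such as an unbalanced IPv6 bracket.
        host = None
    return host or ""


print(extract_domain("http://www.example.com/a/b?c=d"))
```

A Scala version intended for the benchmark would need to make the same choices about malformed input and missing hosts, otherwise the two UDFs are not doing the same amount of work.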

TitusAn · 2018-05-12T22:47:52Z

Five tests are conducted:

Scala program calls Scala UDF via function (SSF): 5277.5 ms
Scala program calls Scala UDF via SQL (SSS): 5525 ms
Python program calls Scala UDF via function (PSF): 5650.1 ms
Python program calls Scala UDF via SQL (PSS): 5798.6 ms
Python program calls Python UDF via function (PPF): 7946 ms

From the graph, Scala UDFs are always faster than the equivalent Python version, no matter where they are called from. Calling a Scala UDF from Python does incur some overhead from crossing the language boundary: about five to seven percent on the timings above. Calling a registered UDF in a SQL expression is also slower, in both Python and Scala, than invoking the UDF directly with the DataFrame 'select' method. This difference is possibly due to the time it takes to parse and evaluate SQL expressions before they can be acted upon.
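
The relative differences can be worked out directly from the five timings listed above. The snippet below is a small sketch of that arithmetic; the helper name `pct_slower` is an assumption, and the numbers are copied from the list of tests.

```python
# Timings in milliseconds, copied from the five tests above.
TIMINGS_MS = {
    "SSF": 5277.5,  # Scala program, Scala UDF, via function
    "SSS": 5525.0,  # Scala program, Scala UDF, via SQL
    "PSF": 5650.1,  # Python program, Scala UDF, via function
    "PSS": 5798.6,  # Python program, Scala UDF, via SQL
    "PPF": 7946.0,  # Python program, Python UDF, via function
}


def pct_slower(slow_ms, fast_ms):
    """Return how much slower `slow_ms` is than `fast_ms`, in percent."""
    return (slow_ms - fast_ms) / fast_ms * 100


comparisons = [
    ("PSF", "SSF"),  # language-boundary overhead, function call
    ("PSS", "SSS"),  # language-boundary overhead, SQL call
    ("SSS", "SSF"),  # SQL overhead in Scala
    ("PSS", "PSF"),  # SQL overhead in Python
    ("PPF", "PSF"),  # Python UDF vs. Scala UDF, both called from Python
]
for slow, fast in comparisons:
    pct = pct_slower(TIMINGS_MS[slow], TIMINGS_MS[fast])
    print(f"{slow} vs {fast}: {pct:.1f}% slower")
```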

A detailed reading can be found here:
udf_performance_doc.pdf

lintool · 2018-05-13T01:49:09Z

@TitusAn This is super awesome! Let's try to run experiments on a larger collection to see if the results hold up.

BTW, a bar chart for the above would be more appropriate; you can add 95% confidence intervals to the bars.
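
As a rough sketch of how the interval for each bar could be computed from repeated runs, assuming only the standard library: the function name `confidence_interval` and the sample timings are made up for illustration, and the normal approximation used here is an assumption. With only a handful of runs, a t-distribution would give a somewhat wider and more appropriate interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, level=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    half_width = z * stdev(samples) / sqrt(len(samples))
    return mean(samples), half_width


# Hypothetical per-run timings in milliseconds (NOT measured data).
example_runs_ms = [5200.0, 5310.0, 5250.0, 5400.0, 5230.0]
centre, half_width = confidence_interval(example_runs_ms)
print(f"{centre:.1f} ms +/- {half_width:.1f} ms")
```

The half-width is the value a plotting library would take as the error-bar length for each bar.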

So, the AUT "best practices" seem to be shaping up to be as follows:

Deprecate RDDs and move to DF.
Write all "production" and commonly-used UDFs in Scala.
Run production jobs in Scala using DF.
For Jupyter integration and interactive exploration, Python DF calling Scala UDFs is workable, but with a noticeable performance hit.
In a crunch, write UDFs in Python to work with Python DF, but it's really going to be slow.

ruebot · 2019-07-17T17:35:09Z

@lintool is this issue still relevant? Or shall I close it?

lintool · 2019-07-27T16:04:24Z

Yup, let's close.

ruebot mentioned this issue May 2, 2018

PySpark performance bottlenecks: counting values #130

Closed

lintool mentioned this issue May 2, 2018

Register Scala functions for use in Pyspark #148

Closed

ruebot added the discussion label Aug 20, 2018

lintool closed this as completed Jul 27, 2019
