Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame performance comparison: Scala vs. Python #215

Closed
lintool opened this issue May 2, 2018 · 6 comments
Closed

DataFrame performance comparison: Scala vs. Python #215

lintool opened this issue May 2, 2018 · 6 comments

Comments

@lintool
Copy link
Member

lintool commented May 2, 2018

There are three different ways we can run DataFrames:

  1. Scala Spark: Scala DataFrames with Scala UDFs.
  2. PySpark: Scala DataFrames accessed in Python, with Scala UDFs.
  3. PySpark: Scala DataFrames accessed in Python, with Python UDFs.

In theory, (2) should be negligibly slower than (1) due to a bit of Python overhead. However, (3) is expected to be significantly slower. There's also a variant of (3) the uses vectorized Python UDFs, which we should investigate also.

Helpful links:

@TitusAn
Copy link
Contributor

TitusAn commented May 4, 2018

I made a very simple test script to actually see the difference between the vectorized and the non-vectorized versions of Python UDFs.

python-udf-vec-vs-non-vec.py

..it turns out that the vectorized Python UDF is much faster than the non-vectorized version. To prevent cached queries from interfering with the result, the test is repeated 10 times, with the vectorized version being the first one to run (so the invocation of non-vectorized version already has an advantage, if any).

Result (no vectorization): 318. Time: 4.68561577797 (sec)
Result (vectorization): 318. Time: 11.0421266556 (sec)

@lintool
Copy link
Member Author

lintool commented May 6, 2018

@TitusAn so (3) actually breaks down into:

  • (3a) normal Python UDFs
  • (3b) vectorized Python UDFs

I'd like a fair comparison to (1) and (2) - so next step, can you please backport the Python ExtractDomain UDF back to Scala so that we can benchmark cases (1) and (2)?

Note that there is already an ExtractDomain UDF in the df package, but its a wrapper around an rdd UDF, which is a different impl. I'd like to make sure we get a fair apples-to-apples benchmark, so I want to make sure the UDF is doing exactly the same thing, just Scala vs. Python.

@TitusAn
Copy link
Contributor

TitusAn commented May 12, 2018

Five tests are conducted:

Scala program calls Scala UDF via function (SSF): 5277.5 ms
Scala program calls Scala UDF via SQL (SSS): 5525 ms
Python program calls Scala UDF via function (PSF): 5650.1 ms
Python program calls Scala UDF via SQL (PSS): 5798.6 ms
Python program calls Python UDF via Function (PPF): 7946 ms

udf_performance

From the graph, it can be shown that Scala UDFs, no matter where they were called, are always the fastest implementation comparing to the equivalent version in Python. It is also found that calling Scala UDF from Python does suffer from overhead of crossing language boundary, and this overhead is around ten to twenty percent. Also, calling registered UDFs in a SQL expression is slower both in Python and Scala, comparing to directly invoking UDFs in Scala or Python scripts, with 'select' method of data frame class. This difference is possibly due to the time it takes to parse and evaluate SQL expressions before they can be acted upon.

A detailed reading can be found here:
udf_performance_doc.pdf

@lintool
Copy link
Member Author

lintool commented May 13, 2018

@TitusAn This is super awesome! Let's try to run experiments on a larger collection to see if the results hold up.

BTW, a bar chart for the above would be more appropriate; you can add 95% confidence intervals to the bars.

So, the AUT "best practices" seem to be shaping up to be as follows:

  • Deprecate RDDs and move to DF.
  • Write all "production" and commonly-used UDFs in Scala.
  • Run production jobs in Scala using DF.
  • For Jupyter integration and interactive exploration, Python DF calling Scala UDFs is workable, but with a noticeable performance hit.
  • In a crunch, write UDFs in Python to work with Python DF, but it's really going to be slow.

@ruebot
Copy link
Member

ruebot commented Jul 17, 2019

@lintool is this issue still relevant? Or shall I close it?

@lintool
Copy link
Member Author

lintool commented Jul 27, 2019

Yup, let's close.

@lintool lintool closed this as completed Jul 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants
@ruebot @lintool @TitusAn and others