-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrame performance comparison: Scala vs. Python #215
Comments
I made a very simple test script to actually see the difference between the vectorized and the non-vectorized versions of Python UDFs. ..it turns out that the vectorized Python UDF is much faster than the non-vectorized version. To prevent cached queries from interfering with the result, the test is repeated 10 times, with the vectorized version being the first one to run (so the invocation of non-vectorized version already has an advantage, if any). Result (no vectorization): 318. Time: 4.68561577797 (sec) |
@TitusAn so (3) actually breaks down into:
I'd like a fair comparison to (1) and (2) - so next step, can you please backport the Python Note that there is already an |
Five tests are conducted: Scala program calls Scala UDF via function (SSF): 5277.5 ms From the graph, it can be shown that Scala UDFs, no matter where they were called, are always the fastest implementation comparing to the equivalent version in Python. It is also found that calling Scala UDF from Python does suffer from overhead of crossing language boundary, and this overhead is around ten to twenty percent. Also, calling registered UDFs in a SQL expression is slower both in Python and Scala, comparing to directly invoking UDFs in Scala or Python scripts, with 'select' method of data frame class. This difference is possibly due to the time it takes to parse and evaluate SQL expressions before they can be acted upon. A detailed reading can be found here: |
@TitusAn This is super awesome! Let's try to run experiments on a larger collection to see if the results hold up. BTW, a bar chart for the above would be more appropriate; you can add 95% confidence intervals to the bars. So, the AUT "best practices" seem to be shaping up to be as follows:
|
@lintool is this issue still relevant? Or shall I close it? |
Yup, let's close. |
There are three different ways we can run DataFrames:
In theory, (2) should be negligibly slower than (1) due to a bit of Python overhead. However, (3) is expected to be significantly slower. There's also a variant of (3) the uses vectorized Python UDFs, which we should investigate also.
Helpful links:
The text was updated successfully, but these errors were encountered: