Benchmarking Scala vs Python #121
@ianmilligan1 @greebie can you walk me through your plan for this testing? Bulleted or numbered instructions are good. I see 6 things in the script above. These are the Scala scripts, I presume. Then we're doing the same again with the Python equivalents?
My understanding from my conversations with Ryan is the following:
Then
Ian has it right. The only thing I would add is that we want to do it for a small, a medium, and perhaps a large collection, to see how the scripts scale. (The large run may take too long and use up too many resources for what we have now.)
These three are probably the best to test for small, medium, and large: 2G, 90G, and 828G. If you can provide me some sample output of the Scala script above, and the Python script when you're ready, that'd be super helpful.
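To make the small/medium/large comparison concrete, here is a minimal timing sketch. It is not the script used in this thread: `time_job`, `summarize`, and the three placeholder workloads are hypothetical names for illustration, and the lambdas only stand in for the real jobs against the 2G, 90G, and 828G collections.

```python
import statistics
import time


def time_job(job, repeats=3):
    """Run job() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label, durations):
    """Collapse repeated timings into one comparable row per workload."""
    return {
        "label": label,
        "runs": len(durations),
        "min_s": min(durations),
        "median_s": statistics.median(durations),
    }


# Placeholder workloads (assumption): swap in the real Scala/Python job launchers.
workloads = {
    "small": lambda: sum(range(10_000)),
    "medium": lambda: sum(range(100_000)),
    "large": lambda: sum(range(1_000_000)),
}

results = [summarize(name, time_job(job)) for name, job in workloads.items()]
for row in results:
    print(row)
```

Reporting both the minimum and the median helps separate the cost of the job itself from noise such as a cold cache on the first run.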
Results for YMM Fires and Example. All other aspects of the script check out (same output in each case).
@greebie I was looking for the raw output of the script.
Output here: https://gist.github.com/greebie/e93dae5ba0869de43ef1d635c5bad0ce If you are looking for a --verbose output let me know. I am running a big job right now unfortunately, but will get one out ASAP.
Python instructions are at https://gist.github.com/ianmilligan1/1436be06a5d2293bf3b6447493c962c3.
Encountered this lovely problem: https://www.thoughtvector.io/blog/python-3-on-spark-return-of-the-pythonhashseed/ I am going to try and avoid using dictionary-like processes for now. Solution is to
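For anyone following along, the underlying behaviour can be reproduced without Spark. The sketch below is only an illustration, not code from this thread: `string_hash_in_fresh_interpreter` is a hypothetical helper, and it relies on the fact that Python 3 salts string hashes per process unless the `PYTHONHASHSEED` environment variable is fixed.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new Python process started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate processes with the same seed agree on the hash of a string...
first = string_hash_in_fresh_interpreter("warc", 0)
second = string_hash_in_fresh_interpreter("warc", 0)
print(first == second)

# ...while a process started with a different seed will, in practice, disagree.
other = string_hash_in_fresh_interpreter("warc", 1)
print(first == other)
```

Each Python worker is its own process, so anything that depends on string hashes matching across workers needs the seed to be the same in every one of them.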
Would appreciate a review of the following code to test against the above code before I start running the scripts at scale. Thanks!
The script produces:
First test samples updated above. Tests 2 and 3 seem to be taking longer than expected. Not sure if this is a design issue or a scale issue. Will need to look at these at scale -- will upload soon.
This all looks good to me, Ryan, seems consistent with the Scala ones. Looking forward to your thoughts!
Did the Python version include pyarrow and pandas_udf? I have seen many articles saying that PySpark gets a performance boost with that setup.
We are comparing the timings of Scala vs Python to help us make informed decisions on the migration.