-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Fixing the code formatting of the tests #2123
[DataFrame] Fixing the code formatting of the tests #2123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
We should also consider turning on yapf for DataFrames (in a separate PR, see https://github.com/ray-project/ray/blob/master/.travis/yapf.sh#L16) |
Test FAILed. |
Jenkins, retest this please. |
Test PASSed. |
aeffe55
to
9b022cc
Compare
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Left comments on minor stylistic changes.
Overall, I think we have a bit of a problem in terms of stylistic consistency. For example, df
can refer to Pandas or Ray dataframes. Sometimes we use pd_df
(which is strange now that we import ray.dataframe as pd
), other times ray_df
, and sometimes expected
.
Also, sometimes we use single quotes or double quotes. If not in this one, we might want to fix this in a future PR.
I also think that we might want to move all tests that accept dataframes as arguments to a specific section of the file or even another file.
from pandas.tests.frame.common import TestData | ||
import ray.dataframe as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many other files, we still use import pandas as pd
. We'll have to update those quickly to maintain a consistent code base. Maybe create an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an open PR: #1942
@pytest.fixture | ||
def test_get_ftype_counts(ray_df, pandas_df): | ||
assert(ray_df.get_ftype_counts().equals(pandas_df.get_ftype_counts())) | ||
return to_pandas(ray_df1).equals(to_pandas(ray_df2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to convert to pandas? Ray dataframes implement equals
, so shouldn't ray_df1.equals(ray_df2)
be equivalent and more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equals
doesn't work properly, we have to use this for now.
'col4': [12, 13, 14, 15], | ||
'col5': [0, 0, 0, 0]}) | ||
ray_df = from_pandas(pandas_df, 2) | ||
pandas_df = pandas.DataFrame({'col1': [0, 1, 2, 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could be refactored to use something like create_test_dataframe()
'col3': [8.0, 9.0, 10.0, 11.0], | ||
'col4': [12.0, 13.0, 14.0, 15.0], | ||
'col5': [0.0, 0.0, 0.0, 0.0]}) | ||
pandas_df = pandas.DataFrame({'col1': [0.0, 1.0, 2.0, 3.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too, but cast to float
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of the robustness effort, it's probably not a good idea to use the same dataframe across all the tests. Ideally, the test dataframes will be changed.
'col5': [0, 0, 0, 0]}) | ||
|
||
return from_pandas(df, 2) | ||
return pd.DataFrame({'col1': [0, 1, 2, 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates an int
dataframe. It's possible that tests which use this function won't catch bugs for float
, datetime
, and other objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we need to do a rewrite of the tests, this is a known issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see an issue for this. Would you like to make one?
Anyway, it doesn't hurt to make the current tests more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a good design doc on testing as well (robust and future proof, should be self-contained but does not need to be as exhaustive as pandas test, a good pytest structure that enable tests to run in parallel...). Can you draft one? In the meantime, I don't think we should refactor the tests. This is just a code formatting PR after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the issue at: #2134
assert ray_df.memory_usage(index=True).at['Index'] is not None | ||
assert ray_df.memory_usage(deep=True).sum() >= \ | ||
ray_df.memory_usage(deep=False).sum() | ||
|
||
|
||
def test_merge(): | ||
ray_df = rdf.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7], | ||
"col3": [8, 9, 0, 1], "col4": [2, 4, 5, 6]}) | ||
ray_df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
|
||
ray_df2 = rdf.DataFrame({"col1": [0, 1, 2], "col2": [1, 5, 6]}) | ||
ray_df2 = pd.DataFrame({"col1": [0, 1, 2], "col2": [1, 5, 6]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
[1.5, np.nan, 3.], | ||
[1.5, np.nan, 3], | ||
[1.5, np.nan, 3]]) | ||
df = pd.DataFrame([[1.5, np.nan, 3.], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think changing this one is necessary. It's used once.
pd_df = pd.DataFrame({'year': [2015, 2016], | ||
'month': [2, 3], | ||
'day': [4, 5]}) | ||
ray_df = pd.DataFrame({'year': [2015, 2016], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
'C': [1, 2, 3]}) | ||
|
||
assert ray_df_equals_pandas(rdf.get_dummies(ray_df), pd.get_dummies(pd_df)) | ||
ray_df = pd.DataFrame({'A': ['a', 'b', 'a'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
I think it makes sense to fix the issues that this PR directly touches, given this is just a formatting fix, this probably isn't the right place to try to do a full refactor of the test code. I particularly disagree with the comments on code that this PR does not touch, they should be fixed at some point. But this is not the right place. I'll create an issue for tracking purposes. But, at some point we should refactor all of the test code together to make it collectively more robust. |
7fefadd
to
de8f8d7
Compare
Test PASSed. |
Test PASSed. |
Merged. Thanks @devin-petersohn! |
* master: [autoscaler] GCP node provider (ray-project#2061) [xray] Evict tasks from the lineage cache (ray-project#2152) [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166) Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169) [rllib] Fix A3C PyTorch implementation (ray-project#2036) [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151) Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155) Implement Python global state API for xray. (ray-project#2125) [xray] Improve flush algorithm for the lineage cache (ray-project#2130) Fix support for actor classmethods (ray-project#2146) Add empty df test (ray-project#1879) [JavaWorker] Enable java worker support (ray-project#2094) [DataFrame] Fixing the code formatting of the tests (ray-project#2123) Update resource documentation (remove outdated limitations). (ray-project#2022) bugfix: use array redis_primary_addr out of its scope (ray-project#2139) Fix infinite retry in Push function. (ray-project#2133) [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093) Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
There were a significant number of unnecessary extra lines, formatting errors/inconsistencies, etc.
The content of the tests did not change (with the exception of
quantile
getting additional percentiles to test). Rather the format of the tests are now such that it the tests are in a more maintainable order.Also for the tests only
import pandas as pd
toimport pandas
andimport ray.dataframe as rdf
toimport ray.dataframe as pd