Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Fixing the code formatting of the tests #2123

Merged

Conversation

devin-petersohn
Copy link
Member

There were a significant number of unnecessary extra lines, formatting errors/inconsistencies, etc.

The content of the tests did not change (with the exception of quantile getting additional percentiles to test). Rather the format of the tests are now such that it the tests are in a more maintainable order.

Also for the tests only import pandas as pd to import pandas and import ray.dataframe as rdf to import ray.dataframe as pd

Copy link
Contributor

@kunalgosar kunalgosar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@pcmoritz
Copy link
Contributor

We should also consider turning on yapf for DataFrames (in a separate PR, see https://github.com/ray-project/ray/blob/master/.travis/yapf.sh#L16)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5580/
Test FAILed.

@devin-petersohn
Copy link
Member Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5604/
Test PASSed.

@devin-petersohn devin-petersohn force-pushed the df_fix_test_formatting branch from aeffe55 to 9b022cc Compare May 24, 2018 14:40
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5620/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5621/
Test PASSed.

Copy link
Contributor

@pschafhalter pschafhalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Left comments on minor stylistic changes.

Overall, I think we have a bit of a problem in terms of stylistic consistency. For example, df can refer to Pandas or Ray dataframes. Sometimes we use pd_df (which is strange now that we import ray.dataframe as pd), other times ray_df, and sometimes expected.

Also, sometimes we use single quotes or double quotes. If not in this one, we might want to fix this in a future PR.

I also think that we might want to move all tests that accept dataframes as arguments to a specific section of the file or even another file.

from pandas.tests.frame.common import TestData
import ray.dataframe as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many other files, we still use import pandas as pd. We'll have to update those quickly to maintain a consistent code base. Maybe create an issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an open PR: #1942

@pytest.fixture
def test_get_ftype_counts(ray_df, pandas_df):
assert(ray_df.get_ftype_counts().equals(pandas_df.get_ftype_counts()))
return to_pandas(ray_df1).equals(to_pandas(ray_df2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to convert to pandas? Ray dataframes implement equals, so shouldn't ray_df1.equals(ray_df2) be equivalent and more efficient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equals doesn't work properly, we have to use this for now.

'col4': [12, 13, 14, 15],
'col5': [0, 0, 0, 0]})
ray_df = from_pandas(pandas_df, 2)
pandas_df = pandas.DataFrame({'col1': [0, 1, 2, 3],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could be refactored to use something like create_test_dataframe()

'col3': [8.0, 9.0, 10.0, 11.0],
'col4': [12.0, 13.0, 14.0, 15.0],
'col5': [0.0, 0.0, 0.0, 0.0]})
pandas_df = pandas.DataFrame({'col1': [0.0, 1.0, 2.0, 3.0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too, but cast to float.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of the robustness effort, it's probably not a good idea to use the same dataframe across all the tests. Ideally, the test dataframes will be changed.

'col5': [0, 0, 0, 0]})

return from_pandas(df, 2)
return pd.DataFrame({'col1': [0, 1, 2, 3],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates an int dataframe. It's possible that tests which use this function won't catch bugs for float, datetime, and other objects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we need to do a rewrite of the tests, this is a known issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an issue for this. Would you like to make one?

Anyway, it doesn't hurt to make the current tests more robust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a good design doc on testing as well (robust and future proof, should be self-contained but does not need to be as exhaustive as pandas test, a good pytest structure that enable tests to run in parallel...). Can you draft one? In the meantime, I don't think we should refactor the tests. This is just a code formatting PR after all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the issue at: #2134

assert ray_df.memory_usage(index=True).at['Index'] is not None
assert ray_df.memory_usage(deep=True).sum() >= \
ray_df.memory_usage(deep=False).sum()


def test_merge():
ray_df = rdf.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7],
"col3": [8, 9, 0, 1], "col4": [2, 4, 5, 6]})
ray_df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [4, 5, 6, 7],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.


ray_df2 = rdf.DataFrame({"col1": [0, 1, 2], "col2": [1, 5, 6]})
ray_df2 = pd.DataFrame({"col1": [0, 1, 2], "col2": [1, 5, 6]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

[1.5, np.nan, 3.],
[1.5, np.nan, 3],
[1.5, np.nan, 3]])
df = pd.DataFrame([[1.5, np.nan, 3.],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think changing this one is necessary. It's used once.

pd_df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]})
ray_df = pd.DataFrame({'year': [2015, 2016],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

'C': [1, 2, 3]})

assert ray_df_equals_pandas(rdf.get_dummies(ray_df), pd.get_dummies(pd_df))
ray_df = pd.DataFrame({'A': ['a', 'b', 'a'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@kunalgosar
Copy link
Contributor

I think it makes sense to fix the issues that this PR directly touches, given this is just a formatting fix, this probably isn't the right place to try to do a full refactor of the test code. I particularly disagree with the comments on code that this PR does not touch, they should be fixed at some point. But this is not the right place.

I'll create an issue for tracking purposes. But, at some point we should refactor all of the test code together to make it collectively more robust.

@devin-petersohn devin-petersohn force-pushed the df_fix_test_formatting branch from 7fefadd to de8f8d7 Compare May 24, 2018 23:38
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5625/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5626/
Test PASSed.

@pschafhalter pschafhalter merged commit 74cca3b into ray-project:master May 26, 2018
@pschafhalter
Copy link
Contributor

Merged. Thanks @devin-petersohn!

alok added a commit to alok/ray that referenced this pull request Jun 3, 2018
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants