[DataFrame] Implement to_csv #2014

Merged (15 commits) on May 17, 2018

Conversation

pschafhalter
Contributor

What do these changes do?

  • Implement to_csv
    • Includes an implementation of the CSV formatter.
  • Fix bugs in read_csv
  • Fix bug in _IndexMetadata
  • Add imports to the dataframe package.

The CSV formatter code is probably the part most prone to errors. The reference Pandas implementation is available at https://github.com/pandas-dev/pandas/blob/0.22.x/pandas/io/formats/format.py.

Also, I borrowed much of the CSV formatter code from Pandas. I'm not sure what the best practice is here: should I add a reference to the Pandas BSD 3 license in the Ray license file, like RLlib did? Alternatively, I could reimplement the CSV formatter code, but that would take several days.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5254/

Member

@devin-petersohn devin-petersohn left a comment

Looks pretty good. Left a few comments.


from pandas.io.common import (_get_handle, UnicodeWriter, _expand_user,
                              _stringify_path)
# from pandas._libs import writers as libwriters
Member

Please remove dead code.

from pandas.core.indexes.period import PeriodIndex


class CSVFormatter(object):
Member

What is the motivation for writing our own CSVFormatter?

Contributor Author

The initial motivation was to write partitions in parallel. I'll see if I can use the Pandas CSVFormatter to write to different offsets of the same file without re-implementing it.
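
For reference, the "different offsets of the same file" idea could look roughly like the sketch below. It is not the code in this PR: it uses the public `DataFrame.to_csv` rather than `CSVFormatter`, threads rather than Ray tasks, and the helper name `write_partitions_at_offsets` is hypothetical. It assumes the partitions are row blocks that share the same columns.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def write_partitions_at_offsets(partitions, path):
    """Write row partitions into one CSV file, each at its own byte offset.

    Hypothetical helper for illustration only; not part of Ray or Pandas.
    """
    # Serialize each partition independently; only the first carries the header.
    chunks = [
        part.to_csv(index=False, header=(i == 0)).encode("utf-8")
        for i, part in enumerate(partitions)
    ]

    # Prefix sums of the chunk sizes give the byte offset of each partition.
    offsets = [0]
    for chunk in chunks[:-1]:
        offsets.append(offsets[-1] + len(chunk))

    # Pre-size the file so that every writer can seek to its own region.
    with open(path, "wb") as f:
        f.truncate(sum(len(chunk) for chunk in chunks))

    def write_chunk(offset, chunk):
        # Each writer opens its own handle, so the regions never overlap.
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(chunk)

    with ThreadPoolExecutor() as pool:
        # list() forces completion and re-raises any exception from a writer.
        list(pool.map(write_chunk, offsets, chunks))
```

The catch is that each offset depends on the serialized size of every earlier partition, so the formatting step has to finish (or at least report its length) before any writer can start.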

self._save()

finally:
# GH 17778 handles compression for byte strings.
Member

This looks like a Pandas artifact.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5295/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5308/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5312/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5321/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5359/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5391/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5435/

bug fixes

Unify dtypes on DataFrame creation

Formatting and comments

Cache dtypes

Fix bug in _merge_dtypes

Fix bug

Changed caching logic

Fix dtypes issue in read_csv

Invalidate dtypes cache when inserting column

Simplify unifying dtypes and improve caching

Fix typo

Better caching of dtypes

Fix merge conflicts

Implemented some to_csv functions

Support read_csv from buffers

Expose date_range, NaT, Timedelta from pandas

Add testing utils

Redirect imports to Pandas

Fix imports

Fix read_csv when index_col is specified

Update imports from Pandas

Fix bugs

Use util API

Fix nasty bug

Add missing import

Don't distribute reading of compressed files

Add test utilities for Pandas tests

Add test for to_csv

Add warnings

Fix rebase artifacts
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5450/

@pschafhalter
Contributor Author

@devin-petersohn All tests passed on private Travis.

@devin-petersohn devin-petersohn merged commit ae17ebd into ray-project:master May 17, 2018
@devin-petersohn
Member

Merged, thanks @pschafhalter!

@pschafhalter pschafhalter deleted the df-to-csv branch May 17, 2018 21:39
alok added a commit to alok/ray that referenced this pull request May 18, 2018
* master: (22 commits)
  [xray] Fix bug in updating actor execution dependencies (ray-project#2064)
  [DataFrame] Refactor __delitem__ (ray-project#2080)
  [xray] Better error messaging when pulling from self. (ray-project#2068)
  Use source code in hash where possible (fix ray-project#2089) (ray-project#2090)
  Functions for flushing done tasks and evicted objects. (ray-project#2033)
  Fix compilation error for RAY_USE_NEW_GCS with latest clang. (ray-project#2086)
  [xray] Corrects Error Handling During Push and Pull. (ray-project#2059)
  [xray] Sophisticated task dependency management (ray-project#2035)
  Support calling positional arguments by keyword (fix ray-project#998) (ray-project#2081)
  [DataFrame] Improve performance of iteration methods (ray-project#2026)
  [DataFrame] Implement to_csv (ray-project#2014)
  [xray] Lineage cache only requests notifications about remote parent tasks (ray-project#2066)
  [rllib] Add magic methods for rollouts (ray-project#2024)
  [DataFrame] Allows DataFrame constructor to take in another DataFrame (ray-project#2072)
  Pin Pandas version for Travis to 0.22 (ray-project#2075)
  Fix python linting (ray-project#2076)
  [xray] Fix GCS table prefixes (ray-project#2065)
  Some tests for _submit API. (ray-project#2062)
  [rllib] Queue lib for python 2.7 (ray-project#2057)
  [autoscaler] Remove faulty assert that breaks during downscaling, pull configs from env (ray-project#2006)
  ...
3 participants