[DataFrame] Implement to_csv #2014
Test PASSed.
Looks pretty good. Left a few comments.
from pandas.io.common import (_get_handle, UnicodeWriter, _expand_user,
                              _stringify_path)
# from pandas._libs import writers as libwriters
Please remove dead code.
from pandas.core.indexes.period import PeriodIndex


class CSVFormatter(object):
What is the motivation for writing our own CSVFormatter?
The initial motivation was to write partitions in parallel. I'll see whether I can use the Pandas CSVFormatter to write to different offsets of the same file without reimplementing it.
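For reference, a minimal sketch of the offset-based idea, not the code in this PR. It assumes each partition is a plain Pandas DataFrame holding a contiguous block of rows, and it uses a thread pool as a stand-in for Ray tasks. The helper names (partition_to_bytes, write_at, to_csv_by_offsets) are hypothetical. Each partition is formatted with the stock Pandas writer, the byte lengths give the offsets, and every chunk is then written at its own offset in a pre-sized file.

```python
import io
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def partition_to_bytes(part, header):
    """Format one row partition as CSV bytes with the stock Pandas writer."""
    buf = io.StringIO()
    part.to_csv(buf, header=header, index=False)
    return buf.getvalue().encode("utf-8")


def write_at(path, offset, data):
    """Write data at a fixed byte offset; each call opens its own handle."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def to_csv_by_offsets(partitions, path):
    n = len(partitions)
    with ThreadPoolExecutor() as pool:
        # Only the first partition emits the header row.
        headers = [i == 0 for i in range(n)]
        chunks = list(pool.map(partition_to_bytes, partitions, headers))

        # Offset of chunk i is the total length of chunks 0..i-1.
        offsets = [0]
        for chunk in chunks[:-1]:
            offsets.append(offsets[-1] + len(chunk))

        # Pre-size the file so every offset is valid before writing starts.
        with open(path, "wb") as f:
            f.truncate(sum(len(chunk) for chunk in chunks))

        # The byte ranges do not overlap, so write order does not matter.
        list(pool.map(write_at, [path] * n, offsets, chunks))


# Usage: split a small frame into two row partitions and stitch the file.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": ["v", "w", "x", "y", "z"]})
partitions = [df.iloc[:2], df.iloc[2:]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    to_csv_by_offsets(partitions, path)
    with open(path, "rb") as f:
        written = f.read()

# The stitched file should match a single serial to_csv call.
expected = df.to_csv(index=False).encode("utf-8")
```

The offsets can only be computed after every chunk has been formatted, so in this sketch only the formatting and the file writes run concurrently. Compressed output would not work this way, since the compressed size of each chunk is not known up front.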
    self._save()

finally:
    # GH 17778 handles compression for byte strings.
This looks like a Pandas artifact.
Test FAILed.
Test PASSed.
Test PASSed.
Test PASSed.
Test FAILed.
Test PASSed.
Test PASSed.
bug fixes
Unify dtypes on DataFrame creation
Formatting and comments
Cache dtypes
Fix bug in _merge_dtypes
Fix bug
Changed caching logic
Fix dtypes issue in read_csv
Invalidate dtypes cache when inserting column
Simplify unifying dtypes and improve caching
Fix typo
Better caching of dtypes
Fix merge conflicts
Implemented some to_csv functions
Support read_csv from buffers
Expose date_range, NaT, Timedelta from pandas
Add testing utils
Redirect imports to Pandas
Fix imports
Fix read_csv when index_col is specified
Update imports from Pandas
Fix bugs
Use util API
Fix nasty bug
Add missing import
Don't distribute reading of compressed files
Add test utilities for Pandas tests
Add test for to_csv
Add warnings
Fix rebase artifacts
Remove testing imports
Test PASSed.
@devin-petersohn All tests passed on private Travis.
Merged, thanks @pschafhalter!
* master: (22 commits)
  [xray] Fix bug in updating actor execution dependencies (ray-project#2064)
  [DataFrame] Refactor __delitem__ (ray-project#2080)
  [xray] Better error messaging when pulling from self. (ray-project#2068)
  Use source code in hash where possible (fix ray-project#2089) (ray-project#2090)
  Functions for flushing done tasks and evicted objects. (ray-project#2033)
  Fix compilation error for RAY_USE_NEW_GCS with latest clang. (ray-project#2086)
  [xray] Corrects Error Handling During Push and Pull. (ray-project#2059)
  [xray] Sophisticated task dependency management (ray-project#2035)
  Support calling positional arguments by keyword (fix ray-project#998) (ray-project#2081)
  [DataFrame] Improve performance of iteration methods (ray-project#2026)
  [DataFrame] Implement to_csv (ray-project#2014)
  [xray] Lineage cache only requests notifications about remote parent tasks (ray-project#2066)
  [rllib] Add magic methods for rollouts (ray-project#2024)
  [DataFrame] Allows DataFrame constructor to take in another DataFrame (ray-project#2072)
  Pin Pandas version for Travis to 0.22 (ray-project#2075)
  Fix python linting (ray-project#2076)
  [xray] Fix GCS table prefixes (ray-project#2065)
  Some tests for _submit API. (ray-project#2062)
  [rllib] Queue lib for python 2.7 (ray-project#2057)
  [autoscaler] Remove faulty assert that breaks during downscaling, pull configs from env (ray-project#2006)
  ...
What do these changes do?
to_csv
read_csv
_IndexMetadata
The CSV formatter code is probably the most error-prone part of this change. The reference Pandas implementation is available at https://github.com/pandas-dev/pandas/blob/0.22.x/pandas/io/formats/format.py.
Also, I borrowed much of the CSV formatter code from Pandas. I'm not sure what the best practice is here: should I add a reference to the Pandas BSD 3 license in the Ray license file, like RLlib did? Alternatively, I could reimplement the CSV formatter code, but that would take several days.