ARROW-2689: [Python] Remove parameter timestamps_to_ms #2129
Codecov Report
@@ Coverage Diff @@
## master #2129 +/- ##
==========================================
- Coverage 86.39% 86.37% -0.03%
==========================================
Files 242 230 -12
Lines 41481 40589 -892
==========================================
- Hits 35838 35059 -779
+ Misses 5643 5530 -113
Continue to review full report at Codecov.
So we should run Column.cast (https://arrow.apache.org/docs/python/generated/pyarrow.Column.html#pyarrow.Column.cast) after converting a data frame to Arrow. 👍
@domoritz could you elaborate on your use case a bit more?
+1
I'm trying to convert some data from pandas to Arrow, but pandas' timestamps are in ns. I want to reduce the data size and use lower precision. My code looks roughly like this:
df = pd.read_csv('flights.csv', encoding='utf-8', dtype={'FL_DATE': 'str', 'ARR_TIME': 'str', 'DEP_TIME': 'str'})
arr_time = df.FL_DATE + df.ARR_TIME.replace('2400', '0000')
df['ARRIVAL'] = pd.to_datetime(arr_time, format='%Y%m%d%H%M')
dep_time = df.FL_DATE + df.DEP_TIME.replace('2400', '0000')
df['DEPARTURE'] = pd.to_datetime(dep_time, format='%Y%m%d%H%M')
df = df.astype({'DEP_DELAY': 'int16', 'ARR_DELAY': 'int16', 'AIR_TIME': 'int16', 'DISTANCE': 'int16'})
table = pa.Table.from_pandas(df)
table.column('ARRIVAL').cast(pa.TimestampValue, True)
writer = pa.RecordBatchFileWriter(f'{name}.arrow', table.schema)
writer.write(table)
writer.close()
Okay. In this line:
table.column('ARRIVAL').cast(pa.TimestampValue, True)
Are you trying to cast that column to a different timestamp unit? This line of code leaves table unmodified. It would be a good idea to add a documentation section about type casting and how to change the column type of a table; I don't think we have that right now. We could also add some convenience APIs to help with common workflows (e.g. replacing a single column).
Yes, I am trying to switch from ns to ms precision. I guess I have to write something like
Okay, let's create a JIRA about this and discuss there. Firstly, the statement table.column('ARRIVAL').cast(pa.TimestampValue, True) should pass a DataType instance such as pa.timestamp('ms'), optionally with safe=False, depending on whether you want to allow unsafe casts (see http://arrow.apache.org/docs/python/generated/pyarrow.lib.Array.html#pyarrow.lib.Array.cast). I think the docstring could be improved to make it clearer that a DataType instance is expected rather than a class object. Secondly, we don't have a convenient function for replacing a column in a table to create a new table. So I would want to write:
Thank you @wesm! I hope my comments are helpful.
This parameter no longer exists. For the Parquet path it was replaced by
coerce_timestamps; other cases should use Column.cast().