@xhochy
Member

@xhochy xhochy commented Jun 10, 2018

This parameter no longer exists. For the Parquet path it was replaced by coerce_timestamps; other cases should use Column.cast().

@codecov-io

Codecov Report

Merging #2129 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2129      +/-   ##
==========================================
- Coverage   86.39%   86.37%   -0.03%     
==========================================
  Files         242      230      -12     
  Lines       41481    40589     -892     
==========================================
- Hits        35838    35059     -779     
+ Misses       5643     5530     -113
Impacted Files Coverage Δ
rust/src/list.rs
rust/src/error.rs
rust/src/array.rs
rust/src/builder.rs
rust/src/memory.rs
rust/src/list_builder.rs
rust/src/datatypes.rs
rust/src/bitmap.rs
rust/src/record_batch.rs
rust/src/memory_pool.rs
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6df28d3...38c5c48.

@domoritz
Member

So we should run https://arrow.apache.org/docs/python/generated/pyarrow.Column.html#pyarrow.Column.cast after converting a data frame to arrow. 👍

@wesm
Member

wesm commented Jun 11, 2018

@domoritz could you elaborate on your use case a bit more?

Member

@wesm wesm left a comment

+1

@wesm wesm closed this in 34890cc Jun 11, 2018
@wesm wesm deleted the ARROW-2689 branch June 11, 2018 21:05
@domoritz
Member

I'm trying to convert some data from pandas to arrow but pandas' timestamps are in ns. I want to reduce the data size and use lower precision.

My code looks roughly like this:

import pandas as pd
import pyarrow as pa

# Read the date and time fields as strings so they can be concatenated.
df = pd.read_csv('flights.csv', encoding='utf-8', dtype={'FL_DATE': 'str', 'ARR_TIME': 'str', 'DEP_TIME': 'str'})

# '2400' does not parse with %H%M, so map it to '0000' first.
arr_time = df.FL_DATE + df.ARR_TIME.replace('2400', '0000')
df['ARRIVAL'] = pd.to_datetime(arr_time, format='%Y%m%d%H%M')

dep_time = df.FL_DATE + df.DEP_TIME.replace('2400', '0000')
df['DEPARTURE'] = pd.to_datetime(dep_time, format='%Y%m%d%H%M')

# Shrink the numeric columns to 16-bit integers.
df = df.astype({'DEP_DELAY': 'int16', 'ARR_DELAY': 'int16', 'AIR_TIME': 'int16', 'DISTANCE': 'int16'})

table = pa.Table.from_pandas(df)

# Attempt to reduce the timestamp precision.
table.column('ARRIVAL').cast(pa.TimestampValue, True)

# `name` is defined elsewhere.
writer = pa.RecordBatchFileWriter(f'{name}.arrow', table.schema)
writer.write(table)
writer.close()

@wesm
Member

wesm commented Jun 11, 2018

Okay. In this line:

table.column('ARRIVAL').cast(pa.TimestampValue, True)

Are you trying to cast that column to a different timestamp unit? This line of code leaves table unmodified (data structures from the pyarrow library are immutable). All timestamps use the same amount of space (8 bytes per value), so changing the unit will not reduce the data size.

It would be a good idea to add a documentation section about type casting and how to change the column type of a table; I don't think we have that right now. We could also add some convenience APIs to help with common workflows (e.g. replacing a single column)

@domoritz
Member

domoritz commented Jun 11, 2018

Are you trying to cast that column to a different timestamp unit?

Yes, I am trying to switch from ns to ms precision.

I guess I have to write something like table = table.column('ARRIVAL').cast(pa.TimestampValue, True) instead. Or does this return a column and so I need table.setColumn('ARRIVAL', table.column('ARRIVAL').cast(pa.TimestampValue, True))?

@wesm
Member

wesm commented Jun 12, 2018

Okay, let's create a JIRA about this and discuss there.

Firstly, the statement cast(pa.TimestampValue, True) will not do what you want. You either want

column.cast(pa.timestamp('ms'))

or

column.cast(pa.timestamp('ms'), safe=False)

depending on whether you want to allow unsafe casts (see http://arrow.apache.org/docs/python/generated/pyarrow.lib.Array.html#pyarrow.lib.Array.cast). I think the docstring could be improved to make it clearer that a DataType instance is expected rather than a class object.

Secondly, we don't have a convenient function for replacing a column in a table to create a new table. I would like to be able to write:

new_column = table.column(name).cast(pa.timestamp('ms'))
new_table = table.set_column(name, new_column)

I opened https://issues.apache.org/jira/browse/ARROW-2699

@wesm
Member

wesm commented Jun 12, 2018

@domoritz
Member

Thank you @wesm! I hope my comments are helpful.


4 participants