Description
Describe the bug, including details regarding any error messages, version, and platform.
I have a large set of CSV files I want to read with pyarrow. It's too large to fit into memory. So I'm using pyarrow.dataset.dataset to stream it into a parquet file.
I can successfully parse a timestamp like 2016/04/20 10:12:10. But I cannot parse one like 2016/04/20 10:12:10.123 or 2016/04/20 10:12:10.123456, even when I add .%f to the format string.
data.csv
a,t
1,2016/04/20 10:12:10.123456
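import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import csv

# Target schema; column names match the header of data.csv.
schema = {
    'a': pa.int64(),
    't': pa.timestamp('us'),
}
dataset = ds.dataset(
    source='data.csv',
    format=ds.CsvFileFormat(
        convert_options=csv.ConvertOptions(
            # Tried in order: with fractional seconds first, then without.
            timestamp_parsers=[
                "%Y/%m/%d %H:%M:%S.%f",
                "%Y/%m/%d %H:%M:%S",
            ]
        )
    ),
    schema=pa.schema(schema),
)
dataset.to_table().to_pandas()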
This results in an error:
ArrowInvalid: Could not open CSV input source '/home/matthew/data/debug/testcsv/data.csv': Invalid: In CSV column #1: Row #2: CSV conversion error to timestamp[us]: invalid value '2016/04/20 10:12:10.123456'
Note that the documentation for timestamp_parsers says:
A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given). By default, a fast built-in ISO-8601 parser is used.
Plain datetime.datetime.strptime does work with this format string:
from datetime import datetime
datetime.strptime("2016/04/20 10:12:10.123456", "%Y/%m/%d %H:%M:%S.%f")
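For comparison, here is a minimal sketch of the behaviour I expected from the "tried in order" wording in the documentation, written with the standard library only. The helper name parse_first_match is hypothetical and this is not how pyarrow implements timestamp_parsers; it only illustrates the fallback semantics I assumed.

```python
from datetime import datetime

# Same format strings as in the report, most specific first.
FORMATS = ["%Y/%m/%d %H:%M:%S.%f", "%Y/%m/%d %H:%M:%S"]


def parse_first_match(value, formats=FORMATS):
    """Hypothetical helper: return the result of the first format that parses."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format did not match; fall through to the next one.
            continue
    raise ValueError(f"no format matched {value!r}")


with_fraction = parse_first_match("2016/04/20 10:12:10.123456")
without_fraction = parse_first_match("2016/04/20 10:12:10")
```

With this fallback, both the value with fractional seconds and the value without are accepted by the same list of formats, which is what I was hoping to get from the CSV reader.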
If I delete the microsecond component in the CSV, it runs without error. If I also delete the format string without .%f, leaving only the one with .%f, I get an error, as expected. If I try with a CSV without the microsecond component, and with the 2 format strings swapped, it works. This shows that pyarrow is indeed using the format strings I'm giving it.
Note that for my real use case my data has only 3 decimal digits, not 6. (Initially I wondered whether %f only works for 6. But plain datetime.strptime works with 3 too.) For my use case I actually don't care if the fractional part is discarded.
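To make the 3-digit observation concrete, here is a small standard-library sketch. It only shows what plain datetime.strptime does with %f and how the fractional part could be dropped afterwards; it says nothing about what pyarrow's own parser supports.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S.%f"

# %f in Python's strptime accepts one to six digits and pads on the right.
three_digits = datetime.strptime("2016/04/20 10:12:10.123", FMT)
six_digits = datetime.strptime("2016/04/20 10:12:10.123456", FMT)

# Dropping the fractional part, which would be fine for my use case.
truncated = three_digits.replace(microsecond=0)
```

Here three_digits carries 123000 microseconds, and truncated is the same instant with the fraction removed.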
Component(s)
Python