Closed
Description
GitHub Issues for Apache Arrow
The issue seems to be that pyarrow.Table.from_pandas sets string (object) columns to null type if the dataframe is empty.
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [], 'b': [], 'c': []}, dtype=object)
df['b'] = df['b'].astype(np.int32)
df['c'] = pd.to_datetime(df['c'])
df.dtypes
>> a object
>> b int32
>> c datetime64[ns]
>> dtype: object
The resulting pyarrow schema then has null type for the object column a. Other types (numeric and datetime) seem to work as expected.
table = pa.Table.from_pandas(df, preserve_index=False)
table.schema
>> a: null
>> b: int32
>> c: timestamp[ns]
>> metadata
>> --------
>> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>> b' "a", "field_name": "a", "pandas_type": "empty", "numpy_type": "'
>> b'object", "metadata": null}, {"name": "b", "field_name": "b", "pa'
>> b'ndas_type": "int32", "numpy_type": "int32", "metadata": null}, {'
>> b'"name": "c", "field_name": "c", "pandas_type": "datetime", "nump'
>> b'y_type": "datetime64[ns]", "metadata": null}], "pandas_version":'
>> b' "0.23.0"}'}
As a workaround, you can pass an explicit schema that declares that particular field as pyarrow.string().
t2 = pa.string()
fields = [pa.field('a', t2)]
s = pa.schema(fields)
table = pa.Table.from_pandas(df, schema=s, preserve_index=False)
table.schema
>> a: string
>> b: int32
>> c: timestamp[ns]
>> metadata
>> --------
>> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>> b' "a", "field_name": "a", "pandas_type": "unicode", "numpy_type":'
>> b' "object", "metadata": null}, {"name": "b", "field_name": "b", "'
>> b'pandas_type": "int32", "numpy_type": "int32", "metadata": null},'
>> b' {"name": "c", "field_name": "c", "pandas_type": "datetime", "nu'
>> b'mpy_type": "datetime64[ns]", "metadata": null}], "pandas_version'
>> b'": "0.23.0"}'}
This seems to affect only empty dataframes.