Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support upcoming default pandas string dtype (pandas >= 3) #930

Closed
jorisvandenbossche opened this issue Aug 22, 2024 · 6 comments · Fixed by #933
Closed

Support upcoming default pandas string dtype (pandas >= 3) #930

jorisvandenbossche opened this issue Aug 22, 2024 · 6 comments · Fixed by #933

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 22, 2024

Pandas decided to introduce a default string dtype (which will be used by default instead of object-dtype when inferring values to be strings), see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and pandas-dev/pandas#54792 for progress of implementation).

This is already available in the main branch of pandas (and will also be in am upcoming 2.3 release) behind a feature flag pd.options.future.infer_string = True.

Right now, if you enable this flag (with nightly version of pandas) and use fastparquet to write a dataframe with a string column, this errors as follows (because fastparquet is not yet aware of the new dtype):

In [1]: pd.options.future.infer_string = True

In [2]: df = pd.DataFrame({"a": ["some", "strings"]})

In [3]: df.dtypes
Out[3]: 
a    str
dtype: object

In [4]: df.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")
...
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:904, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols, cols_dtype)
    902     se.name = column
    903 else:
--> 904     se, type = find_type(data[column], fixed_text=fixed,
    905                          object_encoding=oencoding, times=times,
    906                          is_index=is_index)
    907 col_has_nulls = has_nulls
    908 if has_nulls is None:

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:222, in find_type(data, fixed_text, object_encoding, times, is_index)
    218     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    219                                    parquet_thrift.ConvertedType.UTF8,
    220                                    None)
    221 else:
--> 222     raise ValueError("Don't know how to convert data type: %s" % dtype)
    223 se = parquet_thrift.SchemaElement(
    224     name=norm_col_name(data.name, is_index), type_length=width,
    225     converted_type=converted_type, type=type,
   (...)
    228     i32=True
    229 )
    230 return se, type

ValueError: Don't know how to convert data type: str
@martindurant
Copy link
Member

With this type, are the values still python strings?

@jorisvandenbossche
Copy link
Member Author

The values are either object-dtype with python strings (or np.nan for missing values) or either a pyarrow array, depending on the .storage attribute of the dtype.
(and we will default to use pyarrow if it is installed)

@jorisvandenbossche
Copy link
Member Author

But, regardless of the exact storage, if you just want to have Python strings you can always do something like to_numpy(dtype=object) and then you don't have to care about the exact storage

@martindurant
Copy link
Member

if you just want to have Python strings

I want to pre-allocate a dataframe and fill in the values as they are read. That model probably doesn't work anymore for arrow-backed data more complex than the equivalent numpy array.

#931 shows the possible future evolution of fastparquet where we no longer use pandas at all...

@jorisvandenbossche
Copy link
Member Author

(FWIW, pandas is not going to hard require pyarrow for pandas 3.0, that decision is postponed until a later release. But regardless of that, having less pandas-specific code here sounds certainly worthwhile)

Preallocating probably won't work for the arrow-backed data indeed. But I would say you can always read the strings as you do now (preallocating an object-dtype array, I assume?) and do any conversion afterwards (or leave that to pandas to do so)

@martindurant
Copy link
Member

martindurant commented Aug 22, 2024

you can always read the strings as you do now

Probably we'll continue to produce numpy object columns while we can, but we still have to deal with the str type when writing.

I'll get back to you on the two issues, thanks for letting me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants