Support upcoming default pandas string dtype (pandas >= 3) #930

jorisvandenbossche · 2024-08-22T13:46:52Z

Pandas decided to introduce a default string dtype (which will be used by default instead of object-dtype when inferring values to be strings), see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and pandas-dev/pandas#54792 for progress of implementation).

This is already available in the main branch of pandas (and will also be in an upcoming 2.3 release) behind a feature flag pd.options.future.infer_string = True.

Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, it errors as follows (because fastparquet is not yet aware of the new dtype):

In [1]: pd.options.future.infer_string = True

In [2]: df = pd.DataFrame({"a": ["some", "strings"]})

In [3]: df.dtypes
Out[3]: 
a    str
dtype: object

In [4]: df.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")
...
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:904, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols, cols_dtype)
    902     se.name = column
    903 else:
--> 904     se, type = find_type(data[column], fixed_text=fixed,
    905                          object_encoding=oencoding, times=times,
    906                          is_index=is_index)
    907 col_has_nulls = has_nulls
    908 if has_nulls is None:

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:222, in find_type(data, fixed_text, object_encoding, times, is_index)
    218     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    219                                    parquet_thrift.ConvertedType.UTF8,
    220                                    None)
    221 else:
--> 222     raise ValueError("Don't know how to convert data type: %s" % dtype)
    223 se = parquet_thrift.SchemaElement(
    224     name=norm_col_name(data.name, is_index), type_length=width,
    225     converted_type=converted_type, type=type,
   (...)
    228     i32=True
    229 )
    230 return se, type

ValueError: Don't know how to convert data type: str


martindurant · 2024-08-22T13:48:59Z

With this type, are the values still python strings?

jorisvandenbossche · 2024-08-22T13:50:19Z

The values are either an object-dtype array of Python strings (with np.nan for missing values) or a pyarrow array, depending on the .storage attribute of the dtype.
(and we will default to using pyarrow if it is installed)
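As a small illustration of the .storage attribute mentioned above, the sketch below builds a string column with explicitly requested Python-object storage and inspects the dtype. It assumes a pandas version where pd.StringDtype accepts a storage argument; the variable names are made up for the example.

```python
import pandas as pd

# Explicitly request Python-object storage so the example does not depend on pyarrow.
s = pd.Series(["some", "strings"], dtype=pd.StringDtype("python"))

# The storage attribute tells you what backs the values:
# "python" here, "pyarrow" for arrow-backed data.
storage = s.dtype.storage
print(storage)
```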

jorisvandenbossche · 2024-08-22T13:51:14Z

But, regardless of the exact storage, if you just want to have Python strings you can always do something like to_numpy(dtype=object) and then you don't have to care about the exact storage
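A minimal sketch of that suggestion, assuming a pandas version with the "string" dtype alias. The helper name as_python_strings is hypothetical and not part of pandas or fastparquet.

```python
import pandas as pd


def as_python_strings(series):
    """Return the values as a NumPy object array, whatever the string storage."""
    return series.to_numpy(dtype=object)


s = pd.Series(["some", "strings"], dtype="string")
values = as_python_strings(s)

# Each element is a plain Python str, independent of python/pyarrow storage.
print(values.dtype, [type(v).__name__ for v in values])
```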

martindurant · 2024-08-22T13:59:06Z

> if you just want to have Python strings

I want to pre-allocate a dataframe and fill in the values as they are read. That model probably doesn't work anymore for arrow-backed data more complex than the equivalent numpy array.

#931 shows the possible future evolution of fastparquet where we no longer use pandas at all...

jorisvandenbossche · 2024-08-22T14:09:58Z

(FWIW, pandas is not going to hard require pyarrow for pandas 3.0, that decision is postponed until a later release. But regardless of that, having less pandas-specific code here sounds certainly worthwhile)

Preallocating probably won't work for the arrow-backed data indeed. But I would say you can always read the strings as you do now (preallocating an object-dtype array, I assume?) and do any conversion afterwards (or leave that to pandas to do so)
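The "read as now, convert afterwards" idea could look roughly like the sketch below. This is not fastparquet's reader code: the decoded list is a hypothetical stand-in for values decoded from a file, and the final astype("string") is only one possible conversion step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for string values decoded from a file.
decoded = ["some", "strings", "here"]

# Preallocate an object-dtype array and fill it in place as values arrive.
buf = np.empty(len(decoded), dtype=object)
for i, value in enumerate(decoded):
    buf[i] = value

# Hand the filled buffer to pandas; depending on version and options,
# pandas may keep it as object or infer a string dtype.
column = pd.Series(buf, name="a")

# Optional explicit conversion afterwards.
as_string = column.astype("string")
```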

martindurant · 2024-08-22T14:29:13Z

> you can always read the strings as you do now

Probably we'll continue to produce numpy object columns while we can, but we still have to deal with the str type when writing.
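On the writing side, one way to cope with the new dtype is to turn string-dtype columns back into object columns before type detection. The sketch below shows that idea only; it is not necessarily what the fix in #933 does, the helper name string_columns_to_object is hypothetical, and it assumes unique column names and that the new dtype is an instance of pd.StringDtype.

```python
import pandas as pd


def string_columns_to_object(df):
    """Return a copy of df with pandas string-dtype columns cast to object dtype."""
    out = df.copy()
    for name in out.columns:
        # Assumption: the new default string dtype is a pd.StringDtype instance.
        if isinstance(out[name].dtype, pd.StringDtype):
            out[name] = out[name].astype(object)
    return out


df = pd.DataFrame({
    "a": pd.array(["some", "strings"], dtype="string"),
    "b": [1, 2],
})
converted = string_columns_to_object(df)
print(converted.dtypes)
```

The original frame is left untouched, so the conversion can be applied just before writing without side effects on the caller's data.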

I'll get back to you on the two issues, thanks for letting me know.

martindurant mentioned this issue Aug 29, 2024

Some compatibility fixes #933

Merged

martindurant closed this as completed in #933 Oct 7, 2024
