API/ENH: dtype='string' / pd.String #8640

Closed
@jreback

update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation: an object-dtype ndarray of strings. The next step is to store and process it natively.
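As a rough illustration of the extension dtype mentioned in the update, the sketch below requests it by name. It assumes a pandas version that ships `StringDtype` (1.0 or later); the variable names are made up for the example, and nothing here depends on how the values are stored internally.

```python
import pandas as pd

# Request the string extension dtype by name instead of relying on object dtype.
names = pd.Series(["alice", "bob", None], dtype="string")

# The resulting dtype is a StringDtype instance rather than plain object.
is_string_dtype = isinstance(names.dtype, pd.StringDtype)

# Missing values are tracked by the dtype and reported by isna().
missing_mask = names.isna().tolist()

# The .str accessor works as usual and propagates the missing value.
upper = names.str.upper()

print(names.dtype, is_string_dtype, missing_mask)
```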


xref #8627
xref #8643, #8350

Since we introduced Categorical in 0.15.0, I think we have found 2 main uses.

  1. as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
  2. as a memory saving representation for object dtypes.

I could see introducing a dtype='string' where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:

  • it allows unions of arbitrary other string types; currently Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge

Note that this works if they are Series (and probably should raise as well; a side issue).

But if these were both 'string' dtypes, then it's a simple matter to combine them (efficiently).

  • you can restrict the 'sub-dtype' (i.e. the dtype of the categories) to string/unicode (in other words, don't allow numbers or arbitrary objects). This makes the constructor a bit simpler but, more importantly, you now have a 'real' non-object string dtype.
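For the first point (combining string categoricals whose categories differ), later pandas versions expose `pandas.api.types.union_categoricals`, which did not exist when this issue was opened. The sketch below uses it only to show what "union of the categories" means for the `'abc'` / `'abd'` example; it is not the proposed `dtype='string'` implementation.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals with overlapping but different categories,
# mirroring the 'abc' / 'abd' frames from the example above.
left = pd.Categorical(list("abc"))
right = pd.Categorical(list("abd"))

# The result's categories are the union of both inputs' categories,
# and the values are the two inputs concatenated in order.
combined = union_categoricals([left, right])

print(list(combined.categories))
print(list(combined))
```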

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading, conversion, etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh that.
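To make the memory argument concrete, here is a minimal sketch comparing an object-dtype column of repeated strings with the same data stored as a categorical. The column contents are invented for the example, and the saving shown applies only when the number of distinct strings is small relative to the column length.

```python
import pandas as pd

# A column with many rows but only three distinct strings.
values = ["red", "green", "blue"] * 10_000

# Object dtype: one Python string reference per row.
as_object = pd.Series(values, dtype=object)

# Categorical: small integer codes plus one copy of each distinct string.
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```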

We would then have a 'real'-looking object dtype (and object would be relegated to actual Python object types, so it would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?

Labels: Enhancement, ExtensionArray, Performance, Strings