API/ENH: dtype='string' / pd.String

*update for 2019-10-07: We have a StringDtype extension dtype. It's memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively*.

---

xref #8627 
xref #8643, #8350

Since we introduced `Categorical` in 0.15.0, I think we have found 2 main uses.

1) as a 'real' Categorical/Factor type to represent a limited of subset of values that the column can take on
2) as a memory saving representation for object dtypes.

I could see introducting a `dtype='string'` where `String` is a slightly specialized sub-class of `Categroical`, with 2 differences compared to a 'regular' Categorical:
- it allows unions of arbitrary other string types, currently `Categorical` will complain if you do this:

```
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge
```

Note that this works if they are `Series` (and prob should raise as well, side -issue)

But, if these were both 'string' dtypes, then its a simple matter to combine (efficiently).
- you can restrict the 'sub-dtype' (e.g. the dtype of the categories) to `string/unicode` (iow, don't allow numbers / arbitrary objects), makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to `dtype='string'` e.g. on reading/conversion/etc. might be a perf issue for some things, but I think the memory savings greatly outweigh.

We would then have a 'real' looking object dtype (and `object` would be relegated to actual python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche 
cc @mwiebe 
thoughts?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

API/ENH: dtype='string' / pd.String #8640

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

API/ENH: dtype='string' / pd.String #8640

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions