API/ENH: dtype='string' / pd.String #8640
I think it would be a very nice improvement to have a real 'string' dtype in pandas. However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype: a dtype that holds string values and in essence has nothing to do with categorical. If I think about a string dtype, I am more thinking about numpy's string types (though those have their own impracticalities, such as fixed sizes), or CHAR/VARCHAR in SQL.
I'm of two minds about this. It could be quite useful, but on the other hand, it would be far better if this could be done upstream in numpy or dynd. Pandas-specific array types are not great for compatibility with the broader ecosystem. I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility. As for this specific proposal:
So I have tagged a related issue about including integer NA support by using dynd. Can you maybe explain a bit about the tradeoffs involved with representing strings in two ways using dynd (e.g. as a vlen string vs. as a categorical)?

cc @teoliphant
I've been intending to tweak the string representation in dynd slightly, and have written that up now: libdynd/libdynd#158. The vlen string in dynd does work presently, but it has slightly different properties than what I'm describing there. The current vlen string has a 16-byte representation using the small string optimization, which means strings of size <= 15 bytes encoded as utf-8 fit directly in that memory. Bigger strings involve a dynamic memory allocation per string, a little like Python's string, but with the utf-8 encoding and the knowledge that it is a string, instead of having to go through dynamic dispatch as in numpy object arrays of strings. Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. Types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow assignment of a new string that was not already in the array of categories.
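The storage trade-off described here maps directly onto pandas' own categorical (a small illustration using pandas rather than dynd, with behavior as of current pandas):

```python
import pandas as pd

# With a small number of distinct values, each element is stored as a
# single signed-byte code pointing into the array of categories
cat = pd.Categorical(["a", "b", "a", "c"] * 1000)
print(cat.codes.dtype)       # int8 -- one byte per element
print(len(cat.categories))   # 3

# The category set is fixed, so a value outside it cannot simply be assigned
s = pd.Series(cat)
try:
    s.iloc[0] = "z"  # "z" is not among the categories
except (TypeError, ValueError) as exc:
    print(exc)  # pandas complains about setting a new category
```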
The issue mentioned in the last comment is now at libdynd/libdynd#158
Any opinions on targeting this for 0.19? Hopefully I have some time during the summer :) There are a few comments in #13827, and I think it's OK if it can be done without breaking existing users' code. We may need some breaking change in 2.0, but the same limitation should apply there as well.
I think we want to release 0.19.0 shortly (RC in a couple of weeks), so let's slate this for the next major release (which will be 1.0, rather than 0.20.0).
Yep, but let me try this weekend. Of course it's OK to put it off to 1.0 if there is no time to review :)
@sinhrks hey, I think a real string pandas dtype would be great. It would allow us to be much more strict about object dtype.
How much work / additional code complexity would this require? I see this as a "nice to have" rather than something that adds fundamentally new functionality to the library.
Maybe @sinhrks can comment more here, but I think at the very least this allows for quite some code simplification. We would then know without having to constantly infer whether something is all strings or includes actual Python objects. I think it could be done without changing much top-level API (beyond adding another pandas dtype); we have most of this machinery already done.
My concern is that it may introduce new user APIs / semantics which may be in the line of fire for future API breakage. If the immediate user benefits (vs. developer benefits) warrant this risk then it may be worth it.
I worked a little on this, and currently expect minimal API change, since it behaves essentially like a Categorical restricted to strings. I assume the implementation consists of two parts, mostly done by re-using / cleaning up the current code:

I agree that we shouldn't force unnecessary migration costs on users/devs. I expect this can be achieved by minimizing API changes.
Thanks for that write-up, Tom! @xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?

Yes, just with a different performance feel, as you described.
I wonder if it makes sense to have a stringarray module for Python that uses the arrow spec but does not have an arrow dependency. Pandas and vaex could use that, as could other projects that work with arrays of strings. In vaex, almost all of the string operations are implemented in C++ (utf8 support and regex); it would not be a bad idea to split off that library. The code needs a cleanup, but it's pretty well tested, and pretty fast: https://towardsdatascience.com/vaex-a-dataframe-with-super-strings-789b92e8d861 I don't have a ton of resources to put into this, but I think it will not cost me much time. If there is serious interest in this (someone from pandas wants to do the pandas part), I'm happy to put in some hours. Ideally, I'd like to see a clean C++ header-only library that this library (pystringarray) and arrow could use, possibly built on xtensor (cc @SylvainCorlay @wolfv), but that can be considered an implementation detail (as long as the API and the memory model stay the same).
I think Arrow also plans to have some string processing methods at some point, and would welcome contributions. So that could also be a place to have such functionality live.
In vaex-core, we are currently not depending on arrow (because of future-compatibility concerns around the 32-bit offset limitation), although the string memory layout is arrow compatible. The vaex-arrow package is required for loading/writing arrow files/streams, so it's an optional dependency; vaex-core does not need it. I think we could now take a pyarrow dependency for vaex-core, although we'll inherit whatever installation issues come with it (not much experience with it, so I'm still not 100% sure; I read there were windows wheel issues). But the same approach can be used by other libraries, such as a hypothetical pystringarray package, which would follow the arrow spec and expose its buffers, but not have a direct pyarrow dependency. Another approach, discussed with @xhochy, is to have a C++ library (a header-only string and stringarray library), possibly built on xtensor or compatible with it. This library could be something that arrow could use, and possibly pystringarray could use. My point is: if general algorithms (especially string algos) go into arrow, I think they will be 'lost' for use outside of arrow, because it's such a big dependency.
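For reference, the arrow string layout under discussion (validity bitmap, int32 offsets, utf-8 data) is visible through pyarrow's public API; a quick sketch, assuming a pyarrow install:

```python
import pyarrow as pa

arr = pa.array(["hello", "world", None])

# An arrow string array is backed by three buffers: a validity bitmap,
# int32 offsets (hence the 32-bit limitation mentioned above), and the
# concatenated utf-8 bytes.
validity, offsets, data = arr.buffers()
print(arr.type)            # string
print(data.to_pybytes())   # b'helloworld'

# pa.large_string() uses int64 offsets instead, lifting the 2 GB limit
large = arr.cast(pa.large_string())
print(large.type)          # large_string
```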
Arrow is only a large dependency if you build all the optional components. I'm concerned there's some FUD being spread here about this topic -- I think it is important to develop a collaborative community that is working together on this (with open community governance) and ensure that downstream consumers can reuse the code that they need without being burdened by optional dependencies.
We are taking two measures in Apache Arrow to make it easier for third party projects to take on the project as a dependency:
There is a bit of a divide between people who are uncomfortable with, e.g., second-order dependencies, and people who are uncomfortable with a large monolithic dependency. Having a large tree of dependencies between small packages is very well addressed by a package manager. It allows a separation of concerns between components, and between the teams developing them, as soon as APIs and extension points are well-defined. This has been the path of Project Jupyter since the Big Split (tm). Monolithic projects make me somewhat more uncomfortable in general; I am rarely interested in everything in a large monolithic project... The way we have been doing things in the xtensor stack is recommending the use of a package manager. We maintain the conda packages, but xtensor packages have been packaged for Fedora, Arch Linux, etc.
I assure you that we hear your concerns and we will do everything we can to address them in time, but it will not happen overnight. Our top priority is ensuring that our developer/contributor community is as productive as possible. Based on our contribution graph I would say we have done a good job of this. The area where we have made the most progress on modular installs is in our .deb and .yum packages: https://github.com/apache/arrow/tree/master/dev/tasks/linux-packages/debian With recent improvements to conda / conda-forge, we can similarly achieve modularization, at least at the C++ package level. To have modular Python installs will not be easy. We need help from more people to figure out how to address this from a tooling perspective. The current solution is optimized for developer productivity, so we have to make sure that any changes made to the packaging process don't make things much more difficult for contributors.
So until this enhancement is implemented (and adopted by most users via upgrading the library), what is the fastest way to check whether a series with dtype object consists only of strings? For example, I have the following series with dtype object and want to detect whether there are any non-string values:

```python
import pandas as pd

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1

def series_has_nonstring_values(series):
    # TODO: how to implement this efficiently?
    return False

assert series_has_nonstring_values(series) is True
```

I hope that this is the right place to address this issue/question?
@8080labs with the current public API, you can use `pandas.api.types.infer_dtype`.

There is a faster check in pandas' internals, but that is private API.
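A sketch of that public-API check (assuming `infer_dtype` is the function meant above; it returns the label `"string"` only when all values, skipping missing ones, are strings):

```python
import pandas as pd
from pandas.api.types import infer_dtype

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1  # sneak in a non-string value

def series_has_nonstring_values(series):
    # infer_dtype scans the values in C and reports a single dtype label,
    # e.g. "string", "integer", or "mixed-integer" for this series
    return infer_dtype(series, skipna=True) != "string"

assert series_has_nonstring_values(series) is True
```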
closed via #27949
There is still relevant discussion here on the second part of this enhancement: native storage. (Tom also updated the top comment to reflect this.)
After learning more about the goals of Apache Arrow, vaex will happily depend on it in the (near?) future. I want to set aside the discussion of where the C++ string library code should live (inside or outside arrow), so as not to get sidetracked. I'm happy to spend a bit of my time to see if I can move the algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?). Vaex's string API is modeled on pandas (80-90% compatible), so my guess is that pandas should be able to make use of this move to Arrow, since it could simply forward many of the string method calls directly to Arrow once the algorithms are moved. In short:
Thanks for the update @maartenbreddels. Speaking for myself (not pandas-dev), I don't have a strong opinion on where these algorithms should live; I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient, since we're dancing around a hard dependency on pyarrow in a few places. I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.
I opened #35169 for discussing how we can expose an Arrow-backed StringArray to users.
@mroeschke closable?

Yeah, I believe the current Arrow-backed string dtype covers this.
Update for 2019-10-07: we have a StringDtype extension dtype. Its memory model is the same as the old implementation: an object-dtype ndarray of strings. The next step is to store & process it natively.
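For illustration, the extension dtype mentioned in this update is used like the following (a sketch against pandas >= 1.0, where `StringDtype` landed):

```python
import pandas as pd

# The dedicated string dtype, distinct from object
s = pd.Series(["a", "b", None], dtype="string")
print(s.dtype)            # string (pd.StringDtype())

# Missing values are pd.NA rather than np.nan
print(s.isna().tolist())  # [False, False, True]

# Assigning a non-string is rejected, unlike with an object-dtype series
try:
    s[0] = 1
except (TypeError, ValueError) as exc:
    print(exc)  # only string values can be stored
```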
xref #8627
xref #8643, #8350
Since we introduced `Categorical` in 0.15.0, I think we have found 2 main uses. I could see introducing a `dtype='string'` where `String` is a slightly specialized sub-class of `Categorical`, with 2 differences compared to a 'regular' `Categorical`:

- it makes combining much easier: `Categorical` will complain if you try to combine two categoricals with different categories (see the sketch below). Note that this works if they are `Series` (and prob should raise as well, side-issue). But if these were both 'string' dtypes, then it's a simple matter to combine (efficiently).
- the values are restricted to `string/unicode` (iow, don't allow numbers / arbitrary objects). That makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to `dtype='string'`, e.g. on reading/conversion/etc. That might be a perf issue for some things, but I think the memory savings greatly outweigh it.

We would then have a 'real' string dtype (and `object` would be relegated to actual python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe

thoughts?
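A minimal sketch of the combining friction described above (the original inline snippet did not survive; behavior shown with current pandas, using today's `dtype="string"` to stand in for the proposed dtype):

```python
import pandas as pd

s1 = pd.Series(pd.Categorical(["a", "b"]))
s2 = pd.Series(pd.Categorical(["b", "c"]))  # different categories

# Combining categoricals with different categories loses the dtype:
# modern pandas silently falls back to object, where older versions
# complained outright
out = pd.concat([s1, s2], ignore_index=True)
print(out.dtype)  # object

# With a real string dtype, combining is a simple, dtype-preserving matter
t1 = pd.Series(["a", "b"], dtype="string")
t2 = pd.Series(["b", "c"], dtype="string")
print(pd.concat([t1, t2], ignore_index=True).dtype)  # string
```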