DOC: Design drafts to assist with next-gen pandas internals discussion #13944
Conversation
Current coverage is 85.30% (diff: 100%)

@@          master    #13944   diff @@
======================================
Files          139       139
Lines        50157     50157
Methods          0         0
Messages         0         0
Branches         0         0
======================================
Hits         42785     42785
Misses        7372      7372
Partials         0         0
On the topic of cleaning up APIs, IMO there should be a section about the …
One general point of observation so far: this coupling with …
@gfyoung I would appreciate some help making a very specific analysis (with code examples) of potential concerns around insulating users from NumPy-specific implementation details. For example: I do not believe it is pandas's responsibility to maximize its "substitutability" with ndarrays. It's a "nice to have", but the original "Series is an ndarray" design was a mistake in retrospect.
I agree that coming up with a plan to improve …
Also, instead of the term "data type", I would appreciate using either "physical type" or "logical type". In NumPy there is no separation of these concepts (except perhaps with …
pandas-specific metadata objects that model the current semantics / behavior of
the project. What does this mean, exactly?

* Each NumPy dtype object will map 1-to-1 to an equivalent ``pandas.DataType``
A layout of a proposed pandas types class hierarchy might be useful here, especially if we are trying to delineate cleanly between logical and physical types. `pandas.DataType` confused me a bit, especially in light of your section header.
Also, to what degree would they be equivalent? Is it just purely semantics? Or is there some sort of physical compatibility as well (i.e. we could "convert" between the two if we so choose)?
I can write down a hierarchy. Any `pandas.DataType` is a logical type (perhaps we can come up with a better name for this base class?) — all of the pandas metadata objects are logical, with a clear default mapping onto a physical memory representation. For example:

- `pandas.Int64Type` → `numpy.int64` plus a pandas internal bitmap
- `pandas.Float64Type` → `numpy.float64`
- `pandas.CategoricalType` → one of `numpy.int8` to `numpy.int64`, depending on the categories
- `pandas.StringType` → the dictionary-encoded UTF-8 representation described in the docs

and so on. I'll add it to the document — the main point is that there's a 1-to-1 mapping from NumPy's physical types onto pandas's logical types, but the mapping from pandas to NumPy may not be 1-to-1, and may not map onto NumPy at all without an explicit lossy cast (instead of the implicit lossy casts happening right now; the example cited in the docs is missing data in integer arrays).
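To make that concrete, here is a minimal sketch of such a hierarchy. All class names and attributes below are hypothetical illustrations of the mapping described above, not an actual pandas API:

```python
import numpy as np


class DataType(object):
    """Base class for pandas logical types: metadata only, no storage."""


class Int64Type(DataType):
    # Default physical mapping: numpy.int64 values plus a validity bitmap
    physical_type = np.int64


class Float64Type(DataType):
    # 1-to-1 with numpy.float64, so conversion can be zero-copy
    physical_type = np.float64


class CategoricalType(DataType):
    def __init__(self, categories):
        self.categories = list(categories)

    @property
    def physical_type(self):
        # The integer code type depends on the number of categories
        n = len(self.categories)
        for t in (np.int8, np.int16, np.int32, np.int64):
            if n <= np.iinfo(t).max:
                return t


class StringType(DataType):
    """Dictionary-encoded UTF-8 representation (no single NumPy dtype)."""
```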
Another thing that isn't explicitly called out in these docs:

- The pandas logical types are only metadata
- NumPy's physical dtypes are metadata plus a vtable of C functions (the `f` attribute on the `PyArray_Descr` object) that implement algorithms on those types

The resolution of a particular function to invoke on pandas data will depend on the actual physical memory representation of the data in pandas.
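A rough Python illustration of that resolution idea (all names here are hypothetical): the kernel is looked up from the physical layout of the data, rather than stored on the type object the way NumPy's vtable is:

```python
import numpy as np

def sum_int64_with_bitmap(values, valid):
    # Physical layout: int64 values plus a boolean validity mask
    return values[valid].sum()

def sum_float64(values):
    return values.sum()

# Kernel registry keyed by (operation, physical representation)
KERNELS = {
    ('sum', 'int64+bitmap'): sum_int64_with_bitmap,
    ('sum', 'float64'): sum_float64,
}

values = np.array([1, 2, 3], dtype=np.int64)
valid = np.array([True, False, True])
print(KERNELS[('sum', 'int64+bitmap')](values, valid))  # 4
```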
On your other question re: physical compatibility:

https://wesm.github.io/pandas2-design/internal-architecture.html#preserving-numpy-interoperability

- If the mapping between a pandas logical type and a physical NumPy type is 1-to-1 (for example: `float64`), then you can convert to ndarray and back without copying data
- Some logical types will not be representable as `ndarray` without a lossy conversion. For example: a pandas integer array containing nulls can be converted to ndarray, but this conversion will be "lossy" (unless you use `numpy.ma`). For example, you may choose the current behavior, which is conversion to `float64` (this is already a one-way trip for the data — values exceeding 2^53 get destroyed by this)
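A quick check of that "one-way trip" claim, for anyone who wants to see it happen:

```python
import numpy as np

# int64 values beyond 2**53 cannot survive a round trip through float64,
# because float64 has only a 53-bit significand
x = np.array([2**53 + 1], dtype=np.int64)
roundtripped = x.astype(np.float64).astype(np.int64)
print(x[0])             # 9007199254740993
print(roundtripped[0])  # 9007199254740992 -- precision silently lost
```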
At a high level, the "pandas 2.0" effort is based on a number of observations:
* The pandas 0.x series of releases have consisted of huge amounts of
This first point muddies the waters a bit (for me); likewise with the point starting on line 79, "Removal of deprecated / underutilized functionality".
Further down you seem to split between:

- API changes that are the result of pandas fixing 0.x implementation details (e.g. integer-NA), and
- API changes that would be because the original idea may be flawed / out of scope (`.ix`, `plotting`).

Does it make sense to limit this document entirely to the internals refactoring, and only talk about API changes of the first kind?

EDIT: I realize I didn't say why I thought the discussion should be limited to just the first kind. I worry discussions about arbitrary API changes will distract from what is probably the more important issue of the internals refactoring. I imagine there are people on the internet who will raise havoc if you try to take away their `DataFrame.plot` 😄
Good points. I mainly fixated on `.ix` because the `.loc` and `.iloc` indexing operators will probably need to be reimplemented as part of this internals overhaul, and having to drag along things like `.ix` would add implementation burden for unclear benefit. I agree that talking about other refactoring / cleaning is a distraction. Will make some amendments to make this clearer.

The other intent of this first point was that the iterative / agile development style of the project (from its early days until now) has made it difficult to consider large/invasive changes to the internals, and after so much time we are due to seriously contemplate what's working well and what's not.
+1 on making this distinction. We can also start drafting documents that lay out other API changes (not related to the internals refactoring) and put those in the same directory (so the 'goals and motivations' can touch both aspects), but in separate PRs.

EDIT: I see you already said the same below .. :-)
Thanks for putting this together! A topic that is somewhat orthogonal to the internals, but one thing I've been thinking about in the context of pandas 2.0, is interaction with JIT compilers - mostly for UDFs. I'm using …

I came to pandas from SAS - there is plenty to dislike about it - but one thing I actually do miss is a feeling of safety: that if you can write it, it will basically be fast. Of course, with pandas/numpy you have to be much more defensive, and sometimes try really hard to vectorize - just as an example I remember, this SO answer from @unutbu is really clever, but I would have never come up with it. I think this need to vectorize adds to the pandas API size, and I also think it adds significantly to the learning curve.

I'm sure, like many others, I've taken a hard look at Julia - it wasn't there yet (and maybe it won't ever be?) but there is a big allure to that model. As a user, I'd love to be able to just write this, and just have it be fast:

df.groupby('key').transform(lambda x: (x - x.mean()) / x.std())

This may be more of a stretch, but as a contributor who's not a C++ wizard, it would also be nice if some portion of pandas was implementable at a JIT-able level, a la @shoyer's numbagg. I get this may not be possible - you probably couldn't even make …

So I was just curious if you had any thoughts / roadmap ideas for this topic? A C/C++ API would open the door a lot of the way, but I think there may be value in tighter integration?
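To make the UDF point concrete, here is a rough sketch of what a JIT-compiled version of that transform could look like, using numba purely as an illustration; the function name, the precomputed group offsets, and the sorted-by-key layout are all assumptions for this sketch, not a pandas API:

```python
import numba
import numpy as np

@numba.njit
def standardize_by_group(values, group_starts):
    # values are assumed sorted by group key; group_starts holds the
    # offset of each group plus a final sentinel of len(values)
    out = np.empty_like(values)
    for i in range(len(group_starts) - 1):
        lo, hi = group_starts[i], group_starts[i + 1]
        chunk = values[lo:hi]
        out[lo:hi] = (chunk - chunk.mean()) / chunk.std()
    return out

vals = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
starts = np.array([0, 3, 5], dtype=np.int64)
print(standardize_by_group(vals, starts))
```

The appeal of this model is that the UDF body compiles to a tight loop over contiguous memory, rather than dispatching back into Python once per group.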
* Create a custom string array container type suitable for use in a
  ``pandas.Array``, and a ``pandas.string`` logical data type.
* Require that all strings be encoded as UTF-8.
What are the benefits of UTF-8 vs Unicode? Is it just space? If encoding categorically, would the space become less of an issue?

Would Unicode allay some of the issues below? And provide easier compatibility with Python 3?
My understanding is that "Unicode" is a generic term, whereas UTF-8 refers to a specific encoding. Python's Unicode type uses UTF-8 internally, but also some other metadata (type, reference count, plus a reference to the data -- this is the 24 bytes of PyObject overhead), which means creating one requires memory allocation.
I see. I didn't know Python's internal representation was variable length. Thanks
OK, so it's a bit more overhead than I thought :).
>>> sys.getsizeof(unicode())
52
But in context it's not so bad (this is PY2).

In [3]: sys.getsizeof(str())
Out[3]: 33

In [4]: sys.getsizeof(unicode())
Out[4]: 50

Of course, a string repr as proposed is much less.
@chris-b1 several things on the "faster groupby" front: …
* **Removal of deprecated / underutilized functionality**: As the Python data
  ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
  with more than 2 dimensions) may be better served by other open source
  projects. Also, functionality that has been explicitly deprecated or
Somewhere we should add: enforce groupby immutability. This has caused so many random issues with the current inference (e.g. someone mutates inside the UDF). A COW on pandas objects could detect this much more reliably than the current way (e.g. seeing if the index was copied). Should simply disallow this.

So we have tons of issues related to this, see here. I will post a mini-list of issues that I am closing.
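A tiny illustration of the failure mode being described (hypothetical UDF; whether the mutation leaks back into the frame is exactly the implementation-dependent behavior at issue):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

def udf(group):
    group['val'] = 0           # mutates the group object inside the UDF
    return group['val'].sum()

# Whether this mutation leaks back into df depends on whether pandas
# handed the UDF a view or a copy; copy-on-write (or simply disallowing
# mutation, as proposed) would make this deterministic.
print(df.groupby('key').apply(udf))
print(df)
```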
Column statistics
~~~~~~~~~~~~~~~~~

In quite a few pandas algorithms, there are characteristics of the data that
We do currently calculate on demand and cache some of these properties (at least monotonicity) on the `pandas.Index`, which we can do because it's immutable. I don't see any difficulty extending that to Series or DataFrame (pandas 2.0 or not) as long as we invalidate on mutation.
Yeah, we would need some kind of a "dirty" flag for the statistics so that any mutating method sets `dirty = true`.
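A minimal sketch of that idea (class and attribute names are hypothetical): cache the statistic, and have every mutating operation flip the flag so the next read recomputes.

```python
import numpy as np

class CachedStatsColumn:
    """Sketch: a column that caches statistics and invalidates on mutation."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self._dirty = True
        self._is_monotonic = None

    def __setitem__(self, i, v):
        self._values[i] = v
        self._dirty = True            # any mutation marks the cache stale

    @property
    def is_monotonic(self):
        if self._dirty:               # recompute lazily, only when stale
            self._is_monotonic = bool(np.all(np.diff(self._values) >= 0))
            self._dirty = False
        return self._is_monotonic

col = CachedStatsColumn([1, 2, 3])
print(col.is_monotonic)  # True (computed and cached)
col[1] = 5
print(col.is_monotonic)  # False (cache invalidated, then recomputed)
```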
Uniqueness is important in some columns. Potentially also compressibility (e.g. imagine backing by a chunked-compressed store).
Yeah, I'll add that here (since we're already computing this in indexes) -- it's more expensive than the others since you have to push the data through a hash table.
As for C vs C++, one factor that I find convenient about C++ is that you have the STL for standard data structures built in. NumPy actually uses Python's C API when it needs a dict or set internally, for example, which is kind of insane.
Note that the STL containers (like …): http://incise.org/hash-table-benchmarks.html

We should obviously do our own investigations, but we always have the option to pull in 3rd party libraries (we might decide to continue using klib or decommission it, depending on benchmarks).
  pandas's semantics and enable the core developers to extend pandas more
  cleanly with new data types, data structures, and computational semantics.
* **Exposing a pandas Cython and/or C/C++ API to other Python library
  developers**: the internals of Series and DataFrame are only weakly
I think it's probably worth some thought how to make a clean API between the storage back end, the pandas structures, and the dispatch (compute) machinery.

E.g. say we wanted to swap in something like https://github.com/alimanfoo/zarr; this in theory should be easy, as it has numpy storage / indexing semantics.

In a similar way, it would be nice to define an API to a compute engine (like numexpr or numba). We mainly need a consistent way of handling dtype conversions (this of course is very similar to the problem arrow is designed to solve), except here we could use numpy as the intermediary.
A probable prerequisite for 3rd-party data structures would be having a C or C++ API (not sure if this one does, but in general), unless they get coerced to NumPy arrays whenever you need to do anything that falls outside their pure Python API.
I'm going to let discussions collect here for a bit, then incorporate feedback into the documents. It would be useful to start a separate document covering changes/improvements/refactorings that we would like to do in pandas 2.0 that do not involve changes to the internals (like read_csv, which was brought up). How do you want to manage that?
One thing that surfaced recently in #13395 that I would like to consider for pandas 2.0 is to expose the contiguous one-dimensional arrays that store the values of DataFrame columns as part of the public API. These would be similar to the existing …

This would be complementary to existing scalar types like …

Having such an API would be useful for third party libraries that integrate with pandas in a deep way (such as seaborn, sklearn, statsmodels and xarray), because it doesn't always make sense to work with labeled values.
@shoyer I'm totally with you here. This isn't explicitly called out in these docs, but that was kind of what I intended here: https://wesm.github.io/pandas2-design/internal-architecture.html#pandas-array-types

So basically every logical type (whether Int32Array, StringArray, CategoricalArray, etc.) would have a corresponding public Python API. These would also have …
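As a rough sketch of what such a public array type might look like (the names and the validity-mask layout here are assumptions based on the design doc, not an actual API):

```python
import numpy as np

class Array:
    """Base class for 1-D typed value containers (no axis labels)."""

class Int64Array(Array):
    def __init__(self, values, valid=None):
        self.values = np.asarray(values, dtype=np.int64)
        # validity mask: True where the value is present (not null)
        self.valid = (np.ones(len(self.values), dtype=bool)
                      if valid is None else np.asarray(valid, dtype=bool))

    def to_numpy(self):
        # Lossy escape hatch: nulls become NaN in a float64 array
        out = self.values.astype(np.float64)
        out[~self.valid] = np.nan
        return out

arr = Int64Array([1, 2, 3], valid=[True, False, True])
print(arr.to_numpy())  # [ 1. nan  3.]
```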
Notably, this is the way that PostgreSQL handles null values. For example, we
might have:

.. code-block::
This code block is not showing up in the generated HTML document. Is a language required?
Yep, apparently code-block does not default even to plain text if you don't indicate the language. I'll fix this soon on my pass through to incorporate all the feedback here.
I realigned the milestones with some 'approx' dates.
@wesm Thanks for the extensive explanations! Two small things I wondered (I have to read it further in more detail): …
Agreed. Otherwise, I would just add documents describing those possible changes to this same …
@jorisvandenbossche I'd be happy to create a pandas-design repo — this has the added benefit of not bloating the main pandas git history if we check in images or other assets. Otherwise we can keep working in this doc/pandas-2.0 directory and indicate which docs are "internals-related" and which are "stuff that depends on the internals" (like the CSV reader, indexing, etc.)
I'm going to start incorporating the feedback into this document. I'll write a follow-up PR for the Copy-On-Write discussion. I'll indicate in the document which pages are internals-related and which are separate from the internals, and maybe we can make more PRs digging into some of the other refactoring / cleanup we want to do. Let's start discussing on the mailing list how we intend to proceed with two branches of pandas. I would suggest we keep the pandas 2.0 branch "rebaseable" for as long as possible, but at some point there will be some necessary divergence.
What do others think of putting those documents and PRs in a separate repo?
I'm +0 on a separate repo (pydata/pandas-design or something), if only to make the discussion issues more accessible to outsiders (vs. having to seek them out in the main pandas issue tracker)
I'll go ahead and do that. We can always move the docs back here if the process is not working well. I also would like more people to actively watch the design discussion. Having a dedicated repo will make the GitHub emails separate from the usual pandas firehose.
I've addressed many of the comments in here in wesm/pandas2#1 -- let's move the discussion there and see how it goes. I still need to fix up the GitHub permissions so that all who have push rights here can push there.
This is a revised version of the docs at pandas-dev/pandas#13944

Author: Wes McKinney <wesm@apache.org>

Closes #1 from wesm/pandas-2.0-drafts and squashes the following commits:

fa5ccbf [Wes McKinney] Typo with unsigned integers
30b25b0 [Wes McKinney] Note
608612b [Wes McKinney] Add top level note about WIP status
458ae85 [Wes McKinney] Incorporate some more feedback from github
801259d [Wes McKinney] Make goals section leaner / more concise per comments
94d0281 [Wes McKinney] Add section about logical to physical correspondence
7a44f8f [Wes McKinney] Add requirements.txt
c0f080b [Wes McKinney] Move over design drafts from main pandas repo
Sprinted on my accumulated design ideas. There's more to do, but I'm going to take a break for a bit.
I published this as a sphinx site temporarily to https://wesm.github.io/pandas2-design/ so it's easier to read. This needs to be a team effort, so depending on where the discussions go perhaps we can build a set of documents that we can make PRs to and discuss specific matters either in GitHub issues or on the mailing list. Let me know what everybody thinks!