
DOC: Design drafts to assist with next-gen pandas internals discussion #13944

Closed
wants to merge 8 commits

Conversation

@wesm (Member) commented Aug 9, 2016

Sprinted on my accumulated design ideas. There's more to do, but I'm going to take a break for a bit.

I published this as a sphinx site temporarily to https://wesm.github.io/pandas2-design/ so it's easier to read. This needs to be a team effort, so depending on where the discussions go perhaps we can build a set of documents that we can make PRs to and discuss specific matters either in GitHub issues or on the mailing list. Let me know what everybody thinks!

@codecov-io

Current coverage is 85.30% (diff: 100%)

Merging #13944 into master will not change coverage

@@             master     #13944   diff @@
==========================================
  Files           139        139          
  Lines         50157      50157          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42785      42785          
  Misses         7372       7372          
  Partials          0          0          

Powered by Codecov. Last update b7abef4...c7819cf

@gfyoung (Member) commented Aug 9, 2016

On the topic of cleaning up APIs, IMO there should be a section about the read_csv API. This API is a giant monolith, and supporting two implementations no longer makes much sense given that their behaviours and support are inconsistent. Defining a new, sleeker interface (most likely based on the C engine) would be nice to have.

@gfyoung (Member) commented Aug 9, 2016

One general observation so far: the coupling with numpy will probably need some more fleshing out, especially when it comes to defining data types in pandas. Finding the right balance, with just enough separation from numpy to operate largely independently of it while still providing good interop between pandas objects and numpy objects, is going to be tricky.

@wesm (Member, Author) commented Aug 9, 2016

@gfyoung I would appreciate some help making a very specific analysis (with code examples) of potential concerns around insulating users from NumPy-specific implementation details. For example: I do not believe it is pandas's responsibility to maximize its "substitutability" with ndarrays. It's a "nice to have", but the original "Series is an ndarray" design was a mistake in retrospect.

@wesm (Member, Author) commented Aug 9, 2016

I agree that coming up with a plan to improve read_csv is a good idea — this is basically orthogonal to the data structure internals. The implementation of read_csv will obviously have to change (in a good way, e.g. "unavoidable consolidation" can be avoided in the new design).

@wesm (Member, Author) commented Aug 9, 2016

Also, instead of the term "data type" I would appreciate using either "physical type" or "logical type". In NumPy there is no separation between these concepts (except perhaps with datetime[unit]), and this is a big reason why we are in this mess. See https://wesm.github.io/pandas2-design/internal-architecture.html#some-definitions

pandas-specific metadata objects that model the current semantics / behavior of
the project. What does this mean, exactly?

* Each NumPy dtype object will map 1-to-1 to an equivalent ``pandas.DataType``
@gfyoung (Member) commented Aug 9, 2016

A layout of a proposed pandas types class hierarchy might be useful here, especially if we are trying to delineate cleanly between logical and physical types. pandas.DataType confused me a bit especially in light of your section header.


Also, to what degree would they be equivalent? Is it purely semantics? Or is there some sort of physical compatibility as well (i.e., we could "convert" between the two if we so choose)?

@wesm (Member, Author) replied:

I can write down a hierarchy. Any pandas.DataType is a logical type (perhaps we can come up with a better name for this base class?) — all of the pandas metadata objects are logical, with a clear default mapping onto a physical memory representation.

For example:

  • pandas.Int64Type —> numpy.int64 plus a pandas internal bitmap
  • pandas.Float64Type —> numpy.float64
  • pandas.CategoricalType —> one of numpy.int8 to numpy.int64 depending on the categories
  • pandas.StringType —> dictionary-encoded UTF8 representation described in the docs

and so on. I'll add it to the document. The main point is that there's a 1-to-1 mapping from NumPy's physical types onto pandas's logical types, but the mapping from pandas to NumPy may not be 1-to-1, and may not map onto NumPy at all without an explicit lossy cast (instead of the implicit lossy casts happening right now: the example cited in the docs is missing data in integer arrays).
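The mapping above can be sketched in code. This is purely illustrative: the class and attribute names (DataType, physical_type, the category-code widths) follow the bullets above but are not an actual pandas 2.0 API.

```python
import numpy as np

# Illustrative sketch of the proposed logical-type hierarchy; the names
# are hypothetical, not real pandas 2.0 classes.
class DataType:
    """Base class for pandas logical types (metadata only)."""
    physical_type = None  # default physical memory representation

class Int64Type(DataType):
    physical_type = np.dtype('int64')  # plus a pandas-internal null bitmap

class Float64Type(DataType):
    physical_type = np.dtype('float64')

class CategoricalType(DataType):
    def __init__(self, categories):
        self.categories = list(categories)
        # Physical code width depends on the number of categories
        n = len(self.categories)
        self.physical_type = np.dtype(
            'int8' if n < 2**7 else
            'int16' if n < 2**15 else
            'int32' if n < 2**31 else 'int64')

print(Int64Type.physical_type)                    # int64
print(CategoricalType(range(300)).physical_type)  # int16
```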

@wesm (Member, Author) commented Aug 9, 2016

Another thing that isn't explicitly called out in these docs:

  • The pandas logical types are only metadata
  • NumPy's physical dtypes are metadata plus a vtable of C functions (the f attribute on the PyArray_Descr object) that implement algorithms on those types

The resolution of a particular function to invoke on pandas data will depend on the actual physical memory representation of the data in pandas.
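A rough way to picture that resolution step in code. The registry and the dispatch_sum helper are invented for illustration; the real machinery would live in compiled code.

```python
import numpy as np

# Hypothetical kernel registry keyed by the physical representation.
KERNELS = {
    np.dtype('int64'):   lambda v: v.sum(),       # no nulls representable
    np.dtype('float64'): lambda v: np.nansum(v),  # NaN-aware variant
}

def dispatch_sum(values):
    """Resolve the function to invoke from the array's physical dtype."""
    return KERNELS[values.dtype](values)

print(dispatch_sum(np.array([1, 2, 3], dtype='int64')))        # 6
print(dispatch_sum(np.array([1.0, np.nan], dtype='float64')))  # 1.0
```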

@wesm (Member, Author) commented Aug 9, 2016

On your other question re: physical compatibility:

https://wesm.github.io/pandas2-design/internal-architecture.html#preserving-numpy-interoperability

  • If the mapping between a pandas logical type and a physical NumPy type is 1 to 1 (for example: float64), then you can convert to ndarray and back without copying data
  • Some logical types will not be representable as ndarray without a lossy conversion. For example, a pandas integer array containing nulls can be converted to ndarray, but the conversion will be "lossy" (unless you use numpy.ma). You may choose the current behavior, which is conversion to float64 (already a one-way trip for the data: values exceeding 2^53 get destroyed by this)
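The 2^53 hazard mentioned above is easy to demonstrate with plain NumPy (a sketch, not pandas code):

```python
import numpy as np

# float64 has a 52-bit mantissa, so integers above 2**53 cannot all be
# represented exactly; the int64 -> float64 cast silently rounds them.
exact = 2**53 + 1
arr = np.array([exact], dtype='int64')
as_float = arr.astype('float64')   # the "current behavior" cast
print(int(as_float[0]) == exact)   # False: the value was destroyed
```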

Comment from a Contributor:

I would add:

  • PeriodDtype here
  • pd.String (the link)
  • future possibilities, e.g. Interval here (of which Period is really just a sub-type)

@jreback jreback added Docs API Design Compat pandas objects compatability with Numpy or Python functions labels Aug 9, 2016

At a high level, the "pandas 2.0" effort is based on a number of observations:

* The pandas 0.x series of releases have consisted with huge amounts of
@TomAugspurger (Contributor) commented Aug 9, 2016
This first point muddies the waters a bit (for me); likewise with the point starting on line 79, "Removal of deprecated / underutilized functionality".

Further down you seem to split between

  1. API changes that are the result of pandas fixing 0.X implementation details (e.g. integer-NA), and
  2. API changes that would be because the original idea may be flawed / out of scope (.ix plotting).

Does it make sense to limit this document entirely to the internals refactoring, and only talk about API changes of the first kind?

EDIT: I realize I didn't say why I thought the discussion should be limited to just the first kind. I worry discussions about arbitrary API changes will distract from what is probably the more important issue of the internals refactoring. I imagine there are people on the internet who will raise havoc if you try to take away their DataFrame.plot 😄

@wesm (Member, Author) replied:

Good points. I mainly fixated on .ix because the .loc and .iloc indexing operators will probably need to be reimplemented as part of this internals overhaul, and having to drag along things like .ix would add implementation burden for unclear benefit. I agree talking about other refactoring / cleaning is a distraction. Will make some amendments to make this more clear.

The other intent of this first point was that the iterative / agile development style of the project (from its early days until now) has made it difficult to consider large/invasive changes to the internals, and after so much time we are due to seriously contemplate what's working well and what's not working well.

@jorisvandenbossche (Member) commented Aug 19, 2016

+1 on making this distinction. We can also start drafting documents that lay out other API changes (not related to internal refactoring) and put those in the same directory (so the 'goals and motivations' can touch both aspects), but in separate PRs.

EDIT: I see you already said the same below .. :-)

@chris-b1 (Contributor) commented Aug 9, 2016

Thanks for putting this together!

A topic that is somewhat orthogonal to the internals, but one thing I've been thinking about in the context of pandas 2.0, is interaction with JIT compilers - mostly for UDFs. I'm using numba as a frame of reference, but thinking generally.

I came to pandas from SAS - there is plenty to dislike about it - but one thing I actually do miss is a feeling of safety: that if you can write it, it will basically be fast. Of course with pandas/numpy you have to be much more defensive, and sometimes try really hard to vectorize. Just as an example I remember: this SO answer from @unutbu is really clever, but I would never have come up with it.

I think this need to vectorize adds to the pandas API size, and I also think it adds significantly to the learning curve. Like many others, I'm sure, I've taken a hard look at Julia - it wasn't there yet (and maybe it never will be?) but there is a big allure to that model. As a user, I'd love to be able to just write this and have it be fast:

df.groupby('key').transform(lambda x: (x - x.mean()) / x.std())

This may be more of a stretch, but as a contributor who's not a C++ wizard, it would also be nice if some portion of pandas were implementable at a JIT-able level, a la @shoyer's numbagg. I get that this may not be possible - you probably couldn't even make numba a required dep.

So I was just curious whether you have any thoughts / roadmap ideas on this topic? A C/C++ API would open the door most of the way, but I think there may be value in tighter integration.


* Create a custom string array container type suitable for use in a
``pandas.Array``, and a ``pandas.string`` logical data type.
* Require that all strings be encoded as UTF-8.
Comment from a Contributor:

What are the benefits of UTF-8 vs Unicode? Is it just space? If encoding categorically, would the space become less of an issue?
Would Unicode allay some of the issues below?
And provide easier compatibility with Python 3?

Comment from a Member:

My understanding is that "Unicode" is a generic term, whereas UTF-8 refers to a specific encoding. Python's Unicode type uses UTF-8 internally, but also some other metadata (type, reference counter plus reference to the data -- this is the 24 bytes of PyObject overhead), which means creating one requires a memory allocation.

Comment from a Contributor:

I see. I didn't know python's internal representation was variable length. Thanks

Comment from a Member:

OK, so it's a bit more overhead than I thought :).

>>> sys.getsizeof(unicode())
52

@jreback (Contributor) commented Aug 9, 2016

But in context it's not so bad (this is PY2).

In [3]: sys.getsizeof(str())
Out[3]: 33

In [4]: sys.getsizeof(unicode())
Out[4]: 50

Of course, the string representation as proposed is much smaller.

@wesm (Member, Author) commented Aug 9, 2016

@chris-b1 several things on the "faster groupby" front:

  • if the inner loops of group-wise operations are happening in compiled code with limited contact with the Python interpreter, then the microperformance of each operation will be much better
  • hypothetically (with much effort) one could create a useful enough C API such that the groupby operators could accept a ctypes function pointer -- this is basically (IIUC) how numba implements custom ufuncs. Obviously the low-level bits of numerical arrays are simpler than pandas objects, but the same general approach would be valid.
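To make the second bullet concrete, here is a hedged sketch of what "accepting a ctypes function pointer" could look like. The engine (apply_per_group) and the kernel signature are invented for illustration, and the Python callback merely stands in for compiled (e.g. numba-generated) code.

```python
import ctypes
import numpy as np

# A hypothetical kernel signature: double f(const double*, size_t)
KERNEL = ctypes.CFUNCTYPE(ctypes.c_double,
                          ctypes.POINTER(ctypes.c_double),
                          ctypes.c_size_t)

@KERNEL
def group_sum(values, n):
    # In practice this body would be machine code; Python here only
    # keeps the sketch runnable.
    return sum(values[i] for i in range(n))

def apply_per_group(data, group_sizes, kernel):
    """Invoke the kernel on each contiguous group (illustrative engine)."""
    out, offset = [], 0
    for size in group_sizes:
        chunk = data[offset:offset + size]
        ptr = chunk.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
        out.append(kernel(ptr, size))
        offset += size
    return out

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(apply_per_group(data, [2, 3], group_sum))  # [3.0, 12.0]
```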

* **Removal of deprecated / underutilized functionality**: As the Python data
ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
with more than 2 dimensions) may be better served by other open source
projects. Also, functionality that has been explicitly deprecated or
@jreback (Contributor) commented Aug 9, 2016

Somewhere we should add: enforce groupby immutability. This has caused so many random issues with the current inference (e.g. someone mutates inside the udf). A COW on pandas objects could detect this much more reliably than the current way (e.g. seeing if the index was copied). We should simply disallow this.
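One cheap way to approximate "disallow mutation inside the UDF" without full copy-on-write is to hand the UDF a read-only view. This sketch (the engine name apply_immutable is invented) uses NumPy's writeable flag as a stand-in for the COW machinery discussed above:

```python
import numpy as np

# Sketch: hand each group to the UDF as read-only, so any mutation
# raises immediately instead of silently corrupting state.
def apply_immutable(groups, func):
    results = []
    for g in groups:
        g.flags.writeable = False     # cheap stand-in for copy-on-write
        try:
            results.append(float(func(g)))
        finally:
            g.flags.writeable = True  # restore for the caller
    return results

groups = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(apply_immutable(groups, lambda g: g.sum()))  # [3.0, 7.0]

def mutating_udf(g):
    g[0] = -1.0  # raises ValueError: assignment destination is read-only
    return g.sum()
```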

@jreback (Contributor) commented Aug 9, 2016

So we have tons of issues related to this, see here.

I will post a mini-list of issues that I am closing.

Column statistics
~~~~~~~~~~~~~~~~~

In quite a few pandas algorithms, there are characteristics of the data that
Comment from a Member:

We do currently calculate on demand and cache some of these properties (at least monotonicity) on the pandas.Index, which we can do because it's immutable. I don't see any difficulty extending that to Series or DataFrame (pandas 2.0 or not) as long as we invalidate on mutation.

@wesm (Member, Author) replied:

Yeah, we would need some kind of a "dirty" flag for the statistics so that any mutation method sets dirty = true.
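A minimal sketch of such a dirty flag. The names (StatsCache, is_monotonic, set) are illustrative, not a proposed API:

```python
import numpy as np

# On-demand statistic with invalidation on mutation; a stand-in for the
# column-statistics machinery discussed above, not actual pandas code.
class StatsCache:
    def __init__(self, values):
        self._values = values
        self._cache = {}          # empty cache == everything "dirty"

    def set(self, i, value):
        self._values[i] = value
        self._cache.clear()       # any mutation marks all stats dirty

    @property
    def is_monotonic(self):
        if 'monotonic' not in self._cache:  # recompute only when dirty
            v = self._values
            self._cache['monotonic'] = bool(np.all(v[1:] >= v[:-1]))
        return self._cache['monotonic']

col = StatsCache(np.array([1, 2, 3]))
print(col.is_monotonic)  # True
col.set(0, 10)           # mutation invalidates cached statistics
print(col.is_monotonic)  # False
```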

Comment from a Contributor:

Uniqueness is important in some columns; potentially also compression ability (e.g. imagine backing by a chunked, compressed store).

@wesm (Member, Author) replied:

Yeah, I'll add that here (since we're already computing this in indexes) -- it's more expensive than the others since you have to push the data through a hash table.

@shoyer (Member) commented Aug 10, 2016

As for C vs C++, one factor that I find convenient about C++ is that you have the STL for standard data structures builtin. NumPy actually uses Python's C API when it needs a dict or set internally, for example, which is kind of insane.

@wesm (Member, Author) commented Aug 10, 2016

Note that the STL containers (like std::unordered_map for hash tables) have pretty acceptable performance:

http://incise.org/hash-table-benchmarks.html

We should obviously do our own investigations, but we always have the option to pull in 3rd-party libraries (we might decide to continue using klib, or to decommission it, depending on benchmarks).

pandas's semantics and enable the core developers to extend pandas more
cleanly with new data types, data structures, and computational semantics.
* **Exposing a pandas Cython and/or C/C++ API to other Python library
developers**: the internals of Series and DataFrame are only weakly
Comment from a Contributor:

I think it's probably worth some thought how to make a clean API between the storage back end, the pandas structures, and the dispatch (compute) machinery.

E.g., say we wanted to swap in something like https://github.com/alimanfoo/zarr -- this in theory should be easy, as it has numpy storage / indexing semantics.

In a similar way it would be nice to define an API to a compute engine (like numexpr or numba).

We mainly need a consistent way of handling dtype conversions (this of course is very similar to the problem arrow is designed to solve), except here we could use numpy as the intermediary.

@wesm (Member, Author) replied:

A probable pre-requisite for 3rd-party data structures would be having a C or C++ API (not sure if this one does, but in general), unless they get coerced to NumPy arrays whenever you need to do anything that falls outside their pure Python API.

@wesm (Member, Author) commented Aug 10, 2016

I'm going to let discussions collect here for a bit, then incorporate feedback into the documents.

It would be useful to start a separate document for the changes/improvements/refactorings that we would like to do in pandas 2.0 that do not involve changes to the internals (read_csv was brought up, for example). How do you want to manage that?

@shoyer (Member) commented Aug 13, 2016

One thing that surfaced recently in #13395, and that I would like to consider for pandas 2.0, is to expose the contiguous one-dimensional arrays that store the values of DataFrame columns as part of the public API. These would be similar to the existing Categorical type, but done in a more systematic manner and returned by pandas objects when you call Series.values and Series.unique().

This would be complementary to existing scalar types like Period/Datetime/Timestamp. If the existing scalar type has the name pandas.Scalar, I would call the array version pandas.ScalarArray, inheriting from pandas.BaseArray.

Having such an API would be useful for third party libraries that integrate with pandas in a deep way (such as seaborn, sklearn, statsmodels and xarray), because it doesn't always make sense to work with labeled values.

@wesm (Member, Author) commented Aug 15, 2016

@shoyer I'm totally with you here. This isn't explicitly called out in these docs but that was kind of what I intended here:

https://wesm.github.io/pandas2-design/internal-architecture.html#pandas-array-types

So basically every logical type (whether Int32Array, StringArray, CategoricalArray, etc.) would have a corresponding public Python API. These would also implement __array__ for NumPy interoperability where possible.
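A hedged sketch of what that __array__ interop could look like for a nullable integer array. The IntArray class and its boolean mask (standing in for the proposed null bitmap) are hypothetical:

```python
import numpy as np

# Hypothetical logical array type: int64 values plus a validity mask.
# Conversion to ndarray is lossless when fully valid, lossy
# (float64 + NaN) when nulls are present -- as discussed above.
class IntArray:
    def __init__(self, values, valid):
        self._values = np.asarray(values, dtype='int64')
        self._valid = np.asarray(valid, dtype=bool)

    def __array__(self, dtype=None, copy=None):
        if self._valid.all():
            out = self._values               # 1-to-1: no cast needed
        else:
            out = self._values.astype('float64')
            out[~self._valid] = np.nan       # explicit lossy cast
        return out if dtype is None else out.astype(dtype)

arr = IntArray([1, 2, 3], [True, False, True])
print(np.asarray(arr))  # the null slot comes back as NaN
```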

Notably, this is the way that PostgreSQL handles null values. For example, we
might have:

.. code-block::
@chrisaycock (Contributor) commented Aug 15, 2016

This code block is not showing-up in the generated HTML document. Is a language required?

@wesm (Member, Author) replied:

Yep, apparently code-block does not default even to plain text if you don't indicate the language. I'll fix it soon on my pass through to incorporate all the feedback here.

@jreback jreback added this to the 2.0 milestone Aug 16, 2016
@jreback (Contributor) commented Aug 16, 2016

I realigned the milestones with some 'approx' dates.

@jorisvandenbossche (Member) commented Aug 19, 2016

@wesm Thanks for the extensive explanations!

Two small things I wondered (have to read it further in more detail):

  • There has been discussion about the datetime resolution (and the resulting limited span). I suppose that with the better logical dtype system it should be possible to tackle this? Maybe it is worth mentioning somewhere?
  • Also regarding the dtype system: to what extent would it be possible to define custom dtypes externally (i.e., other libraries providing a dtype that can be used in pandas dataframes)?

@jorisvandenbossche (Member) commented:

> It would be useful to start a separate document that is changes/improvements/refactorings that we would like to do in pandas 2.0 that do not involve changes to the internals (like, read_csv was brought up), how do you want to manage that?

Agreed.
Would it be worth creating a separate repo where those documents can live, together with the ones of this PR (a pandas-design, pandas-discussion, pandas-dev, ...)? This way, we could also use the issues/PRs over there to discuss the several topics without adding to the already huge number of open issues in the pandas repo, keeping things a bit separated.
But this has the disadvantage that when we start with actual code PRs, it will again be in the pandas repo.

Otherwise, I would just add documents describing those possible changes to this same pandas-2.0 directory.

@wesm (Member, Author) commented Aug 20, 2016

@jorisvandenbossche I'd be happy to create a pandas-design repo — this has the added benefit of not bloating the main pandas git history if we check in images or other assets. Otherwise we can keep working in this doc/pandas-2.0 directory and indicate which docs are "internals-related" and which are "stuff that depends on the internals" (like the CSV reader, indexing, etc.)

@wesm (Member, Author) commented Aug 23, 2016

I'm going to start incorporating the feedback into this document. I'll write a follow-up PR for the Copy-On-Write discussion. I'll indicate in the document which pages are Internals-related and which are separate from Internals, and maybe we can make more PRs digging into some of the other refactoring / cleanup we want to do.

Let's start discussing on the mailing list how we intend to proceed with two branches of pandas. I would suggest we keep the pandas 2.0 branch "rebaseable" for as long as possible, but at some point there will be some necessary divergence.

@jorisvandenbossche (Member) commented:

What do others think of putting those documents and PRs in a separate repo?

@wesm (Member, Author) commented Aug 23, 2016

I'm +0 on a separate repo (pydata/pandas-design or something), if only to make the discussion issues more accessible to outsiders (vs. having to seek them out in the main pandas issue tracker)

@wesm (Member, Author) commented Aug 23, 2016

I'll go ahead and do that. We can always move the docs back here if the process is not working well.

I also would like more people to actively watch the design discussion. Having a dedicated repo will make the GitHub emails separate from the usual pandas firehose.

@wesm (Member, Author) commented Aug 24, 2016

I've addressed many of the comments here in wesm/pandas2#1 -- let's move the discussion there and see how it goes. I still need to fix up the GitHub permissions so that all who have push rights here can push there.

@wesm wesm closed this Aug 24, 2016
wesm added a commit to wesm/pandas2 that referenced this pull request Aug 24, 2016
This is a revised version of the docs at pandas-dev/pandas#13944

Author: Wes McKinney <wesm@apache.org>

Closes #1 from wesm/pandas-2.0-drafts and squashes the following commits:

fa5ccbf [Wes McKinney] Typo with unsigned integers
30b25b0 [Wes McKinney] Note
608612b [Wes McKinney] Add top level note about WIP status
458ae85 [Wes McKinney] Incorporate some more feedback from github
801259d [Wes McKinney] Make goals section leaner / more concise per comments
94d0281 [Wes McKinney] Add section about logical to physical correspondence
7a44f8f [Wes McKinney] Add requirements.txt
c0f080b [Wes McKinney] Move over design drafts from main pandas repo
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Docs