DOC: Design drafts to assist with next-gen pandas internals discussion #13944
Conversation
Current coverage is 85.30% (diff: 100%)

@@          master    #13944   diff @@
======================================
Files          139       139
Lines        50157     50157
Methods          0         0
Messages         0         0
Branches         0         0
======================================
Hits         42785     42785
Misses        7372      7372
Partials         0         0
On the topic of cleaning up APIs, IMO there should be a section about the …
One general point of observation so far: this coupling with …
@gfyoung I would appreciate some help making a very specific analysis (with code examples) of potential concerns around insulating users from NumPy-specific implementation details. For example: I do not believe it is pandas's responsibility to maximize its "substitutability" with ndarrays. It's a "nice to have", but the original "Series is an ndarray" design was a mistake in retrospect.
I agree that coming up with a plan to improve …
Also, instead of the term "data type", I would appreciate using either "physical type" or "logical type". In NumPy there is no separation of these concepts (except perhaps with …
pandas-specific metadata objects that model the current semantics / behavior of
the project. What does this mean, exactly?

* Each NumPy dtype object will map 1-to-1 to an equivalent ``pandas.DataType``
A layout of a proposed pandas types class hierarchy might be useful here, especially if we are trying to delineate cleanly between logical and physical types. `pandas.DataType` confused me a bit, especially in light of your section header.
Also, to what degree would they be equivalent? Is it just purely semantics? Or is there some sort of physical compatibility as well (i.e. we could "convert" between the two if we so choose)?
I can write down a hierarchy. Any `pandas.DataType` is a logical type (perhaps we can come up with a better name for this base class?) — all of the pandas metadata objects are logical, with a clear default mapping onto a physical memory representation. For example:

- `pandas.Int64Type` → `numpy.int64` plus a pandas internal bitmap
- `pandas.Float64Type` → `numpy.float64`
- `pandas.CategoricalType` → one of `numpy.int8` to `numpy.int64`, depending on the categories
- `pandas.StringType` → the dictionary-encoded UTF-8 representation described in the docs

and so on. I'll add it to the document — the main point is that there's a 1-to-1 mapping from NumPy's physical types onto pandas's logical types, but the mapping from pandas to NumPy may not be 1-to-1, and may not map onto NumPy at all without an explicit lossy cast (instead of the implicit lossy casts happening right now; the example cited in the docs is missing data in integer arrays).
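To make that concrete, here is a minimal sketch of such a hierarchy. All class names and attributes below are hypothetical illustrations of the mapping described above, not an actual pandas API:

```python
import numpy as np


class DataType(object):
    """Base class for pandas logical types: metadata only, no storage."""


class Int64Type(DataType):
    # Default physical mapping: numpy.int64 values plus a validity bitmap
    physical_type = np.int64


class Float64Type(DataType):
    # 1-to-1 with numpy.float64, so conversion can be zero-copy
    physical_type = np.float64


class CategoricalType(DataType):
    def __init__(self, categories):
        self.categories = list(categories)

    @property
    def physical_type(self):
        # The integer code type depends on the number of categories
        n = len(self.categories)
        for t in (np.int8, np.int16, np.int32, np.int64):
            if n <= np.iinfo(t).max:
                return t


class StringType(DataType):
    """Dictionary-encoded UTF-8 representation (no single NumPy dtype)."""
```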
Another thing that isn't explicitly called out in these docs:

- The pandas logical types are only metadata
- NumPy's physical dtypes are metadata plus a vtable of C functions (the `f` attribute on the `PyArray_Descr` object) that implement algorithms on those types

The resolution of a particular function to invoke on pandas data will depend on the actual physical memory representation of the data in pandas.
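A rough Python illustration of that resolution idea (all names here are hypothetical): the kernel is looked up from the physical layout of the data, rather than stored on the type object the way NumPy's vtable is:

```python
import numpy as np

def sum_int64_with_bitmap(values, valid):
    # Physical layout: int64 values plus a boolean validity mask
    return values[valid].sum()

def sum_float64(values):
    return values.sum()

# Kernel registry keyed by (operation, physical representation)
KERNELS = {
    ('sum', 'int64+bitmap'): sum_int64_with_bitmap,
    ('sum', 'float64'): sum_float64,
}

values = np.array([1, 2, 3], dtype=np.int64)
valid = np.array([True, False, True])
print(KERNELS[('sum', 'int64+bitmap')](values, valid))  # 4
```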
On your other question re: physical compatibility:

https://wesm.github.io/pandas2-design/internal-architecture.html#preserving-numpy-interoperability

- If the mapping between a pandas logical type and a physical NumPy type is 1-to-1 (for example: `float64`), then you can convert to ndarray and back without copying data
- Some logical types will not be representable as `ndarray` without a lossy conversion. For example: a pandas integer array containing nulls can be converted to ndarray, but this conversion will be "lossy" (unless you use `numpy.ma`). For example, you may choose the current behavior, which is conversion to `float64` (this is already a one-way trip for the data — values exceeding 2^53 get destroyed by this)
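A quick check of that "one-way trip" claim, for anyone who wants to see it happen:

```python
import numpy as np

# int64 values beyond 2**53 cannot survive a round trip through float64,
# because float64 has only a 53-bit significand
x = np.array([2**53 + 1], dtype=np.int64)
roundtripped = x.astype(np.float64).astype(np.int64)
print(x[0])             # 9007199254740993
print(roundtripped[0])  # 9007199254740992 -- precision silently lost
```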
At a high level, the "pandas 2.0" effort is based on a number of observations:
* The pandas 0.x series of releases have consisted of huge amounts of
This first point muddies the waters a bit (for me); likewise with the point starting on line 79, "Removal of deprecated / underutilized functionality".
Further down you seem to split between:

- API changes that are the result of pandas fixing 0.x implementation details (e.g. integer-NA), and
- API changes that would be because the original idea may be flawed / out of scope (`.ix`, `plotting`).

Does it make sense to limit this document entirely to the internals refactoring, and only talk about API changes of the first kind?

EDIT: I realize I didn't say why I thought the discussion should be limited to just the first kind. I worry discussions about arbitrary API changes will distract from what is probably the more important issue of the internals refactoring. I imagine there are people on the internet who will raise havoc if you try to take away their `DataFrame.plot` 😄
Good points. I mainly fixated on `.ix` because the `.loc` and `.iloc` indexing operators will probably need to be reimplemented as part of this internals overhaul, and having to drag along things like `.ix` would add implementation burden for unclear benefit. I agree that talking about other refactoring / cleaning is a distraction. Will make some amendments to make this clearer.

The other intent of this first point was that the iterative / agile development style of the project (from its early days until now) has made it difficult to consider large/invasive changes to the internals, and after so much time we are due to seriously contemplate what's working well and what's not.
+1 on making this distinction. We can also start drafting documents that lay out other API changes (not related to the internals refactoring) and put those in the same directory (so the 'goals and motivations' can touch both aspects), but in separate PRs.

EDIT: I see you already said the same below .. :-)
Thanks for putting this together! A topic that is somewhat orthogonal to the internals, but one thing I've been thinking about in the context of pandas 2.0, is interaction with JIT compilers - mostly for UDFs. I'm using …

I came to pandas from SAS - there is plenty to dislike about it - but one thing I actually do miss is a feeling of safety: that if you can write it, it will basically be fast. Of course, with pandas/numpy you have to be much more defensive, and sometimes try really hard to vectorize - just as an example I remember, this SO answer from @unutbu is really clever, but I would have never come up with it. I think this need to vectorize adds to the pandas API size, and I also think it adds significantly to the learning curve.

I'm sure, like many others, I've taken a hard look at Julia - it wasn't there yet (and maybe it won't ever be?) but there is a big allure to that model. As a user, I'd love to be able to just write this, and just have it be fast:

df.groupby('key').transform(lambda x: (x - x.mean()) / x.std())

This may be more of a stretch, but as a contributor who's not a C++ wizard, it would also be nice if some portion of pandas was implementable at a JIT-able level, a la @shoyer's numbagg. I get this may not be possible - you probably couldn't even make …

So I was just curious if you had any thoughts / roadmap ideas for this topic? A C/C++ API would open the door a lot of the way, but I think there may be value in tighter integration?
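To make the UDF point concrete, here is a rough sketch of what a JIT-compiled version of that transform could look like, using numba purely as an illustration; the function name, the precomputed group offsets, and the sorted-by-key layout are all assumptions for this sketch, not a pandas API:

```python
import numba
import numpy as np

@numba.njit
def standardize_by_group(values, group_starts):
    # values are assumed sorted by group key; group_starts holds the
    # offset of each group plus a final sentinel of len(values)
    out = np.empty_like(values)
    for i in range(len(group_starts) - 1):
        lo, hi = group_starts[i], group_starts[i + 1]
        chunk = values[lo:hi]
        out[lo:hi] = (chunk - chunk.mean()) / chunk.std()
    return out

vals = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
starts = np.array([0, 3, 5], dtype=np.int64)
print(standardize_by_group(vals, starts))
```

The appeal of this model is that the UDF body compiles to a tight loop over contiguous memory, rather than dispatching back into Python once per group.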
* Create a custom string array container type suitable for use in a
  ``pandas.Array``, and a ``pandas.string`` logical data type.
* Require that all strings be encoded as UTF-8.
What are the benefits of UTF-8 vs Unicode? Is it just space? If encoding categorically, would the space become less of an issue?

Would Unicode allay some of the issues below? And provide easier compatibility with Python 3?
My understanding is that "Unicode" is a generic term, whereas UTF-8 refers to a specific encoding. Python's Unicode type uses UTF-8 internally, but also some other metadata (type, reference count, plus a reference to the data -- this is the 24 bytes of PyObject overhead), which means creating one requires memory allocation.
I see. I didn't know Python's internal representation was variable length. Thanks
OK, so it's a bit more overhead than I thought :).
>>> sys.getsizeof(unicode())
52
But in context it's not so bad (this is PY2).

In [3]: sys.getsizeof(str())
Out[3]: 33

In [4]: sys.getsizeof(unicode())
Out[4]: 50

Of course, a string repr as proposed is much less.
@chris-b1 several things on the "faster groupby" front: …
* **Removal of deprecated / underutilized functionality**: As the Python data
  ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
  with more than 2 dimensions) may be better served by other open source
  projects. Also, functionality that has been explicitly deprecated or
Somewhere we should add: enforce groupby immutability. This has caused so many random issues with the current inference (e.g. someone mutates inside the UDF). A COW on pandas objects could detect this much more reliably than the current way (e.g. seeing if the index was copied). Should simply disallow this.

So we have tons of issues related to this, see here. I will post a mini-list of issues that I am closing.
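A tiny illustration of the failure mode being described (hypothetical UDF; whether the mutation leaks back into the frame is exactly the implementation-dependent behavior at issue):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

def udf(group):
    group['val'] = 0           # mutates the group object inside the UDF
    return group['val'].sum()

# Whether this mutation leaks back into df depends on whether pandas
# handed the UDF a view or a copy; copy-on-write (or simply disallowing
# mutation, as proposed) would make this deterministic.
print(df.groupby('key').apply(udf))
print(df)
```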
Column statistics
~~~~~~~~~~~~~~~~~

In quite a few pandas algorithms, there are characteristics of the data that
We do currently calculate on demand and cache some of these properties (at least monotonicity) on the `pandas.Index`, which we can do because it's immutable. I don't see any difficulty extending that to Series or DataFrame (pandas 2.0 or not) as long as we invalidate on mutation.
Yeah, we would need some kind of a "dirty" flag for the statistics so that any mutating method sets `dirty = true`.
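A minimal sketch of that idea (class and attribute names are hypothetical): cache the statistic, and have every mutating operation flip the flag so the next read recomputes.

```python
import numpy as np

class CachedStatsColumn:
    """Sketch: a column that caches statistics and invalidates on mutation."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self._dirty = True
        self._is_monotonic = None

    def __setitem__(self, i, v):
        self._values[i] = v
        self._dirty = True            # any mutation marks the cache stale

    @property
    def is_monotonic(self):
        if self._dirty:               # recompute lazily, only when stale
            self._is_monotonic = bool(np.all(np.diff(self._values) >= 0))
            self._dirty = False
        return self._is_monotonic

col = CachedStatsColumn([1, 2, 3])
print(col.is_monotonic)  # True (computed and cached)
col[1] = 5
print(col.is_monotonic)  # False (cache invalidated, then recomputed)
```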
Uniqueness is important in some columns. Potentially also compressibility (e.g. imagine backing by a chunked-compressed store).
Yeah, I'll add that here (since we're already computing this in indexes) -- it's more expensive than the others since you have to push the data through a hash table.
As for C vs C++, one factor that I find convenient about C++ is that you have the STL for standard data structures built in. NumPy actually uses Python's C API when it needs a dict or set internally, for example, which is kind of insane.
Note that the STL containers (like …): http://incise.org/hash-table-benchmarks.html

We should obviously do our own investigations, but we always have the option to pull in 3rd party libraries (we might decide to continue using klib or decommission it, depending on benchmarks).
  pandas's semantics and enable the core developers to extend pandas more
  cleanly with new data types, data structures, and computational semantics.
* **Exposing a pandas Cython and/or C/C++ API to other Python library
  developers**: the internals of Series and DataFrame are only weakly
I think it's probably worth some thought how to make a clean API between the storage back end, the pandas structures, and the dispatch (compute) machinery.

E.g. say we wanted to swap in something like https://github.com/alimanfoo/zarr; this in theory should be easy, as it has numpy storage / indexing semantics.

In a similar way, it would be nice to define an API to a compute engine (like numexpr or numba). We mainly need a consistent way of handling dtype conversions (this of course is very similar to the problem arrow is designed to solve), except here we could use numpy as the intermediary.
A probable prerequisite for 3rd-party data structures would be having a C or C++ API (not sure if this one does, but in general), unless they get coerced to NumPy arrays whenever you need to do anything that falls outside their pure Python API.
I'm going to let discussions collect here for a bit, then incorporate feedback into the documents. It would be useful to start a separate document covering changes/improvements/refactorings that we would like to do in pandas 2.0 that do not involve changes to the internals (like read_csv, which was brought up). How do you want to manage that?
One thing that surfaced recently in #13395 that I would like to consider for pandas 2.0 is to expose the contiguous one-dimensional arrays that store the values of DataFrame columns as part of the public API. These would be similar to the existing …

This would be complementary to existing scalar types like …

Having such an API would be useful for third party libraries that integrate with pandas in a deep way (such as seaborn, sklearn, statsmodels and xarray), because it doesn't always make sense to work with labeled values.
@shoyer I'm totally with you here. This isn't explicitly called out in these docs, but that was kind of what I intended here: https://wesm.github.io/pandas2-design/internal-architecture.html#pandas-array-types

So basically every logical type (whether Int32Array, StringArray, CategoricalArray, etc.) would have a corresponding public Python API. These would also have …
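As a rough sketch of what such a public array type might look like (the names and the validity-mask layout here are assumptions based on the design doc, not an actual API):

```python
import numpy as np

class Array:
    """Base class for 1-D typed value containers (no axis labels)."""

class Int64Array(Array):
    def __init__(self, values, valid=None):
        self.values = np.asarray(values, dtype=np.int64)
        # validity mask: True where the value is present (not null)
        self.valid = (np.ones(len(self.values), dtype=bool)
                      if valid is None else np.asarray(valid, dtype=bool))

    def to_numpy(self):
        # Lossy escape hatch: nulls become NaN in a float64 array
        out = self.values.astype(np.float64)
        out[~self.valid] = np.nan
        return out

arr = Int64Array([1, 2, 3], valid=[True, False, True])
print(arr.to_numpy())  # [ 1. nan  3.]
```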
Notably, this is the way that PostgreSQL handles null values. For example, we
might have:

.. code-block::
This code block is not showing up in the generated HTML document. Is a language required?
Yep, apparently code-block does not default even to plain text if you don't indicate the language. I'll fix this soon on my pass through to incorporate all the feedback here.
I realigned the milestones with some 'approx' dates.
@wesm Thanks for the extensive explanations! Two small things I wondered (I have to read it further in more detail): …
Agreed. Otherwise, I would just add documents describing those possible changes to this same …
@jorisvandenbossche I'd be happy to create a pandas-design repo — this has the added benefit of not bloating the main pandas git history if we check in images or other assets. Otherwise we can keep working in this doc/pandas-2.0 directory and indicate which docs are "internals-related" and which are "stuff that depends on the internals" (like the CSV reader, indexing, etc.)
I'm going to start incorporating the feedback into this document. I'll write a follow-up PR for the Copy-On-Write discussion. I'll indicate in the document which pages are internals-related and which are separate from the internals, and maybe we can make more PRs digging into some of the other refactoring / cleanup we want to do. Let's start discussing on the mailing list how we intend to proceed with two branches of pandas. I would suggest we keep the pandas 2.0 branch "rebaseable" for as long as possible, but at some point there will be some necessary divergence.
What do others think of putting those documents and PRs in a separate repo?
I'm +0 on a separate repo (pydata/pandas-design or something), if only to make the discussion issues more accessible to outsiders (vs. having to seek them out in the main pandas issue tracker)
I'll go ahead and do that. We can always move the docs back here if the process is not working well. I also would like more people to actively watch the design discussion. Having a dedicated repo will make the GitHub emails separate from the usual pandas firehose.
I've addressed many of the comments in here in wesm/pandas2#1 -- let's move the discussion there and see how it goes. I still need to fix up the GitHub permissions so that all who have push rights here can push there.
This is a revised version of the docs at pandas-dev/pandas#13944

Author: Wes McKinney <wesm@apache.org>

Closes #1 from wesm/pandas-2.0-drafts and squashes the following commits:

fa5ccbf [Wes McKinney] Typo with unsigned integers
30b25b0 [Wes McKinney] Note
608612b [Wes McKinney] Add top level note about WIP status
458ae85 [Wes McKinney] Incorporate some more feedback from github
801259d [Wes McKinney] Make goals section leaner / more concise per comments
94d0281 [Wes McKinney] Add section about logical to physical correspondence
7a44f8f [Wes McKinney] Add requirements.txt
c0f080b [Wes McKinney] Move over design drafts from main pandas repo
Sprinted on my accumulated design ideas. There's more to do, but I'm going to take a break for a bit.
I published this as a sphinx site temporarily to https://wesm.github.io/pandas2-design/ so it's easier to read. This needs to be a team effort, so depending on where the discussions go perhaps we can build a set of documents that we can make PRs to and discuss specific matters either in GitHub issues or on the mailing list. Let me know what everybody thinks!