Support for type inference for DataFrames using the DataFrame Interchange Protocol #3114

Merged · 7 commits · Jul 24, 2023

Conversation

jonmmease
Contributor

@jonmmease jonmmease commented Jul 19, 2023

Closes #3112

Approach

After discussion and experimentation in #3112, it looked to me like the best approach for generalizing our column type inference beyond pandas is to look up column names and types in the return value of the __dataframe__ protocol method. If we do this (rather than converting the input dataset to an Arrow table and accessing the Arrow schema), we leave it up to the implementing library whether accessing the column type information results in a full conversion of the entire dataset.
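As an illustrative sketch (not Altair's exact implementation), the protocol object returned by __dataframe__ exposes column names and dtype kinds without requiring the data itself to be materialized:

```python
import pandas as pd  # any library implementing the protocol works (pandas >= 1.5)

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
dfi = df.__dataframe__()  # interchange-protocol DataFrame object
for name in dfi.column_names():
    column = dfi.get_column_by_name(name)
    kind = column.dtype[0]  # a DtypeKind enum value: INT, FLOAT, STRING, ...
    print(name, kind)
```

Whether this metadata lookup is cheap or triggers a full conversion is up to the implementing library, which is the point of the approach above.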

For example, @jcrist noted that today Ibis will eagerly perform a conversion to arrow when the __dataframe__ method is called. But if there's a motivating use case (like the functionality in this PR) then it would be possible to rework the architecture so that accessing only the column type info is cheap and doesn't trigger a query to the backend database.

Initially, I was hoping this approach wouldn't require the pyarrow library, so that we could replace our existing pandas-based approach entirely. But that wasn't quite true: in order to extract the category values for an ordered categorical column, I needed to use a pyarrow column_to_array function to convert the low-level buffers and offsets of the protocol into a Python list of category values. Because of this, I kept the pure-pandas implementation around as the fallback that is used when pyarrow is not available, or pandas is older than version 1.5 (when the __dataframe__ method was introduced).
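For context, the ordered-categorical detection relies on the protocol's describe_categorical metadata; a minimal sketch, assuming pandas >= 1.5 as the protocol implementation:

```python
import pandas as pd

dtype = pd.CategoricalDtype(["G", "PG", "R"], ordered=True)
df = pd.DataFrame({"rating": pd.Series(["G", "PG", "R"], dtype=dtype)})

col = df.__dataframe__().get_column_by_name("rating")
desc = col.describe_categorical  # dict with "is_ordered", "is_dictionary", "categories"
print(desc["is_ordered"])
# desc["categories"] is itself a low-level protocol Column; turning it into a
# Python list of values is the step that requires pyarrow in this PR.
```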

Testing updates

I wanted to be able to run all of our tests with the fallback logic path (which runs when pyarrow is not installed or pandas is older than 1.5), so I updated various tests to skip instead of fail when pyarrow and vegafusion are not installed. Then I updated the GitHub Actions build job to uninstall pyarrow and vegafusion when testing on Python 3.9.
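A hypothetical sketch of such a CI step (the actual workflow change is in the diff further down; step name and condition here are illustrative):

```yaml
# Uninstall optional dependencies on one job so the fallback path is exercised
- name: Maybe uninstall optional dependencies
  if: matrix.python-version == '3.9'
  run: pip uninstall -y pyarrow vegafusion
```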

Interchange Types

The DataFrame Interchange Protocol is helpfully defined using typed abstract Python classes at https://data-apis.org/dataframe-protocol/latest/API.html. To improve the development experience, and to help mypy, I copied the interface types into altair/utils/_dfi_types.py and updated our _DataFrameLike protocol type to return the proper DataFrame interface type. I don't know if we'll end up using the protocol directly in other places, but I think it's a pretty clean way to document, and type check, our usage of the protocol.
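A sketch of what this typing approach can look like (names and method subsets here are illustrative, not Altair's exact definitions in altair/utils/_dfi_types.py):

```python
from typing import Any, Protocol, runtime_checkable

class InterchangeDataFrame(Protocol):
    """Subset of the interchange DataFrame interface used for type inference."""
    def column_names(self) -> Any: ...
    def get_column_by_name(self, name: str) -> Any: ...

@runtime_checkable
class DataFrameLike(Protocol):
    """Anything exposing __dataframe__ qualifies as chart input."""
    def __dataframe__(
        self, nan_as_null: bool = False, allow_copy: bool = True
    ) -> InterchangeDataFrame: ...
```

With @runtime_checkable, isinstance checks against DataFrameLike only verify that a __dataframe__ method exists, which is all the type inference needs.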

Example

First, create a chart from a pandas DataFrame without specifying encoding types. Note that we turn MPAA_Rating into an ordered categorical data type, and the category order is preserved in the bar chart.

import altair as alt
import pyarrow as pa
from vega_datasets import data

# Clean Title column
movies = data.movies()
movies["Title"] = movies["Title"].astype(str)

# Convert MPAA rating to an ordered categorical
rating = movies["MPAA_Rating"].astype("category")
rating = rating.cat.reorder_categories(
    ['Open', 'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated']
).cat.as_ordered()
movies["MPAA_Rating"] = rating

# Build chart using pandas
chart = alt.Chart(movies).mark_bar().encode(
    alt.X("MPAA_Rating"),
    alt.Y("count()")
)
chart

[visualization: bar chart of counts by MPAA_Rating, with categories in the specified order]

Now convert the input to arrow and make the same chart

# Build chart using Arrow
chart = alt.Chart(pa.Table.from_pandas(movies)).mark_bar().encode(
    alt.X("MPAA_Rating"),
    alt.Y("count()")
)
chart

[visualization: the same bar chart, built from the Arrow table]

This was an error previously, but it now works just like pandas, and the category order is preserved in the same way.

If the pyarrow data interchange module is available and the dataset has a __dataframe__ method, infer column types based on the DataFrame Interchange Protocol.

Fall back to the pandas implementation for pandas DataFrames if pyarrow is not available or pandas is older than 1.5.
@jonmmease jonmmease marked this pull request as draft July 19, 2023 23:14
@jonmmease jonmmease requested a review from mattijn July 19, 2023 23:46
@jonmmease jonmmease marked this pull request as ready for review July 19, 2023 23:47
else:
raise ValueError(f"Unexpected DtypeKind: {kind}")


Contributor Author

cc @jcrist for the parts of the DataFrame interchange protocol that I ended up using.

Contributor

@mattijn mattijn left a comment

Few comments @jonmmease!

Initially, I thought of doing a parallel implementation: keeping the existing pandas column type-checking mechanism as the first choice, while exploring experimental support for other dataframe libraries using the DataFrame Interchange Protocol.
The current PR proposes to change the column type-checking to use pandas as the fallback. If we do this here, should we do something similar for the data serialization?

kind = column.dtype[0]
except NotImplementedError as e:
# Edge case hack:
# dtype access fails for pandas column with datetime64[ns] type,
Contributor

Is there an issue at the pandas GitHub repository that we can refer to?

Contributor Author

I'll take a look

Contributor Author

I opened one and added it to the comment in 5f4fc48.

See pandas-dev/pandas#54239

Contributor

Thanks. One double-check: it's just the dtype that cannot be retrieved in pandas-dev/pandas#54239; the data itself can still be serialized correctly when using this dtype.

Contributor Author

Actually, this same error occurs when converting to arrow through the DataFrame interchange protocol with:

from pyarrow.interchange import from_dataframe
from_dataframe(data)

Currently, we don't do this for pandas DataFrames because we use the pandas specific serialization logic. This is something we'll run into when trying to push pandas DataFrames through the DataFrame Interchange Protocol serialization logic. I'm not sure of the best path forward here, but I don't think it's blocking for this PR.

What do you think?

Contributor

Not blocking here. Raising the issue upstream and circumventing it here is fine.
How best to approach this in the data serialization logic is a matter for another issue/PR 👍

and column.describe_categorical["is_ordered"]
and column.describe_categorical["categories"] is not None
):
# Treat ordered categorical column as Vega-Lite ordinal
Contributor

Could we introduce the pyarrow_available() check here instead of at the currently proposed location, since it is only necessary for extracting the category values of an ordered categorical column?
I'm not sure whether avoiding the pyarrow import here is worth the speed-up for standard usage of dataframe libraries.
Or would this make it difficult to go back to the pandas fallback?

Contributor Author

Or would this become difficult to go back to the fallback of pandas?

Yeah, I used pyarrow_available() up above to make it easier to know whether to use the fallback logic. An alternative would be to raise a special exception if we get to this point and pyarrow is not available, and then use the fallback logic if that exception is raised. But this seemed more complicated than it was worth to me.

You do make a good point that this approach will result in pyarrow always being imported. This does introduce a small one-time delay. But on the other hand, if our future direction is to rely on pyarrow and the dataframe interchange protocol instead of pandas, then we'll end up importing pyarrow at some point.
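The availability check discussed here might look something like the following (a hypothetical sketch; Altair's actual helper may differ):

```python
from importlib.util import find_spec

def pyarrow_available() -> bool:
    # Checking for the module spec avoids paying pyarrow's full import
    # cost when we only need to decide which code path to take.
    return find_spec("pyarrow") is not None
```

Note that a find_spec-based check avoids the one-time import delay mentioned above, at the cost of importing pyarrow later when the category values actually need to be extracted.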

@@ -28,6 +28,10 @@ jobs:
pip install .[dev]
# pip install "selenium<4.3.0"
# pip install altair_saver
- name: Maybe uninstall optional dependencies
Contributor

For future readers of this code, it would be good to add an explanation, or a reference to this GitHub issue, of why we do this for Python 3.9.

Contributor Author

Done in 52cd0a8

@@ -436,7 +437,7 @@ def sanitize_arrow_table(pa_table):

def parse_shorthand(
shorthand: Union[Dict[str, Any], str],
data: Optional[pd.DataFrame] = None,
data: Optional[_DataFrameLike] = None,
Contributor

Would this raise a type-checking error for pandas versions prior to 1.5?

Contributor Author

I'm not sure, but this is a good callout. This type should be Union[_DataFrameLike, pd.DataFrame] anyway, since it supports pandas DataFrames that don't have a __dataframe__ method.

Contributor Author

Signatures updated in c4a9b08

@jonmmease
Contributor Author

Thanks for the review @mattijn!

Initially, I thought of doing a parallel implementation: keeping the existing pandas column type-checking mechanism as the first choice, while exploring experimental support for other dataframe libraries using the DataFrame Interchange Protocol.
The current PR proposes to change the column type-checking to use pandas as the fallback.

I initially started the implementation as you described (only using the dataframe interchange protocol approach if the input is a non-pandas DataFrame). This would work fine, but I realized that it would be challenging to test thoroughly because none of our existing tests would exercise the new logic. By flipping it around, we ensure that the new logic is thoroughly tested (as all of the existing tests now use the new logic). And by uninstalling pyarrow in one of the CI jobs, we ensure that the old logic is still thoroughly tested as well.

If we do this here, should we do similar for the data serialization?

Yeah, I think we could use this strategy for data serialization in the future as well. By routing pandas serialization through the new DataFrame Interchange Protocol logic we would ensure that the new logic is well tested and complete. And then we can make sure the fallback pandas logic keeps working by running the tests without pyarrow installed (as this PR does).

Contributor

@mattijn mattijn left a comment

Thanks @jonmmease, for addressing my questions! Can you add a note of this change in the release notes? Thanks again!

@jcrist

jcrist commented Aug 1, 2023

Thanks y'all!

With some additional work in ibis I can now confirm that passing an ibis table to altair:

  • works, even with type inference
  • only executes the backing query once, rather than multiple times

A quick tiny notebook demoing things working: https://nbviewer.org/gist/jcrist/cd905e2181bca3137bbf74aaafd39d3f

@binste
Contributor

binste commented Aug 1, 2023

As an ibis user, this is very exciting 🥳
