Support for type inference for DataFrames using the DataFrame Interchange Protocol #3114

Merged · 7 commits · Jul 24, 2023

Conversation

jonmmease
Contributor

@jonmmease jonmmease commented Jul 19, 2023

Closes #3112

Approach

After discussion and experimentation in #3112, it looked to me like the best approach for generalizing our column type inference beyond pandas is to look up column names and types in the return value of the __dataframe__ protocol method. If we do this (rather than converting the input dataset to an Arrow table and accessing the Arrow schema), we leave it up to the implementing library whether accessing the column type information results in a full conversion of the entire dataset.
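As an illustrative sketch (not Altair's exact implementation), the protocol object returned by __dataframe__ exposes column names and dtype kinds without requiring the data itself to be materialized:

```python
import pandas as pd  # any library implementing the protocol works (pandas >= 1.5)

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
dfi = df.__dataframe__()  # interchange-protocol DataFrame object
for name in dfi.column_names():
    column = dfi.get_column_by_name(name)
    kind = column.dtype[0]  # a DtypeKind enum value: INT, FLOAT, STRING, ...
    print(name, kind)
```

Whether this metadata lookup is cheap or triggers a full conversion is up to the implementing library, which is the point of the approach above.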

For example, @jcrist noted that today Ibis will eagerly perform a conversion to arrow when the __dataframe__ method is called. But if there's a motivating use case (like the functionality in this PR) then it would be possible to rework the architecture so that accessing only the column type info is cheap and doesn't trigger a query to the backend database.

Initially, I was hoping this approach wouldn't require the pyarrow library, so that we could replace our existing pandas-based approach entirely. But that wasn't quite true: in order to extract the category values for an ordered categorical column, I needed to use a pyarrow column_to_array function to convert the low-level buffers and offsets of the protocol into a Python list of category values. Because of this, I kept the pure-pandas implementation around as the fallback that is used when pyarrow is not available, or pandas is older than version 1.5 (when the __dataframe__ method was introduced).
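For context, the ordered-categorical detection relies on the protocol's describe_categorical metadata; a minimal sketch, assuming pandas >= 1.5 as the protocol implementation:

```python
import pandas as pd

dtype = pd.CategoricalDtype(["G", "PG", "R"], ordered=True)
df = pd.DataFrame({"rating": pd.Series(["G", "PG", "R"], dtype=dtype)})

col = df.__dataframe__().get_column_by_name("rating")
desc = col.describe_categorical  # dict with "is_ordered", "is_dictionary", "categories"
print(desc["is_ordered"])
# desc["categories"] is itself a low-level protocol Column; turning it into a
# Python list of values is the step that requires pyarrow in this PR.
```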

Testing updates

I wanted to be able to run all of our tests with the fallback logic path (which runs when pyarrow is not installed or pandas is older than 1.5), so I updated various tests to skip instead of fail when pyarrow and vegafusion are not installed. Then I updated the GitHub Actions build job to uninstall pyarrow and vegafusion when testing on Python 3.9.
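A hypothetical sketch of such a CI step (the actual workflow change is in the diff further down; step name and condition here are illustrative):

```yaml
# Uninstall optional dependencies on one job so the fallback path is exercised
- name: Maybe uninstall optional dependencies
  if: matrix.python-version == '3.9'
  run: pip uninstall -y pyarrow vegafusion
```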

Interchange Types

The DataFrame Interchange Protocol is helpfully defined using typed abstract Python classes at https://data-apis.org/dataframe-protocol/latest/API.html. To improve the development experience, and to help mypy, I copied the interface types into altair/utils/_dfi_types.py and updated our _DataFrameLike protocol type to return the proper DataFrame interface type. I don't know if we'll end up using the protocol directly in other places, but I think it's a pretty clean way to document, and type check, our usage of the protocol.
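A sketch of what this typing approach can look like (names and method subsets here are illustrative, not Altair's exact definitions in altair/utils/_dfi_types.py):

```python
from typing import Any, Protocol, runtime_checkable

class InterchangeDataFrame(Protocol):
    """Subset of the interchange DataFrame interface used for type inference."""
    def column_names(self) -> Any: ...
    def get_column_by_name(self, name: str) -> Any: ...

@runtime_checkable
class DataFrameLike(Protocol):
    """Anything exposing __dataframe__ qualifies as chart input."""
    def __dataframe__(
        self, nan_as_null: bool = False, allow_copy: bool = True
    ) -> InterchangeDataFrame: ...
```

With @runtime_checkable, isinstance checks against DataFrameLike only verify that a __dataframe__ method exists, which is all the type inference needs.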

Example

First, create a chart from a pandas DataFrame without specifying encoding types. Note that we turn MPAA_Rating into an ordered categorical data type, and the category order is preserved in the bar chart.

import altair as alt
import pyarrow as pa
from vega_datasets import data

# Clean Title column
movies = data.movies()
movies["Title"] = movies["Title"].astype(str)

# Convert MPAA rating to an ordered categorical
rating = movies["MPAA_Rating"].astype("category")
rating = rating.cat.reorder_categories(
    ['Open', 'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated']
).cat.as_ordered()
movies["MPAA_Rating"] = rating

# Build chart using pandas
chart = alt.Chart(movies).mark_bar().encode(
    alt.X("MPAA_Rating"),
    alt.Y("count()")
)
chart

[visualization: bar chart of counts by MPAA_Rating, with categories in the specified order]

Now convert the input to arrow and make the same chart

# Build chart using Arrow
chart = alt.Chart(pa.Table.from_pandas(movies)).mark_bar().encode(
    alt.X("MPAA_Rating"),
    alt.Y("count()")
)
chart

[visualization: the same bar chart, built from the Arrow table]

This was an error previously, but it now works just like pandas, and the category order is preserved in the same way.

If the pyarrow data interchange module is available and the dataset has a __dataframe__ method, infer column types based on the DataFrame Interchange Protocol.

Fall back to the pandas implementation for pandas DataFrames if pyarrow is not available or pandas is older than 1.5.
@jonmmease jonmmease marked this pull request as draft July 19, 2023 23:14
@jonmmease jonmmease requested a review from mattijn July 19, 2023 23:46
@jonmmease jonmmease marked this pull request as ready for review July 19, 2023 23:47
else:
raise ValueError(f"Unexpected DtypeKind: {kind}")


Contributor Author

cc @jcrist for the parts of the DataFrame interchange protocol that I ended up using.

Contributor

@mattijn mattijn left a comment

Few comments @jonmmease!

Initially, I thought of doing a parallel implementation: keeping the existing pandas column type-checking mechanism as the first choice, while exploring experimental support for other dataframe libraries using the DataFrame Interchange Protocol.
The current PR proposes to change the column type-checking to use pandas as the fallback. If we do this here, should we do something similar for the data serialization?

kind = column.dtype[0]
except NotImplementedError as e:
# Edge case hack:
# dtype access fails for pandas column with datetime64[ns] type,
Contributor

Is there an issue at the pandas GitHub repository that we can refer to?

Contributor Author

I'll take a look

Contributor Author

I opened one and added it to the comment in 5f4fc48.

See pandas-dev/pandas#54239

Contributor

Thanks. One double-check: it's just the dtype that cannot be retrieved in pandas-dev/pandas#54239; the data itself can still be serialized correctly when using this dtype.

Contributor Author

Actually, this same error occurs when converting to arrow through the DataFrame interchange protocol with:

from pyarrow.interchange import from_dataframe
from_dataframe(data)

Currently, we don't do this for pandas DataFrames because we use the pandas specific serialization logic. This is something we'll run into when trying to push pandas DataFrames through the DataFrame Interchange Protocol serialization logic. I'm not sure of the best path forward here, but I don't think it's blocking for this PR.

What do you think?

Contributor

Not blocking here. Raising the issue upstream and circumventing it here is fine.
How best to approach this in the data serialization logic is a matter for another issue/PR 👍

and column.describe_categorical["is_ordered"]
and column.describe_categorical["categories"] is not None
):
# Treat ordered categorical column as Vega-Lite ordinal
Contributor

Could we introduce the pyarrow_available() check here instead of at the currently proposed location, since it is only necessary for extracting the category values of an ordered categorical column?
I'm not sure whether avoiding the pyarrow import here is worth the speed-up for standard usage of dataframe libraries.
Or would this make it difficult to go back to the pandas fallback?

Contributor Author

Or would this become difficult to go back to the fallback of pandas?

Yeah, I used pyarrow_available() up above to make it easier to know whether to use the fallback logic. An alternative would be to raise a special exception if we get to this point and pyarrow is not available, and then use the fallback logic if that exception is raised. But this seemed more complicated than it was worth to me.

You do make a good point that this approach will result in pyarrow always being imported. This does introduce a small one-time delay. But on the other hand, if our future direction is to rely on pyarrow and the dataframe interchange protocol instead of pandas, then we'll end up importing pyarrow at some point.
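The availability check discussed here might look something like the following (a hypothetical sketch; Altair's actual helper may differ):

```python
from importlib.util import find_spec

def pyarrow_available() -> bool:
    # Checking for the module spec avoids paying pyarrow's full import
    # cost when we only need to decide which code path to take.
    return find_spec("pyarrow") is not None
```

Note that a find_spec-based check avoids the one-time import delay mentioned above, at the cost of importing pyarrow later when the category values actually need to be extracted.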

@@ -28,6 +28,10 @@ jobs:
pip install .[dev]
# pip install "selenium<4.3.0"
# pip install altair_saver
- name: Maybe uninstall optional dependencies
Contributor

For future readers of this code, it would be good to add an explanation, or a reference to this GitHub issue, of why we do this for Python 3.9.

Contributor Author

Done in 52cd0a8

@@ -436,7 +437,7 @@ def sanitize_arrow_table(pa_table):

def parse_shorthand(
shorthand: Union[Dict[str, Any], str],
data: Optional[pd.DataFrame] = None,
data: Optional[_DataFrameLike] = None,
Contributor

Would this raise a type-checking error for pandas versions prior to 1.5?

Contributor Author

I'm not sure, but this is a good callout. This type should be Union[_DataFrameLike, pd.DataFrame] anyway, since it supports pandas DataFrames that don't have a __dataframe__ method.

Contributor Author

Signatures updated in c4a9b08

@jonmmease
Contributor Author

Thanks for the review @mattijn!

Initially, I thought of doing a parallel implementation: keeping the existing pandas column type-checking mechanism as the first choice, while exploring experimental support for other dataframe libraries using the DataFrame Interchange Protocol.
The current PR proposes to change the column type-checking to use pandas as the fallback.

I initially started the implementation as you described (only using the dataframe interchange protocol approach if the input is a non-pandas DataFrame). This would work fine, but I realized that it would be challenging to test thoroughly because none of our existing tests would exercise the new logic. By flipping it around, we ensure that the new logic is thoroughly tested (as all of the existing tests now use the new logic). And by uninstalling pyarrow in one of the CI jobs, we ensure that the old logic is still thoroughly tested as well.

If we do this here, should we do similar for the data serialization?

Yeah, I think we could use this strategy for data serialization in the future as well. By routing pandas serialization through the new DataFrame Interchange Protocol logic we would ensure that the new logic is well tested and complete. And then we can make sure the fallback pandas logic keeps working by running the tests without pyarrow installed (as this PR does).

Contributor

@mattijn mattijn left a comment

Thanks @jonmmease, for addressing my questions! Can you add a note of this change in the release notes? Thanks again!

@jcrist

jcrist commented Aug 1, 2023

Thanks y'all!

With some additional work in ibis I can now confirm that passing an ibis table to altair:

  • works, even with type inference
  • only executes the backing query once, rather than multiple times

A quick tiny notebook demoing things working: https://nbviewer.org/gist/jcrist/cd905e2181bca3137bbf74aaafd39d3f

@binste
Contributor

binste commented Aug 1, 2023

As an ibis user, this is very exciting 🥳
