Support for type inference for DataFrames using the DataFrame Interchange Protocol #3114
Conversation
If the pyarrow data interchange module is available and the dataset has a __dataframe__ method, infer column types based on the DataFrame Interchange Protocol. Fall back to the pandas implementation for pandas DataFrames if pyarrow is not available or pandas is older than 1.5.
else:
    raise ValueError(f"Unexpected DtypeKind: {kind}")
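For context, the dispatch that ends in this else branch might be sketched as follows. This is an illustrative sketch, not the PR's actual code: it assumes pandas >= 1.5 (so __dataframe__ is available), and the integer DtypeKind values come from the interchange protocol specification.

```python
import pandas as pd

# DtypeKind values defined by the DataFrame Interchange Protocol:
# INT=0, UINT=1, FLOAT=2, BOOL=20, STRING=21, DATETIME=22, CATEGORICAL=23
def infer_vegalite_types(df):
    """Map each column to a Vega-Lite type via the interchange protocol.

    Hypothetical helper for illustration; the real mapping in the PR
    may differ (e.g. in how booleans and categoricals are handled).
    """
    dfi = df.__dataframe__()
    types = {}
    for name in dfi.column_names():
        kind = int(dfi.get_column_by_name(name).dtype[0])
        if kind in (0, 1, 2):      # INT, UINT, FLOAT
            types[name] = "quantitative"
        elif kind in (20, 21):     # BOOL, STRING
            types[name] = "nominal"
        elif kind == 22:           # DATETIME
            types[name] = "temporal"
        elif kind == 23:           # CATEGORICAL (ordered case handled elsewhere)
            types[name] = "nominal"
        else:
            raise ValueError(f"Unexpected DtypeKind: {kind}")
    return types

result = infer_vegalite_types(
    pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})
)
```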
cc @jcrist for the parts of the DataFrame interchange protocol that I ended up using.
A few comments @jonmmease!
Initially, I had thought of doing a parallel implementation: using the existing pandas column type-checking mechanism as the first choice, while exploring experimental support for other dataframe libraries through the dataframe interchange protocol.
The current PR proposes to change the column type-checking to use pandas as the fallback instead. If we do this here, should we do something similar for the data serialization?
altair/utils/core.py
Outdated
    kind = column.dtype[0]
except NotImplementedError as e:
    # Edge case hack:
    # dtype access fails for pandas column with datetime64[ns] type,
Is there an issue at the pandas GitHub repository that we can refer to?
I'll take a look
I opened one and added it to the comment in 5f4fc48.
Thanks. One double-check: it's just the dtype that cannot be retrieved in pandas-dev/pandas#54239. The data itself, when using this dtype, can still be serialized correctly.
Actually, this same error occurs when converting to arrow through the DataFrame interchange protocol with:
from pyarrow.interchange import from_dataframe
from_dataframe(data)
Currently, we don't do this for pandas DataFrames because we use the pandas specific serialization logic. This is something we'll run into when trying to push pandas DataFrames through the DataFrame Interchange Protocol serialization logic. I'm not sure of the best path forward here, but I don't think it's blocking for this PR.
What do you think?
Not blocking here. Raising upstream and circumventing here is fine.
How to approach this best in the data serialization logic is another issue/PR👍
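The workaround discussed in this thread (catch the upstream failure and circumvent it) could be sketched like so. Both the helper name and the decision to fall back to DATETIME are assumptions for illustration, since the only known failing case is datetime64[ns] columns.

```python
import pandas as pd

def dtype_kind_or_datetime(column):
    """Return the interchange DtypeKind for a column, working around
    pandas-dev/pandas#54239, where dtype access raises NotImplementedError
    for datetime64[ns] columns (hypothetical helper, not the PR's code)."""
    try:
        return int(column.dtype[0])
    except NotImplementedError:
        # Assumption: the only known failure mode is datetime64[ns],
        # so treat the column as DATETIME (DtypeKind 22).
        return 22

# Normal case: an integer column reports DtypeKind.INT (0).
col = pd.DataFrame({"a": [1, 2]}).__dataframe__().get_column_by_name("a")
kind = dtype_kind_or_datetime(col)

class _BrokenDtype:
    # Stand-in for a column whose dtype access raises (illustrative only).
    @property
    def dtype(self):
        raise NotImplementedError

fallback_kind = dtype_kind_or_datetime(_BrokenDtype())
```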
    and column.describe_categorical["is_ordered"]
    and column.describe_categorical["categories"] is not None
):
    # Treat ordered categorical column as Vega-Lite ordinal
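Pulled out of its surrounding function, the check in this diff could be sketched as a small classifier (a sketch assuming pandas >= 1.5, not the PR's exact code):

```python
import pandas as pd

def vegalite_type_for_categorical(column):
    """Classify an interchange categorical column: ordered categoricals
    map to Vega-Lite "ordinal", everything else to "nominal"."""
    desc = column.describe_categorical
    if desc["is_ordered"] and desc["categories"] is not None:
        # Treat ordered categorical column as Vega-Lite ordinal
        return "ordinal"
    return "nominal"

ordered = pd.DataFrame(
    {"r": pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True)}
)
col = ordered.__dataframe__().get_column_by_name("r")
inferred = vegalite_type_for_categorical(col)
```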
Could we introduce the pyarrow_available() check here instead of at the existing proposed location, since it is only necessary for extracting the category values of an ordered categorical column?
I'm not sure if the speed-up from avoiding the pyarrow import here is worth it for standard usage of dataframe libraries.
Or would this make it difficult to go back to the pandas fallback?
Or would this become difficult to go back to the fallback of pandas?
Yeah, I used pyarrow_available() up above to make it easier to know whether to use the fallback logic. An alternative would be to raise a special exception if we get to this point and pyarrow is not available, and then use the fallback logic if that exception is raised. But this seemed more complicated than it was worth to me.
You do make a good point that this approach will result in pyarrow always being imported. This does introduce a small one-time delay. But on the other hand, if our future direction is to rely on pyarrow and the dataframe interchange protocol instead of pandas, then we'll end up importing pyarrow at some point.
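One cheap way to implement such an availability check without paying the import cost up front is to consult the import machinery. This is a generic sketch; the PR's actual pyarrow_available() helper may be implemented differently (e.g. via a try/except import).

```python
from importlib.util import find_spec

def pyarrow_available() -> bool:
    """Report whether pyarrow could be imported, without importing it.

    find_spec only locates the package on disk, so calling this avoids
    the one-time import delay discussed above."""
    return find_spec("pyarrow") is not None

available = pyarrow_available()
```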
@@ -28,6 +28,10 @@ jobs:
        pip install .[dev]
        # pip install "selenium<4.3.0"
        # pip install altair_saver
    - name: Maybe uninstall optional dependencies
For future readers of this code, it's probably good to add some explanation, or a reference to this GitHub issue, for why we do this for Python 3.9.
Done in 52cd0a8
altair/utils/core.py
Outdated
@@ -436,7 +437,7 @@ def sanitize_arrow_table(pa_table):


 def parse_shorthand(
     shorthand: Union[Dict[str, Any], str],
-    data: Optional[pd.DataFrame] = None,
+    data: Optional[_DataFrameLike] = None,
Would this raise a type-checking error for pandas with versions prior 1.5?
I'm not sure, but this is a good callout. This type should be Union[_DataFrameLike, pd.DataFrame] anyway, since it supports pandas DataFrames that don't have __dataframe__ methods.
Signatures updated in c4a9b08
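The updated signature discussed above could look something like this. A sketch only: the Protocol body and exact annotations are assumptions, and the real _DataFrameLike lives in altair/utils, not here.

```python
import inspect
from typing import Any, Dict, Optional, Protocol, Union

import pandas as pd

class _DataFrameLike(Protocol):
    """Anything exposing the DataFrame Interchange Protocol entry point
    (hypothetical stand-in for Altair's actual protocol type)."""
    def __dataframe__(self, *args: Any, **kwargs: Any) -> Any: ...

def parse_shorthand(
    shorthand: Union[Dict[str, Any], str],
    # Accept both interchange-capable objects and plain pandas DataFrames
    # (which may predate the __dataframe__ method).
    data: Optional[Union[_DataFrameLike, pd.DataFrame]] = None,
) -> Dict[str, Any]:
    # Body elided; only the signature is the point of this sketch.
    ...

accepts_data = "data" in inspect.signature(parse_shorthand).parameters
```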
Thanks for the review @mattijn!
I initially started the implementation as you described (only using the dataframe interchange protocol approach if the input is a non-pandas DataFrame). This would work fine, but I realized that it would be challenging to test thoroughly because none of our existing tests would exercise the new logic. By flipping it around, we ensure that the new logic is thoroughly tested (as all of the existing tests now use the new logic). And by uninstalling pyarrow in one of the CI jobs, we ensure that the old logic is still thoroughly tested as well.
Yeah, I think we could use this strategy for data serialization in the future as well. By routing pandas serialization through the new DataFrame Interchange Protocol logic we would ensure that the new logic is well tested and complete. And then we can make sure the fallback pandas logic keeps working by running the tests without pyarrow installed (as this PR does).
Thanks @jonmmease, for addressing my questions! Can you add a note of this change in the release notes? Thanks again!
Thanks y'all! With some additional work in ibis, I can now confirm that passing an ibis table to altair works:
A quick tiny notebook demoing things working: https://nbviewer.org/gist/jcrist/cd905e2181bca3137bbf74aaafd39d3f
As an ibis user, this is very exciting 🥳
Closes #3112
Approach
After discussion and experimentation in #3112, it looked to me like the best approach for generalizing our column type inference beyond pandas is to look up column names and types in the return value of the __dataframe__ protocol method. If we do this (rather than converting the input dataset to an Arrow table and accessing the Arrow schema), we leave it up to the implementing library whether accessing the column type information results in a full conversion of the entire dataset. For example, @jcrist noted that today Ibis will eagerly perform a conversion to Arrow when the __dataframe__ method is called. But if there's a motivating use case (like the functionality in this PR), it would be possible to rework the architecture so that accessing only the column type info is cheap and doesn't trigger a query to the backend database.

Initially, I was hoping this approach wouldn't require the pyarrow library, so that we could replace our existing pandas-based approach. But that wasn't quite true. In order to extract the category values for an ordered categorical column, I needed to use a pyarrow column_to_array function to convert the low-level buffers and offsets of the protocol into a Python list of category values. Because of this, I kept the pure-pandas implementation around as the fallback that is used when pyarrow is not available, or pandas is older than version 1.5 (when the __dataframe__ method was introduced).

Testing updates
I wanted to be able to run all of our tests with the fallback logic path (which runs when pyarrow is not installed or pandas is older than 1.5), so I updated various tests to skip instead of fail when pyarrow and vegafusion are not installed. Then I updated the GitHub Actions build job to uninstall pyarrow and vegafusion when testing Python 3.9.

Interchange Types
The DataFrame interchange protocol is helpfully defined using typed abstract Python classes at https://data-apis.org/dataframe-protocol/latest/API.html. To improve the development experience, and to help mypy, I copied the interface types into altair/utils/_dfi_types.py and updated our _DataFrameLike protocol type to return the proper DataFrame interface type. I don't know if we'll end up using the protocol directly in other places, but I think it's a pretty clean way to document, and type check, our usage of the protocol.

Example
First create a chart from a pandas DataFrame without encoding types specified. Note that we turn MPAA_Rating into an ordered categorical data type, and the category order is preserved in the bar chart.

Now convert the input to arrow and make the same chart. This was an error previously, but now works just like pandas, and the category order is preserved the same way.