[ENH] Add Support for pd.Index, pd.Series, range to array type #3237

ChiaLingWeng · 2023-10-24T08:50:25Z

try to deal with #2808 and #2877
add

            for k, v in list(kwds.items()):
                if k not in (list(ignore) + ["shorthand"]):
                    if isinstance(v, (pd.Series, pd.Index)):
                        kwds[k] = v.to_list()
                    elif isinstance(v, range):
                        kwds[k] = list(v)
                else:
                    kwds.pop(k, None)

to convert to array type if it's pd.Series, pd.Index or range.
Not sure if it's right approach to convert inside to_dict() but not before (maybe in _todict?)

It can work on some test cases mentioned:

import altair as alt
from vega_datasets import data
import random
import pandas as pd
import numpy as np

n = 100
df = pd.DataFrame({"Category": [np.random.choice(["A","B","C","D"]) for i in range(n)],      
                   "Variable": [np.random.normal(0, 10) for i in range(n)]})

grouped = df.loc[:,['Category', 'Variable']] \
    .groupby(['Category']) \
    .median() \
    .sort_values(by='Variable').index

chart = alt.Chart(df).mark_boxplot().encode(
    x=alt.X("Category",sort=grouped),
    y='Variable'
)

chart

import altair as alt
import pandas as pd
from vega_datasets import data

barley = data.barley()

barley['variety'] = pd.Categorical(
    barley['variety'],
    ordered=True,
    categories=[
        'Manchuria',
         'No. 457',
         'No. 462',
         'No. 475',
         'Glabron',
         'Svansota',
         'Velvet',
         'Trebi',
         'Wisconsin No. 38',
         'Peatland'
    ]
)

chart = alt.Chart(barley).mark_bar().encode(
    x=alt.X('variety', sort=barley['variety'].cat.categories),  # This line needs manual conversion
    y=alt.Y('sum(yield)'),
    color='site:N'
)

chart

joelostblom · 2023-10-25T01:55:43Z

Thanks for working on this @ChiaLingWeng ! And for all you contributions recently! This will be a convenient addition.

I am not sure where would be the best place for this logic, and don't have time to dive deeper myself right now, but others might have thoughts on this.

Regarding the logic, I wonder if we can make it more general to support any iterable (anything that can be converted into a list). This would allow us to support other dataframe libraries to as well as tuples, numpy arrays, etc, without special tests for each. I think all values that altair can take pass to altair is either a string (column names), dictionary (alt.value, alt.condition, etc), or class (alt.Color, alt.X, etc). So maybe we can just do something like:

if ~isinstance(v, str | dict | type):  # type is for class
    list(v)  # this works with any iterable, which includes series, indexes, arrays, etc

This throws an informative error too pointing out that the object is not an iterable. There might be some grave oversight here because I haven't tried things out, but the general idea of making this solution more general would be helpful I think.

ChiaLingWeng · 2023-10-27T06:22:58Z

@joelostblom Thanks for your suggestion!
I will try to find if there's more general change for this.

mattijn · 2023-11-05T09:06:52Z

Would

elif hasattr(v, '__iter__'):
    kwds[k] = list(v)

be sufficient? Or is this too loose?

binste · 2023-11-05T09:20:35Z

list takes any iterable: https://docs.python.org/3/library/stdtypes.html#list so I think we could do what @mattijn suggested. Instead of the hasattr check, it's more idiomatic to use the typing.Iterable protocol in an isinstance check:

joelostblom · 2023-11-06T02:03:24Z

The issue with both of these approaches would be that they catch strings as well, which is why I went for explicitly stating the types not to iterate. But maybe checking if the object is an iterable and not a string would work well?

mattijn · 2023-11-06T20:56:18Z

Good points @joelostblom and @binste! My first thought that it was OK to have a str trueing as Iterable, but since altair accepts different type of inputs for a single argument, this can likely lead to interference. Example:

import altair as alt
import pandas as pd

data = {'x': ['a', 'b', 'c', 'd', 'e', 'f'], 'y': [5, 3, 8, 4, 6, 2]}
df = pd.DataFrame(data)

alt.Chart(df).mark_bar().encode(
    x=alt.X('x', sort=list('cdfabe')),  # `sort='cdfabe'` can be come ['c', 'd', 'f', 'a', 'b', 'e']
    y='y'
)

alt.Chart(df).mark_bar().encode(
    x=alt.X('x', sort='y'),  # `sort='y'` should not become ['y']
    y='y'
)

Probably something you already realised, @joelostblom.
Is it a good idea to define a location with our defined -permissive, but constrained- input types? There we could define types such as ListLike (but not str) and maybe also others such as DataFrameLike that can be accessed internally but also used as part of the public API?

joelostblom · 2023-11-07T02:39:37Z

Yes, exactly, that's the reason we need to avoid catching strings in this check. I would be in favor of developing our own types if there is a clear benefit over writing something like not isinstance(v, str) and isinstance(v, Iterable). I know numpy has an ArrayLike type, but that incluced strings and can't be used with isinstance in general (not sure if that would be the same if we developed our own ListLike type). For DataFrameLike I thought what we are doing currently it was enough to check for the dataframe exchange protocol dunder string? Unless you mean for the typing hints specifically, then I see the convenience of having ListLike (or ArrayLike and DataFrameLike

ChiaLingWeng · 2023-11-07T06:24:51Z

Not sure if this is strict enough, but from this discussion, I found pd.api.types.is_list_like() may be useful.

mattijn · 2023-11-11T13:30:57Z

Thanks @ChiaLingWeng! That seems to be the behaviour we are after here as well, but then without depending on pandas itself. Maybe we can use the same implementation as adopted within pandas, but to me its not really obvious how this function is implemented (I only can find this).

@joelostblom, yes I was thinking to define these as a pubic type definitions that also can be used by users prior pushing their data into Altair.

joelostblom · 2023-11-11T17:12:47Z

Nice find @ChiaLingWeng ! I agree with @mattijn that we if we could re-implement this without having to import pandas that would be great (since it would help us make pandas an optional dependency for altair in the future).

I believe the actual pandas implementation is here because it is a Cython function. It seems to me that they use a similar approach to what we discussed here; essentially checking if it is an Iterable and not a string, so something like:

from typing import Iterable

if not isinstance(v, str | dict | type | bytes) and isinstance(v, Iterable):  # type is for class
    v = list(v)

We could also do a try/catch with list(v) right away instead of the explicit Iterable check , but maybe it is clearer with the explicit check as per the above. I don't know if we need any additional error handling; I think implementing this and then running the rest suite to see if something breaks would be the next step.

joelostblom · 2024-07-17T08:07:29Z

@ChiaLingWeng I'm just checking in if you're interested in continue working on this PR. It doesn't have to be right now, but let us know if you're planning to take it up again in the future.

ChiaLingWeng · 2024-07-19T00:26:31Z

@joelostblom It's been a while since I checked on this, but yes, I'm willing to work on this PR. Let me know if you have any suggestions.

dangotbanned · 2024-07-19T09:15:39Z

@ChiaLingWeng

FYI this PR will be the main source of conflicts, so be sure to check it out to see how to handle pandas

feat: make pandas and NumPy optional dependencies, don't require PyArrow for plotting with Polars/Modin/cuDF #3452

joelostblom · 2024-07-24T08:55:24Z

Thanks @ChiaLingWeng! As @dangotbanned mentioned, altair no longer requires pandas, so we will need to make sure that the implementation here is not using any pd. import and instead has a more general solution. I think reimplementing what pd.api.types.is_list_like() is doing in Python would be the first thing to try, along the line of the example I mention in #3237 (comment), but we can probably refine it further to:

from typing import Iterable

if isinstance(v, Iterable) and not isinstance(v, str | dict):
    v = list(v)

This means that all iterable objects that can be converted to list should be, except if a string (column name) or dict (alternative helper class definition) is passed. Implementing this and then running the test suite to see if anything fails would be the first step. After that, we can consider if we want to make this more specific and only do the conversions for keywords that accept iterables, such as sort, values, etc, but we may not need to.

dangotbanned · 2024-07-24T10:56:27Z

narwhals.from_native might have some of the behaviour needed for this, once you've narrowed past stdlib types.

joelostblom · 2024-07-24T12:36:18Z

Interesting, we would not only want it to work for data frame / series-like objects though, but any iterable, including tuples, ranges, arrays, etc.

dangotbanned · 2024-07-24T13:01:19Z

narwhals.from_native might have some of the behaviour needed for this, once you've narrowed past stdlib types.
@dangotbanned

Interesting, we would not only want it to work for data frame / series-like objects though, but any iterable, including tuples, ranges, arrays, etc.
@joelostblom

Apologies I think the language I used wasn't super helpful.

So for https://github.com/ChiaLingWeng/altair/blob/1f86ae328711c8ebb5a0d5553de3ad77cd8a6625/tools/schemapi/schemapi.py#L960

for k, v in list(kwds.items()):
    if k not in (list(ignore) + ["shorthand"]):
        if isinstance(v, (pd.Series, pd.Index)): # <--- this part
            kwds[k] = v.to_list()
        elif isinstance(v, range):
            kwds[k] = list(v)
    else:
        kwds.pop(k, None)

I also wrongly assumed the changes were to be made in _to_dict and not SchemaBase.to_dict:

altair/tools/schemapi/schemapi.py

Lines 487 to 516 in 781c507

    
           def _todict(obj: Any, context: dict[str, Any] | None, np_opt: Any, pd_opt: Any) -> Any: 
        
               """Convert an object to a dict representation.""" 
        
               if np_opt is not None: 
        
                   np = np_opt 
        
                   if isinstance(obj, np.ndarray): 
        
                       return [_todict(v, context, np_opt, pd_opt) for v in obj] 
        
                   elif isinstance(obj, np.number): 
        
                       return float(obj) 
        
                   elif isinstance(obj, np.datetime64): 
        
                       result = str(obj) 
        
                       if "T" not in result: 
        
                           # See https://github.com/vega/altair/issues/1027 for why this is necessary. 
        
                           result += "T00:00:00" 
        
                       return result 
        
               if isinstance(obj, SchemaBase): 
        
                   return obj.to_dict(validate=False, context=context) 
        
               elif isinstance(obj, (list, tuple)): 
        
                   return [_todict(v, context, np_opt, pd_opt) for v in obj] 
        
               elif isinstance(obj, dict): 
        
                   return { 
        
                       k: _todict(v, context, np_opt, pd_opt) 
        
                       for k, v in obj.items() 
        
                       if v is not Undefined 
        
                   } 
        
               elif hasattr(obj, "to_dict"): 
        
                   return obj.to_dict() 
        
               elif pd_opt is not None and isinstance(obj, pd_opt.Timestamp): 
        
                   return pd_opt.Timestamp(obj).isoformat() 
        
               else: 
        
                   return obj

joelostblom · 2024-07-30T07:25:53Z

@ChiaLingWeng i think we have a good solution to this issue in #3501, so I will close this PR as superseded by that one. Thank you again for your work and contributions here.

ChiaLingWeng added 3 commits October 24, 2023 16:43

[ENH] Add Support for pd.Index and pd.Series to array type

9065623

Add support for pd.Index to_list()

defb839

Add range to list support

1f86ae3

ChiaLingWeng changed the title ~~[ENH] Add Support for pd.Index and pd.Series to array type~~ [ENH] Add Support for pd.Index, pd.Series, range to array type Oct 25, 2023

mattijn mentioned this pull request Nov 5, 2023

Type hints: Infer types from vegalite schema for autogenerated code #3208

Merged

dangotbanned mentioned this pull request Jul 24, 2024

feat: Support a wider range of iterables in SchemaBase.to_dict #3501

Merged

1 task

joelostblom closed this Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Add Support for pd.Index, pd.Series, range to array type #3237

[ENH] Add Support for pd.Index, pd.Series, range to array type #3237

ChiaLingWeng commented Oct 24, 2023 •

edited

Loading

joelostblom commented Oct 25, 2023

ChiaLingWeng commented Oct 27, 2023 •

edited

Loading

mattijn commented Nov 5, 2023

binste commented Nov 5, 2023

joelostblom commented Nov 6, 2023 •

edited

Loading

mattijn commented Nov 6, 2023

joelostblom commented Nov 7, 2023

ChiaLingWeng commented Nov 7, 2023

mattijn commented Nov 11, 2023 •

edited

Loading

joelostblom commented Nov 11, 2023 •

edited

Loading

joelostblom commented Jul 17, 2024

ChiaLingWeng commented Jul 19, 2024

dangotbanned commented Jul 19, 2024

joelostblom commented Jul 24, 2024

dangotbanned commented Jul 24, 2024

joelostblom commented Jul 24, 2024

dangotbanned commented Jul 24, 2024

joelostblom commented Jul 30, 2024

[ENH] Add Support for pd.Index, pd.Series, range to array type #3237

[ENH] Add Support for pd.Index, pd.Series, range to array type #3237

Conversation

ChiaLingWeng commented Oct 24, 2023 • edited Loading

joelostblom commented Oct 25, 2023

ChiaLingWeng commented Oct 27, 2023 • edited Loading

mattijn commented Nov 5, 2023

binste commented Nov 5, 2023

joelostblom commented Nov 6, 2023 • edited Loading

mattijn commented Nov 6, 2023

joelostblom commented Nov 7, 2023

ChiaLingWeng commented Nov 7, 2023

mattijn commented Nov 11, 2023 • edited Loading

joelostblom commented Nov 11, 2023 • edited Loading

joelostblom commented Jul 17, 2024

ChiaLingWeng commented Jul 19, 2024

dangotbanned commented Jul 19, 2024

joelostblom commented Jul 24, 2024

dangotbanned commented Jul 24, 2024

joelostblom commented Jul 24, 2024

dangotbanned commented Jul 24, 2024

joelostblom commented Jul 30, 2024

ChiaLingWeng commented Oct 24, 2023 •

edited

Loading

ChiaLingWeng commented Oct 27, 2023 •

edited

Loading

joelostblom commented Nov 6, 2023 •

edited

Loading

mattijn commented Nov 11, 2023 •

edited

Loading

joelostblom commented Nov 11, 2023 •

edited

Loading