pydantic/fastapi style validation #253

Closed
jeffzi opened this issue Jul 28, 2020 · 26 comments · Fixed by #300
Labels
discussion needed enhancement New feature or request
Milestone

Comments

@jeffzi
Collaborator

jeffzi commented Jul 28, 2020

Hi. First of all thanks for developing pandera :)

Is your feature request related to a problem? Please describe.
pandera can be very verbose for the common use case where we want to validate all inputs and outputs.

import pandas as pd
import pandera as pa

simple = pa.DataFrameSchema({"X" : pa.Column(pa.Int)})

@pa.check_input(simple, "a")
@pa.check_input(simple, "b")
@pa.check_output(simple)
def transform_data(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    a["X"] = a["X"] + b["X"]
    return a

Describe the solution you'd like

I'd like to propose an alternative syntactic sugar inspired by the pydantic and fastapi libraries.

# new pandera api (non-working draft implementation)

class BaseSchema(pa.DataFrameSchema):
    def __init__(self):
        pass  # collect class variables into columns/index

SchemaType = TypeVar("SchemaType", bound=BaseSchema)

class DataSet(pd.DataFrame, Generic[SchemaType]):
    pass

# user-facing api

class Simple(BaseSchema):
    column = pa.Column(pa.Int)

@pa.typechecked
def transform_data(a: DataSet[Simple], b: DataSet[Simple]) -> DataSet[Simple]:
    a["X"] = a["X"] + b["X"]
    return a

This example is syntactically correct.

typechecked would be a new decorator that grabs both input and return type hints and validates them if they are a subclass of the DataSet type.

I think this proposal is easier to read and would satisfy the majority of the use cases. A side benefit is that the IDE knows which columns are available and will offer auto-completion. The user can still fallback to check_input and check_output when needed.
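
For illustration, here is a minimal sketch of what such a decorator could look like, assuming the DataSet/BaseSchema draft above and a hypothetical to_schema() classmethod that builds a DataFrameSchema from the class-based definition (not an existing pandera API):

import functools
import inspect
import typing

def typechecked(fn):
    """Validate arguments and return values annotated as DataSet[SomeSchema]."""
    hints = typing.get_type_hints(fn)
    sig = inspect.signature(fn)

    def schema_for(annotation):
        # DataSet[Simple] -> Simple, assuming DataSet is a Generic subclass
        if annotation is not None and typing.get_origin(annotation) is DataSet:
            (schema_cls,) = typing.get_args(annotation)
            return schema_cls.to_schema()  # hypothetical: BaseSchema -> DataFrameSchema
        return None

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            schema = schema_for(hints.get(name))
            if schema is not None:
                schema.validate(value)
        result = fn(*args, **kwargs)
        out_schema = schema_for(hints.get("return"))
        if out_schema is not None:
            out_schema.validate(result)
        return result

    return wrapper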

Describe alternatives you've considered

An alternative would be a simpler decorator that checks both inputs and output, with keyword arguments for the inputs and an out argument for the output.

import pandas as pd
import pandera as pa

simple = pa.DataFrameSchema({"X" : pa.Column(pa.Int)})

@pa.validate(a=simple, b=simple, out=simple)
def transform_data(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    a["X"] = a["X"] + b["X"]
    return a

Additional context

I can prepare a PR of a prototype if you think this feature makes sense.

@jeffzi jeffzi added the enhancement New feature or request label Jul 28, 2020
@cosmicBboy
Collaborator

cosmicBboy commented Jul 28, 2020

love this idea @jeffzi ! To be honest I'm not entirely familiar with Python's typing interface (though I am a user of the feature), and maybe you can answer a few basic questions I have :)

  1. Why is the DataSet class necessary? (i.e. using a pydantic model in fastapi would use Simple directly in the type sig)
  2. Would it be possible to convert a DataFrameSchema object to the type, so a user can re-use a schema defined via the object-based schema, like:
simple = pa.DataFrameSchema({"X" : pa.Column(pa.Int)})

@pa.typechecked
def transform_data(a: DataSet[simple.type], b: DataSet[simple.type]) -> DataSet[simple.type]:
    a["X"] = a["X"] + b["X"]
    return a
  3. Would it be too much to ask the type checker to also be aware of the data type of each column? E.g. if we try to element-wise add a string column and an int column, could the static type checker complain? (I suspect this won't be possible.)
  4. In the example you provided, shouldn't the class attribute be X to match the column referenced in the function below it?
class Simple(BaseSchema):
    X = pa.Column(pa.Int)

One concern I have about supporting two APIs is that there would then be two (very different) ways of defining dataframe schemas, which may confuse users, especially newcomers. That said, I'm familiar enough with both pydantic and fastapi, and maybe pandera should get with the times and support two programming paradigms 🤔.

I do, however, love the IDE benefits of the proposed API, and how nice would it be to potentially integrate dataframe schemas with fastapi!? All that said, I think this feature makes sense: it's good to adopt principled design paradigms that are widely adopted by the community (aka pydantic) and to support various ways of doing things to ease adoption of pandera. Still, I do want to discuss further to make sure I better understand the implications of this class-based API.

I think starting a typing subpackage under pandera would be a good place to start getting a prototype going. I'd suggest decomposing the problem into separate issues, as we probably have more to discuss underneath each one.

Having slept on this idea 🙂, I think we should fully discuss the implications of this proposal, breaking it down into two separate pieces:

  1. implementing the BaseSchema type interface for the class-based API. Do you foresee this API using the existing object-based API to execute the run-time validation? I anticipate that this may be a fairly large undertaking, and we should probably discuss implementation details:
    1. how to specify the index component of the schema
    2. how to specify dataframe-level checks
    3. in pydantic, the semantics of the class variables are property: type = default whereas under this proposal type is missing, and the assignment is of a Column object, which contains the Checks. Would we also want to follow the validator classmethod decorator semantics in BaseSchema to implement Checks? (see the pydantic sketch after this list)
  2. the pa.typechecked decorator (would suggest pa.check_types to follow the check_* naming convention). I think the conciseness of the syntax in your proposal example is nice, but I think this decorator is contingent on a coherent design of point (1). If we can get that right, this should follow naturally.
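
For reference, here is a minimal pydantic (v1-style) model illustrating the property: type = default semantics and the validator classmethod decorator mentioned in point 1.3 (the Record model and its field names are just examples):

from pydantic import BaseModel, validator

class Record(BaseModel):
    x: int = 0  # property: type = default

    @validator("x")
    def x_is_non_negative(cls, v):
        # validators are classmethods that receive the parsed value
        if v < 0:
            raise ValueError("x must be >= 0")
        return v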

Just to give you a little background on the design principles of pandera, I wanted defining dataframe schemas to "feel like" defining a dataframe.

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1., 2., 3.],
})
schema = pa.DataFrameSchema({
    "col1": pa.Column(pa.Int),
    "col2": pa.Column(pa.Float),
})

Originally what I wanted to optimize was the developer experience for pandas users. The proposal in this issue borrows a few syntactic concepts from pydantic, which I think optimizes the developer experience for python users (of which pandas users are a subset). But if we borrow concepts from it, I think we should adopt them wholesale, i.e. find a way to have a pydantic integration with pandera so that pydantic users coming into pandera won't have to learn a bunch of new pandera-specific concepts. What I'm trying to assess now is:

  1. is it a good idea for pandera to have two ways of defining schemas?
  2. what would a pydantic <-> pandera integration look like? (see here for some ideas)
  3. are there any other benefits of this class-based API in addition to IDE being aware of columns and the typing syntax in function defs?
  4. is the timing right to implement this class-based API? from my understanding, one of pydantic's benefits comes from static type checking via mypy, which unfortunately won't be the case for pandera, since pandas data structures don't support type hinting yet: Typing Stubs and PEP 561 compatibility pandas-dev/pandas#28142, ENH: Add dtype-support for pandas' type-hinting pandas-dev/pandas#34248

On a separate note, I do think something like a pa.check_io decorator that you suggested as an alternative would be useful along the lines of the existing object-based API so that the user can check_io(x=in_schema, y=in_schema, out=out_schema, **validate_kwargs) for a more terse interface (I can make an issue for this).
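
As a rough illustration, such a check_io decorator could simply compose the existing check_input/check_output decorators; the sketch below is an assumption about the eventual API (argument names included), not a final design:

import functools
import pandera as pa

def check_io(out=None, **input_schemas):
    """Validate named dataframe arguments and the return value of a function."""
    def decorator(fn):
        wrapped = fn
        for arg_name, schema in input_schemas.items():
            wrapped = pa.check_input(schema, arg_name)(wrapped)
        if out is not None:
            wrapped = pa.check_output(out)(wrapped)
        return functools.wraps(fn)(wrapped)
    return decorator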

@jeffzi
Collaborator Author

jeffzi commented Jul 29, 2020

I'm glad you like the idea :)

One assumption I did not explicitly state is that mypy support should be mandatory. Not supporting it is most likely a deal-breaker for many users, rightfully so. Some constructions are syntactically correct, but mypy will throw an error because the types don't match.

  1. Using Simple directly indeed works, but the IDE doesn't know it's a DataFrame and you lose auto-completion. fastapi and pydantic use models as containers with a few extra methods.

  2. mypy will reject DataSet[simple.type] because it won't be able to infer the type. A TypeVar is an annotation for static analysis.

  3. The main limitation here is pandas's lack of support for type hints (and it would probably require a mypy plugin as well).

  4. Totally!

Questions regarding the implementation:

i. We could check the type of the class variables.

class Simple(BaseSchema):
    column = pa.Column(pa.Int)
    idx = pa.Index(pa.String, pa.Check(lambda x: x.str.startswith("index_")))

ii. pydantic relies on the @validator decorator for fields and @root_validator to validate the whole model. That would significantly complicate this proposal.
iii. I guess a nicer syntax would be column: pa.typing.Column[dtype] = "default value", where Column is a new type similar to the DataSet in my example. That would be much easier if pandas (and numpy) supported type hints such as Series[np.int64].

Thanks for explaining the design decisions. I agree adopting the pydantic conventions wholesale makes more sense. To be honest I was testing the waters to see if you'd be interested.

  1. If pandera supports 2 styles, users should be able to transition from one to another and not feel locked into a style. See https://pydantic-docs.helpmanual.io/usage/models/#dynamic-model-creation
  2. I'm not a fan of this particular notebook because the checks are not vectorized. A pydantic model contains a single element (row) whereas a DataFrame is a collection of rows. Extra utilities such as exporting models are already baked into a DataFrame. For those reasons, I don't think it would be meaningful to integrate pandera and pydantic.
  3. It reduces boilerplate and adds auto-completion. Subjectively, it may be less daunting and favor adoption of pandera. I don't see other benefits.
  4. pydantic's benefit is mostly runtime validation. You could use the standard dataclass to achieve static validation. In my opinion, the main problem is that the type part in property: type = default should come from pandas itself. As I mentioned, you could add a custom pa.typing.Column that supports dtype Generic and swap it for the official implementation when pandas has it.

It sounds like this feature would be a major endeavor. My 2 cents is that pa.check_io makes more sense in the near future.

@cosmicBboy
Collaborator

cosmicBboy commented Jul 30, 2020

Thanks for the discussion @jeffzi, I just made #254 for the near-term solution to your problem. I think a pydantic-style schema validation syntax would be quite valuable, seeing that that package has wide adoption and pandera would definitely benefit from the static type-checking this feature would offer.

Based on the points so far, it seems like the lack of type-hinting support for pandas is a big blocker. In the meantime, I would like to discuss with you what you think an ideal API would look like. Even if we don't come up with anything actionable right now, I would like to have these ideas written down for the record for when we do revisit this for implementation down the line. I sort of see two different approaches:

1. Embracing the pydantic style completely

class MySchema(BaseSchema):
    # these can use custom pa.typing.Column
    int_column: pd.Series[np.int64]
    str_column: pd.Series[str]

    # a special class attribute for the index so as to avoid conflicts
    # with columns named "index", can use custom pa.typing.Index
    __schema_index__: pd.Index[pa.Int]

    @column_validator("int_column")
    def int_column_gt_0(cls, series: pd.Series):
        return series > 0

    @column_validator("int_column")
    def int_column_lt_100(cls, series: pd.Series):
        return series < 100

    @column_validator("str_column")
    def str_column_checks(cls, series: pd.Series):
        return series.isin(["A", "B"])

    @column_validator("int_column", groupby="str_column")
    def int_column_groupby_checks(cls, series_dict: Dict[str, pd.Series]):
        return series_dict["A"].mean() > series_dict["B"].mean()

    @index_validator
    def index_checks(cls, index: pd.Index):
        return index > 0

    @dataframe_validator
    def dataframe_checks(cls, df):
        return "foo_column" not in df

2. Adopting pydantic style for typing, then delegating run-time validation to the existing pandera interface

class MySchema(BaseSchema):
    # these can use custom pa.typing.Column
    int_column: pd.Series[np.int64]
    str_column: pd.Series[str]

    # a special class attribute for the index so as to avoid conflicts
    # with columns named "index", can use custom pa.typing.Index
    __schema_index__: pd.Index[pa.Int]

    # under the hood, schema_checks creates an underlying
    # DataFrameSchema that BaseSchema uses for run-time validation
    validators = schema_checks(
        columns={
            "int_column": [
                pa.Check.greater_than(0),
                pa.Check.less_than(100),
                pa.Check(
                    lambda series_dict: series_dict["A"].mean() > series_dict["B"].mean(),
                    groupby="str_column"
                ),
            ],
           "str_column": pa.Check.isin(["A", "B"])
        },
        index=pa.Check.greater_than(0),
        # dataframe-level checks
        checks=pa.Check(lambda df: "foo_column" not in df)
    )

The pro of (1) is that the validator interface maps directly to pydantic validators. However, as I was writing these out, I found that for dataframe validation, writing one method per validator would lead to verbose schema definitions (I'm avoiding collapsing multiple validations into one method so that checks are atomic and identifiable for debuggability).

The pro of (2) is that the run-time validators are a lot more concise; the con is that users coming in from pydantic would need to learn about Checks (which, to be fair, is a pretty straightforward concept).

@jeffzi
Collaborator Author

jeffzi commented Aug 1, 2020

Actually, pydantic has constrained types to help with conciseness. The implementation is quite interesting: see e.g. constr.

Pandera could define similar types that leverage Checks behind the scenes. That way, validators would be limited to non-straightforward checks. Those complicated checks should be defined in their own functions rather than as one-liner lambdas, even with the current pandera API.

from pandera import conint, constr
from pandera.typing import Series, Index

class MySchema(BaseSchema):
    # these can use custom pa.typing.Column
    int_column: Series[conint(gt=0, lt=100)]
    # pydantic does not have isin. It would use Literal["A", "B"]
    str_column: Series[constr(isin=["A", "B"])]

    # a special class attribute for the index so as to avoid conflicts
    # with columns named "index", can use custom pa.typing.Index
    __schema_index__: Index[conint(gt=0)]


    @column_validator("int_column", groupby="str_column")
    def int_column_groupby_checks(cls, series_dict: Dict[str, pd.Series]):
        return series_dict["A"].mean() > series_dict["B"].mean()

    @dataframe_validator
    def dataframe_checks(cls, df):
        return "foo_column" not in df

I've replaced pd.Series with pandera.typing.Series since it should be a type, not the real class, similar to typing.List.

Is series_dict in your example a DataFrame? If it's a dataframe, why do we need a groupby argument?

@cosmicBboy
Collaborator

Thanks for the additional discussion @jeffzi! I think we have a lot of great ideas and material here to act on at some point in the future.

Is series_dict in your example a DataFrame? If it's a dataframe, why do we need a groupby argument?

It's a dictionary mapping discrete values from str_column to subsets of the int_column Series. It basically follows the behavior of the groupby keyword in the Check class.
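
Roughly, the series_dict corresponds to grouping int_column by the discrete values of str_column. In plain pandas terms (a sketch of the semantics, not pandera's internals):

import pandas as pd

df = pd.DataFrame({
    "int_column": [10, 20, 1, 2],
    "str_column": ["A", "A", "B", "B"],
})

# {"A": int_column values where str_column == "A", "B": ...}
series_dict = {group: series for group, series in df.groupby("str_column")["int_column"]}
check_result = series_dict["A"].mean() > series_dict["B"].mean()  # True for this data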

The constrained types are definitely something to look into for conciseness. Reading the pydantic docs more, it does seem like property: type = default can actually be property: type = Field(..., **kwargs), where the Field takes in additional validators. This actually brings me back to your original proposal, which makes a lot of sense to me now but which I missed before:

from pandera.typing import BaseSchema, Series, Column

# slightly modified
class Simple(BaseSchema):
    column: Series[np.int64] = Column(
        ...,  # positional default
        lt=100,
        gt=0,
        # additional kwargs will be named Checks
        is_monotonic=pa.Check(lambda s: np.diff(s) > 0),
        is_even=pa.Check(lambda x: x % 2 == 0, element_wise=True)
    )

I do wonder if the positional default makes sense for dataframes (i.e. a default value for an entire column if it's not provided in the Simple constructor). I'd be open to keeping it there, or even changing the semantics a little bit to fit the dataframe paradigm better, e.g. it could be a scalar that fills in null values, or a special function that wraps ffill and imputes null values from other values in the column.
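
In plain pandas terms, the two alternative "default" semantics floated above would look roughly like this (illustrative only, not a proposed API):

import pandas as pd

s = pd.Series([1.0, None, 3.0, None])

# scalar default: fill nulls in the column with a fixed value
filled_scalar = s.fillna(0.0)

# ffill-style default: impute nulls from previous values in the column
filled_ffill = s.ffill()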

@cosmicBboy
Collaborator

To summarize what we have so far:

  1. we want to support pydantic-style schema definitions for auto-complete, IDE support, static type-checking, and of course run-time validation.
  2. pandas doesn't have native support for typing its data structures, so something like Series[np.int64] would have to be implemented in pandera and eventually replaced by the native implementation whenever that happens.
  3. for run-time validators, users can specify:
    1. a typing.ColumnField class, similar to pydantic.Field to specify built-in validators.
    2. additional kwargs passed into typing.ColumnField for one-liner checks, e.g. Check(lambda x: x % 2 == 0, element_wise=True)
    3. a method decorated with column_validator similar to pydantic.validator but adopts the semantics of a Check.
    4. a method decorated with dataframe_validator to run checks on the entire dataframe

Open Questions

  • should typing.ColumnField take a positional argument for a default value or ... if required? i.e. does it make sense to have default columns in dataframes?
  • should the pandera implementations of Series, Index, and perhaps even DataFrame have the same name as their pandas counterparts?
  • typing.ColumnField seems to handle what constrained types achieve, at least with respect to conciseness... would something like Series[conint(lt=100)] still be desirable?

Implementation Draft (author @jeffzi, with edits from @cosmicBboy )

from typing import Generic, TypeVar

import numpy as np
import pandas as pd
import pandera as pa

# pandera.typing.base_schema module
class BaseSchema(pa.DataFrameSchema):
    def __init__(self):
        pass  # collect class variables into columns/index

SchemaType = TypeVar("SchemaType", bound=BaseSchema)

# Q: is the `DataFrame` naming okay here? `DataSet` seems too general
class DataFrame(pd.DataFrame, Generic[SchemaType]):
    pass

# user-facing api
from pandera.typing import DataFrame, Series, Index, ColumnField

class Schema(BaseSchema):
    col1: Series[np.int64] = ColumnField(lt=100, gt=0)
    col2: Series[str] = ColumnField(isin=["A", "B", "C"], nullable=True)

@pa.check_types
def transform_data(a: DataFrame[Schema], b: DataFrame[Schema]) -> DataFrame[Schema]:
    a["col1"] = a["col1"] + b["col1"]
    a["col2"] = a["col2"].combine_first(b["col2"])
    return a

I'm not sure what your appetite is for taking a shot at a prototype @jeffzi, but I'd be down to start working on something with you!

@jeffzi
Collaborator Author

jeffzi commented Aug 9, 2020

Thanks for the summary.

we want to support pydantic-style schema definitions for auto-complete, IDE support, static type-checking, and of course run-time validation.

I want to emphasize that we can't have static type-checking unless pandas is fully annotated (see pandas-dev/pandas#14468 and pandas-dev/pandas#26766). Even then, I think a mypy plugin or another tool would be necessary to cover all cases. For example:

class MySchema(BaseSchema):
    A: Series[str]

def transform(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    # Static checker should infer that A dtype switched to int. 
    # It's harder because I'm using a string "A" to refer to the column.
    df["A"] = 1  
    return df

data = pd.DataFrame({"A" : ["abc"]}) # static checker should infer that column A has string dtype
transform(data)

It's hard for static checkers to follow the flow of data due to the dynamic nature of pandas.

should typing.ColumnField take a positional argument for a default value or ... if required? i.e. does it make sense to have default columns in dataframes?

At work, I've seen situations where we want to force an output schema regardless of missing columns in the input. In that case, we add the missing columns filled with NA values. For example, when parsing JSON events, optional fields may not be present in the JSON structure, but we still want the output CSV to have those fields.
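
In plain pandas, that use case looks roughly like this (a sketch; the column names are made up):

import numpy as np
import pandas as pd

expected_columns = ["event_id", "user_id", "optional_field"]
events = pd.DataFrame({"event_id": [1, 2], "user_id": ["a", "b"]})

# add missing optional columns filled with NA, then enforce column order
for col in expected_columns:
    if col not in events.columns:
        events[col] = np.nan
events = events[expected_columns]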

should the pandera implementations of Series, Index, and perhaps even DataFrame have the same name as their pandas counterparts?

I'd say yes. The typing module uses the same names for types (capitalized), e.g. List. It is also consistent with what pandera does with dtypes: https://pandera.readthedocs.io/en/stable/generated/pandera.PandasDtype.html#pandera.PandasDtype

typing.ColumnField seems to handle what constrained types achieves, at least with respect to conciseness... would something like Series[conint(lt=100)] still be desireable?

Series[conint(lt=100)] is more concise but not mandatory in the first pass I think. It's probably more useful for pydantic because of Iterables like List[conint(lt=100)]. I don't know how pydantic would do that with a Field. In our case, everything is iterable.

Another thing to note: if you use both syntaxes on the same attribute, pydantic verifies that the constraints are not in direct contradiction: https://pydantic-docs.helpmanual.io/usage/schema/#unenforced-field-constraints

I'd be happy to put together a prototype. I'm fully expecting to discover new issues along the way but the process will be interesting.

@jeffzi jeffzi mentioned this issue Aug 12, 2020
@stephenkraemer

Hi,
it's exciting to read through this discussion, this is really cool.

I want to emphasize that we can't have static type-checking unless pandas is fully annotated (see pandas-dev/pandas#14468 and pandas-dev/pandas#26766). Even then, I think a mypy plugin or another tool would be necessary to cover all cases.

This is certainly very complex and probably a long term goal. Just in case that you didn't see this yet and that it might help a tiny bit during development: this repo contains decent (yet still incomplete) pandas stub files: https://github.com/predictive-analytics-lab/data-science-types.

Thanks for your effort, this is going to be awesome!

@jeffzi
Collaborator Author

jeffzi commented Sep 14, 2020

@stephenkraemer Thanks for the link, I didn't know about those stubs.

The stubs are certainly promising, but the typing information can't easily be extracted. Long story short, it would require more investigation to support the stubs. I'd rather wait until the new class API is stable.

Now, for the details.

There are 2 ways to get annotations:

  • __annotations__
  • get_type_hints: Same as __annotations__ but resolves Forward References.

Postponed Evaluation of Annotations is "required" to use the data-science-types stubs. This feature can be activated with from __future__ import annotations and will become the default in Python 3.10. Without postponed evaluation, you need string annotations like s: "pd.Series[float]" = pd.Series(...) 👎

The problem is that __annotations__ returns strings if postponed evaluation is activated, and get_type_hints() can't resolve the types defined by the data-science-types stubs:

from __future__ import annotations # required
from typing import List, get_type_hints

import numpy as np
import pandas as pd
from reprexpy import SessionInfo

class ListContainer:
    l: List[int]

print(f"{ListContainer.__annotations__}") # returns a string
#> {'l': 'List[int]'}
print(f"{get_type_hints(ListContainer)}") # returns a type
#> {'l': typing.List[int]}

class SeriesContainer:
    s: pd.Series[np.int32]

print(f"{SeriesContainer.__annotations__}") # returns a string that we could parse or evaluate somehow :-(
#> {'s': 'pd.Series[np.int32]'}
print(f"{get_type_hints(SeriesContainer)}") # can't evaluate
#> Traceback (most recent call last):
#> <ipython-input-1-c27aafd86c14> in <module>
#> ----> 1 print(f"{get_type_hints(SeriesContainer)}") # can't evaluate
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in get_type_hints(obj, globalns, localns)
#>    1227                 if isinstance(value, str):
#>    1228                     value = ForwardRef(value, is_argument=False)
#> -> 1229                 value = _eval_type(value, base_globals, localns)
#>    1230                 hints[name] = value
#>    1231         return hints
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in _eval_type(t, globalns, localns)
#>     268     """
#>     269     if isinstance(t, ForwardRef):
#> --> 270         return t._evaluate(globalns, localns)
#>     271     if isinstance(t, _GenericAlias):
#>     272         ev_args = tuple(_eval_type(a, globalns, localns) for a in t.__args__)
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in _evaluate(self, globalns, localns)
#>     516                 localns = globalns
#>     517             self.__forward_value__ = _type_check(
#> --> 518                 eval(self.__forward_code__, globalns, localns),
#>     519                 "Forward references must evaluate to types.",
#>     520                 is_argument=self.__forward_is_argument__)
#> <string> in <module>
#> TypeError: 'type' object is not subscriptable

SessionInfo()
#> Session info --------------------------------------------------------------------
#> Platform: Linux-5.8.6-1-MANJARO-x86_64-with-glibc2.29 (64-bit)
#> Python: 3.8
#> Date: 2020-09-14
#> Packages ------------------------------------------------------------------------
#> numpy==1.19.2
#> pandas==1.1.2
#> reprexpy==0.3.0

Created on 2020-09-14 by the reprexpy package

@jeffzi
Collaborator Author

jeffzi commented Sep 14, 2020

Found more information in PEP 484:

Stub files are files containing type hints that are only for use by the type checker, not at runtime.

pytypes has a @typechecked decorator for runtime type checking with support for stubfiles though.

@cosmicBboy cosmicBboy added this to the 0.5.0 release milestone Sep 16, 2020
@cosmicBboy
Collaborator

@jeffzi FYI I'm slating this to be part of the 0.5.0 release, to go with dropping py3.5.

@amitripshtos
Contributor

Hey - I'm very interested in this feature and willing to help! Is there any research left to do typing/annotation-wise, or any other way I can join development?

@cosmicBboy
Collaborator

great @amitripshtos! #258 is the first version of this functionality, which we'll be merging into the project very soon, just ironing out some kinks in CI. @jeffzi is there anything you need help with?

@amitripshtos I'm sure there will be more to discuss in terms of bugfixes/enhancements to the class-based/pydantic-style API, so we'd love feedback once 0.5.0 is released with this new API!

@jeffzi
Collaborator Author

jeffzi commented Sep 27, 2020

Thanks @amitripshtos 🎉

I agree with @cosmicBboy. Once the feature is released (and documented!), it will be easier for you to review and suggest improvements. That being said, I summarized the feature in this comment. Feel free to have a look and give early feedback :)

At the moment, I think annotations are in a good shape. Post 0.5.0, I'd love to investigate compatibility with data-science-types stubs to ease adoption.

@amitripshtos
Contributor

amitripshtos commented Oct 5, 2020

@jeffzi Great job! Really amazing work :)
I started playing with it a bit, and I came across some typing issue:
[screenshot: PyCharm warns that the returned DataFrame[InputSchema] does not match the declared DataFrame[OutputSchema] return type]

As you can see, I took the input data frame and added the new field, and returned it.
However, PyCharm raises a warning because I did not return DataFrame[OutputSchema], even though the code is right.

How do you think we should overcome this one? By making the annotation and the Generic typing semantic only?
The only way to make it work is to create a new dataframe and copy everything; however, I don't think that solution is great.

@cosmicBboy
Collaborator

cosmicBboy commented Oct 5, 2020

I was able to reproduce this behavior ^^ in pycharm. I'm guessing pycharm's type-checking implementation is different from mypy's: mypy file.py (with the contents of file.py being the code above) doesn't complain. Pycharm's checker seems stricter, since it's strictly correct to say that the people variable of type DataFrame[InputSchema] is not of type DataFrame[OutputSchema].

Defining class OutputSchema(InputSchema) doesn't seem to lift the complaint either.

@jeffzi
Collaborator Author

jeffzi commented Oct 5, 2020

@amitripshtos Thanks for reporting that issue.

It's related to the concepts of covariance and contravariance. A good summary can be found here and in mypy's documentation.

If we have class DerivedSchema(BaseSchema), we want DataFrame[DerivedSchema] to be accepted wherever DataFrame[BaseSchema] is, but not the opposite. The reason is that DerivedSchema can only add columns to BaseSchema.
➡️ That's covariance.

Therefore, we should declare class InputSchema(OutputSchema).

I've tested that Pycharm assumes covariance by default, and the above makes it happy. However, by default, mypy assumes that all user-defined generics are invariant.

I did not mark pandera.typing.Schema as covariant though:
https://github.com/pandera-dev/pandera/blob/48059abcde5efd48d87062b9ec943cbd0fdb2541/pandera/typing.py#L56
It seems that mypy doesn't raise an error because pandera.typing.DataFrame inherits pandas.DataFrame.
https://github.com/pandera-dev/pandera/blob/48059abcde5efd48d87062b9ec943cbd0fdb2541/pandera/typing.py#L67
I tried the following:

class DataFrame(Generic[Schema]):
    """Representation of pandas.DataFrame."""

^^ In that case, mypy throws error: Incompatible return value type (got "DataFrame[InputSchema]", expected "DataFrame[OutputSchema]").

Schema = TypeVar("Schema", bound="SchemaModel", covariant=True)  # type: ignore

^^ That fixes the error even when pandas.DataFrame is not inherited. I'm not sure why inheriting pandas.DataFrame erases the error... but I think we should mark Schema as covariant just in case.
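
For context, here is a minimal example unrelated to pandera that shows the difference under mypy: with an invariant TypeVar, passing a Container[Derived] where Container[Base] is expected is rejected, whereas covariant=True accepts it (all names below are illustrative):

from typing import Generic, TypeVar

class Base: ...
class Derived(Base): ...

S_co = TypeVar("S_co", bound=Base, covariant=True)

class Container(Generic[S_co]):
    ...

def consume(c: Container[Base]) -> None:
    ...

consume(Container[Derived]())  # accepted with covariant=True; an invariant TypeVar would reject this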

@amitripshtos
Contributor

amitripshtos commented Oct 5, 2020

@jeffzi Thanks for your response.
I tried changing this line:

Schema = TypeVar("Schema", bound="SchemaModel")  # type: ignore

to:

Schema = TypeVar("Schema", bound="SchemaModel", covariant=True)  # type: ignore

However, Pycharm still doesn't like it. Am I missing something? :)

EDIT:
I now understand that you want to do something like this:

class OutputSchema("InputSchema"):
    real_age: Series[float]

    
class InputSchema(OutputSchema):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)


@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people['real_age'] = people['age'] / 2
    return people

However, it does not make sense either logically or programmatically.
I'm trying to think about solutions to this one; let me know if I'm wrong or if you have a solution :)

@cosmicBboy
Collaborator

cosmicBboy commented Oct 5, 2020

I'm also finding that covariant=True doesn't help with the pycharm type checking error.

I think the issue with class InputSchema(OutputSchema) is that if we were to use @check_types for runtime checks the input people dataframe won't have the test (/real_age) column.

Not sure what the memory implications of this are, and it's a bit ugly, but wrapping people in pd.DataFrame in the return statement seems to appease the pycharm type checker:

import pandas as pd
from pandera import SchemaModel, check_types
from pandera.typing import DataFrame, Series


class InputSchema(SchemaModel):
    name: Series[str]
    age: Series[int]


class OutputSchema(SchemaModel):
    real_age: Series[float]


def prediction(people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people["real_age"] = people["age"] / 2
    return pd.DataFrame(people)

At this stage of development the pandera types are really syntactic sugar for doing run-time type checking with the typing syntax, so, as a last resort, I wonder if there's a way to automatically disable static type checking for the pandera types, since I anticipate more problems like this in various IDEs and their type-checking implementations.

@jeffzi
Collaborator Author

jeffzi commented Oct 5, 2020

Sorry for the confusion. Actually, you don't need to modify the code of pandera. Pycharm assumes covariance by default, and mypy passes as well (somehow, because we inherit from pd.DataFrame). I was just saying that we should specify covariant=True to be explicit and avoid future issues.

The code below should be accepted by Pycharm:

from pandera import SchemaModel, check_types
from pandera.typing import Series, DataFrame


class OutputSchema(SchemaModel):
    test: Series[float]


class InputSchema(OutputSchema):
    name: Series[str]
    age: Series[int]

@check_types
def prediction(people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people['real_age'] = people['age'] / 2
    return people

Since you add a column to people, InputSchema is a subtype of OutputSchema.

class InputSchema("OutputSchema"):
    name: Series[str]
    age: Series[int]


class OutputSchema(SchemaModel):
    test: Series[float]

This doesn't work. My guess is that it's an issue with Pycharm's type checker. I also tested, with an example unrelated to pandera typing, that Pycharm completely ignores covariant=True and contravariant=True :(

@amitripshtos
Contributor

amitripshtos commented Oct 5, 2020

Another really cool bug I found:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)


class OutputSchema(InputSchema):
    real_age: Series[float]


@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    results = pd.DataFrame(people)
    return results

This code should fail because the output dataframe does not have the real_age field.
However, it does not fail.

The cool part is when I added a line:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)


class OutputSchema(InputSchema):
    real_age: Series[float]

# I ADDED THIS PRINT
print(OutputSchema.to_schema())

@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    results = pd.DataFrame(people)
    return results

It raised the right error. I think there is some kind of cache happening here that is causing this issue (not quite sure).
Again - great work and I'm really thrilled using it! I will try to find as many issues as I can

Edit:
The issue is here:

    @classmethod
    def to_schema(cls) -> DataFrameSchema:
        """Create DataFrameSchema from the SchemaModel."""
        if cls.__schema__:
            return cls.__schema__

Because we inherit from InputSchema, which I guess already has a schema at runtime, we return the inherited schema. I think the solution is either to remove the condition that checks if cls.__schema__ exists, or to somehow detect that the schema was inherited.
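
One possible direction (my assumption, not necessarily the fix that will land): only trust the cached schema when it was set on this exact class, so an inherited __schema__ is ignored:

    @classmethod
    def to_schema(cls) -> DataFrameSchema:
        """Create DataFrameSchema from the SchemaModel."""
        # look only at cls.__dict__ so that a __schema__ cached on a parent class
        # (e.g. InputSchema) is not reused for a subclass (e.g. OutputSchema)
        cached = cls.__dict__.get("__schema__")
        if cached is not None:
            return cached
        cls.__schema__ = cls._build_schema()  # hypothetical builder from annotations/Fields
        return cls.__schema__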

@amitripshtos
Contributor

amitripshtos commented Oct 5, 2020

I agree with the last @cosmicBboy comment. If we find a way to disable the typing issues, that would be good; otherwise, the solution in the documentation should, in my opinion, be either:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)


class OutputSchema(InputSchema):
    real_age: Series[float]


@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people['real_age'] = people['age'] / 2
    return pd.DataFrame(people)

or:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)


class OutputSchema(InputSchema):
    real_age: Series[float]


@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    results = pd.DataFrame(people)
    results['real_age'] = results['age'] / 2
    return results

@cosmicBboy
Collaborator

cosmicBboy commented Oct 5, 2020

@jeffzi @amitripshtos re: #253 (comment)

I made a PR #282 that adds a test to the testing suite to catch this bug. The change to the source code is the naive fix, @jeffzi feel free to make changes to the PR, getting rid of the __schema__ class attribute cache was just the quickest fix.

@jeffzi
Collaborator Author

jeffzi commented Oct 5, 2020

Not sure what the memory implications of this are, and it's a bit ugly, but wrapping people in pd.DataFrame in the return statement seems to appease the pycharm type checker:

I wouldn't encourage users to risk a copy of the DataFrame.

I wonder if there's a way to automatically disable static type checking for the pandera types, since I anticipate more problems like this in various IDEs and their type-checking implementations.

We can leverage TYPE_CHECKING:

if TYPE_CHECKING:
    DataFrame = pd.DataFrame
else:
    class DataFrame(pd.DataFrame, Generic[Schema]):
        """Representation of pandas.DataFrame."""

Pycharm is happy but cannot auto-complete regular pandas.DataFrame methods anymore... Vscode + Pylance has no issue at all.

We can test major IDE's (pycharm, vscode, etc.) and checkers (mypy, pyright, etc.). From there we provide guidelines for each of the tested tools.

Other solutions:

  1. mark Schema type as covariant on pandera's side (make sure we conform to PEP 484).
  2. disable the linter locally: # type: ignore for mypy. For pycharm, put a # noinspection PyTypeChecker above the line.
@check_types
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people["real_age"] = people["age"] / 2
    # noinspection PyTypeChecker
    return people
  3. cast is the recommended solution when the type checker chokes on a subtype: cast(DataFrame[OutputSchema], people).

However, Pycharm does not recognize casting generics. If I take an example from mypy's documentation:

from typing import cast, List

o: object = [1]
x = cast(List[int], o) 

Pycharm flags the last line with Unexpected type(s): (GenericAlias, object) Possible types: (str, Any) (Type[_T], Any) but it's perfectly legit.

tl;dr: Imho, let's add the TYPE_CHECKING bypass and write guidelines for major tools.
@cosmicBboy I hope you don't regret embarking on this journey!
@amitripshtos Thanks for testing that early work, let us know if you find anything else :)

@amitripshtos
Contributor

I can confirm that @jeffzi's solution works well in Pycharm.
I added a PR with the changes he proposed: #283

@cosmicBboy
Collaborator

tl;dr: Imho, let's add the TYPE_CHECKING bypass and write guidelines for major tools.

agreed. will add the guidelines part to the documentation task #281

@cosmicBboy I hope you don't regret embarking on this journey!

haha, no regrets! we're going full steam ahead with this feature, I'm excited to try it out in my work :)

thanks @amitripshtos appreciate the early testing, please keep the bug reports coming as you find them!
