pydantic/fastapi style validation #253
love this idea @jeffzi! To be honest I'm not entirely familiar with Python's typing interface (though I am a user of the feature), and maybe you can answer a few basic questions I have :)
simple = pa.DataFrameSchema({"X": pa.Column(pa.Int)})

@pa.typechecked
def transform_data(a: DataSet[simple.type], b: DataSet[simple.type]) -> DataSet[simple.type]:
    a["X"] = a["X"] + b["X"]
    return a
class Simple(BaseSchema):
    X = pa.Column(pa.Int)

One concern I have about supporting two APIs is that there will then be two (very different) ways of defining dataframe schemas, which may confuse users, especially newcomers. I'm familiar enough with both pydantic and fastapi, and I do, however, love the IDE benefits from the proposed API. And how nice would it be to potentially integrate dataframe schemas with fastapi!? All this said, I think this feature makes sense, and I think it's good to adopt principled design paradigms that are widely adopted by the community (aka pydantic) and to support various ways of doing things to ease adoption of pandera.
Having slept on this idea 🙂, I think we should fully discuss the implications of this proposal, breaking it down into two separate pieces:
Just to give you a little background on the design principles of pandera:

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1., 2., 3.],
})

schema = pa.DataFrameSchema({
    "col1": pa.Column(pa.Int),
    "col2": pa.Column(pa.Float),
})

Originally, what I wanted to optimize was the developer experience.
On a separate note, I do think something like a …
I'm glad you like the idea :) One assumption I did not explicitly state is that mypy support should be mandatory. Not supporting it would most likely be a deal-breaker for many users, and rightfully so. Some constructions are syntactically correct, but mypy will throw an error because the types don't match.
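For example, a hypothetical sketch using the proposed DataSet type (the schemas and functions here are illustrative):

import pandera as pa

ints = pa.DataFrameSchema({"X": pa.Column(pa.Int)})
floats = pa.DataFrameSchema({"Y": pa.Column(pa.Float)})

@pa.typechecked
def double(df: DataSet[ints.type]) -> DataSet[ints.type]:
    df["X"] = df["X"] * 2
    return df

def load() -> DataSet[floats.type]: ...

double(load())  # syntactically valid, but mypy should flag the incompatible argument type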
Questions regarding the implementation:

i. We could check the type of the class variables:

class Simple(BaseSchema):
    column = pa.Column(pa.Int)
    idx = pa.Index(pa.String, Check(lambda x: x.str.startswith("index_")))

ii. pydantic relies on …

Thanks for explaining the design decisions. I agree adopting the pydantic conventions wholesale makes more sense. To be honest, I was testing the waters to see if you'd be interested.
It sounds like this feature would be a major endeavor. My 2 cents is that …
Thanks for the discussion @jeffzi, I just made #254 as the near-term solution to your problem.

Based on the points so far, it seems like the lack of type-hinting support for pandas is a big blocker. In the meantime, I would like to discuss what you think an ideal API would look like. Even if we don't come up with anything actionable right now, I would like to have these ideas written down for the record for when we revisit this for implementation down the line. I sort of see two different approaches:

1. Embracing the …
Actually, pydantic has constrained types to help with conciseness. The implementation is quite interesting: e.g. constr. Pandera could define similar types that leverage Checks behind the scenes. That way, validators would be limited to non-straightforward checks. For those complicated checks, one-liner lambdas should be avoided and the checks defined in their own functions, even with the current pandera API.

from pandera import conint, constr
from pandera.typing import Series, Index
class MySchema(BaseSchema):
    # these can use custom pa.typing.Column
    int_column: Series[conint(gt=0, lt=100)]
    # pydantic does not have isin; it would use Literal["A", "B"]
    str_column: Series[constr(isin=["A", "B"])]
    # a special class attribute for the index so as to avoid conflicts
    # with columns named "index"; can use custom pa.typing.Index
    __schema_index__: Index[conint(gt=0)]

    @column_validator("int_column", groupby="str_column")
    def int_column_groupby_checks(cls, series_dict: Dict[str, pd.Series]):
        return series_dict["A"].mean() > series_dict["B"].mean()

    @dataframe_validator
    def dataframe_checks(cls, df):
        return "foo_column" not in df

I've replaced pd.Series with pandera.typing.Series, as it should be a type, not the real class. Also, what exactly does the series_dict argument contain?
Thanks for the additional discussion @jeffzi! I think we have a lot of great ideas and material here to act on at some point in the future.
It's a dictionary mapping the discrete values of the groupby column to the corresponding subsets of the series.

The constrained types are definitely something to look into for conciseness. Reading the pydantic docs more, the approach does seem workable:

from pandera.typing import BaseSchema, Series, Column

# slightly modified
class Simple(BaseSchema):
    column: Series[np.int64] = Column(
        ...,  # positional default
        lt=100,
        gt=0,
        # additional kwargs will be named Checks
        is_monotonic=pa.Check(lambda s: np.diff(s) > 0),
        is_even=pa.Check(lambda x: x % 2 == 0, element_wise=True)
    )

I do wonder if the positional default makes sense for dataframes (i.e. a default value for an entire column if it's not provided in the dataframe).
To summarize what we have so far:
Open Questions
Implementation Draft (author @jeffzi, with edits from @cosmicBboy)

import pandas as pd
import pandera as pa
# pandera.typing.base_schema module
class BaseSchema(pa.DataFrameSchema):
    def __init__(self):
        pass  # collect class variables into columns/index
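    # A hypothetical sketch of that collection step (not the actual implementation):
    #     columns, index = {}, None
    #     for name, value in vars(type(self)).items():
    #         if isinstance(value, pa.Column):
    #             columns[name] = value
    #         elif isinstance(value, pa.Index):
    #             index = value
    #     super().__init__(columns=columns, index=index)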
SchemaType = TypeVar("SchemaType", bound=BaseSchema)
# Q: is the `DataFrame` naming okay here? `DataSet` seems too general
class DataFrame(pd.DataFrame, Generic[SchemaType]):
    pass
# user-facing api
from pandera.typing import DataFrame, Series, Index, ColumnField
class Schema(BaseSchema):
    col1: Series[np.int64] = ColumnField(lt=100, gt=0)
    col2: Series[str] = ColumnField(isin=["A", "B", "C"], nullable=True)
@pa.check_types
def transform_data(a: DataFrame[Schema], b: DataFrame[Schema]) -> DataFrame[Schema]:
    a["col1"] = a["col1"] + b["col1"]
    a["col2"] = a["col2"].combine_first(b["col2"])
    return a

I'm not sure what your appetite is for taking a shot at a prototype @jeffzi, but I'd be down to start working on something with you!
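For the record, a minimal sketch of what the @pa.check_types decorator above could do, assuming the BaseSchema/DataFrame pieces from the draft (_schema_of is a hypothetical helper, not the actual implementation):

import functools
import inspect
import typing

def _schema_of(annotation):
    # extract and instantiate Schema from a DataFrame[Schema] annotation
    if typing.get_origin(annotation) is DataFrame and typing.get_args(annotation):
        return typing.get_args(annotation)[0]()
    return None

def check_types(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        hints = typing.get_type_hints(fn)
        bound = inspect.signature(fn).bind(*args, **kwargs)
        # validate every argument annotated as DataFrame[Schema]
        for name, value in bound.arguments.items():
            schema = _schema_of(hints.get(name))
            if schema is not None:
                schema.validate(value)
        result = fn(*args, **kwargs)
        # validate the return value the same way
        return_schema = _schema_of(hints.get("return"))
        if return_schema is not None:
            return_schema.validate(result)
        return result
    return wrapper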
Thanks for the summary.
I want to emphasize that we can't have static type-checking unless pandas is fully annotated (see pandas-dev/pandas#14468 and pandas-dev/pandas#26766). Even then, I think a mypy plugin or another tool would be necessary to cover all cases. For example:

class MySchema(BaseSchema):
    A: Series[str]
def transform(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    # Static checker should infer that A's dtype switched to int.
    # It's harder because I'm using a string "A" to refer to the column.
    df["A"] = 1
    return df
data = pd.DataFrame({"A": ["abc"]})  # static checker should infer that column A has string dtype
transform(data)

It's hard for static checkers to follow the flow of data due to the dynamic nature of pandas.
At work, I've seen situations where we want to force an output schema regardless of missing columns in the input. In that case, we add the missing columns filled with NA values. For example, when parsing JSON events where optional fields are not represented in the JSON structure, we still want the output CSV to contain those missing fields.
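For example, pandas' reindex does exactly this (a small sketch; the column names are illustrative):

import pandas as pd

event = pd.DataFrame({"user_id": [1, 2]})  # "email" was absent from the JSON
output_columns = ["user_id", "email"]

# force the output schema: missing columns are created and filled with NaN
event = event.reindex(columns=output_columns)
event.to_csv("events.csv", index=False)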
I'd say yes. The …
Another thing to note: if you use both syntaxes on the same attribute, pydantic verifies that the constraints are not in direct contradiction: https://pydantic-docs.helpmanual.io/usage/schema/#unenforced-field-constraints

I'd be happy to put together a prototype. I'm fully expecting to discover new issues along the way, but the process will be interesting.
Hi,

This is certainly very complex and probably a long-term goal. Just in case you didn't see it yet, and in case it helps a tiny bit during development: this repo contains decent (yet still incomplete) pandas stub files: https://github.com/predictive-analytics-lab/data-science-types. Thanks for your effort, this is going to be awesome!
@stephenkraemer Thanks for the link, I didn't know about those stubs. The stubs are certainly promising, though still incomplete.

Now, for the details. There are 2 ways to get annotations:
Postponed Evaluation of Annotations is "required" to use the pd.Series[np.int32] syntax, because pd.Series cannot be subscripted at runtime. The problem is that the resulting string annotations can't be evaluated by get_type_hints either:

from __future__ import annotations  # required
from typing import List, get_type_hints
import numpy as np
import pandas as pd
from reprexpy import SessionInfo
class ListContainer:
    l: List[int]
print(f"{ListContainer.__annotations__}") # returns a string
#> {'l': 'List[int]'}
print(f"{get_type_hints(ListContainer)}") # returns a type
#> {'l': typing.List[int]}
class SeriesContainer:
    s: pd.Series[np.int32]
print(f"{SeriesContainer.__annotations__}") # returns a string that we could parse or evaluate somehow :-(
#> {'s': 'pd.Series[np.int32]'}
print(f"{get_type_hints(SeriesContainer)}") # can't evaluate
#> Traceback (most recent call last):
#> <ipython-input-1-c27aafd86c14> in <module>
#> ----> 1 print(f"{get_type_hints(SeriesContainer)}") # can't evaluate
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in get_type_hints(obj, globalns, localns)
#> 1227 if isinstance(value, str):
#> 1228 value = ForwardRef(value, is_argument=False)
#> -> 1229 value = _eval_type(value, base_globals, localns)
#> 1230 hints[name] = value
#> 1231 return hints
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in _eval_type(t, globalns, localns)
#> 268 """
#> 269 if isinstance(t, ForwardRef):
#> --> 270 return t._evaluate(globalns, localns)
#> 271 if isinstance(t, _GenericAlias):
#> 272 ev_args = tuple(_eval_type(a, globalns, localns) for a in t.__args__)
#> ~/.pyenv/versions/3.8.2/lib/python3.8/typing.py in _evaluate(self, globalns, localns)
#> 516 localns = globalns
#> 517 self.__forward_value__ = _type_check(
#> --> 518 eval(self.__forward_code__, globalns, localns),
#> 519 "Forward references must evaluate to types.",
#> 520 is_argument=self.__forward_is_argument__)
#> <string> in <module>
#> TypeError: 'type' object is not subscriptable
SessionInfo()
#> Session info --------------------------------------------------------------------
#> Platform: Linux-5.8.6-1-MANJARO-x86_64-with-glibc2.29 (64-bit)
#> Python: 3.8
#> Date: 2020-09-14
#> Packages ------------------------------------------------------------------------
#> numpy==1.19.2
#> pandas==1.1.2
#> reprexpy==0.3.0

Created on 2020-09-14 by the reprexpy package
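One way around this is for pandera to ship its own subscriptable generic aliases instead of relying on pd.Series in annotations; get_type_hints can then evaluate them without postponed evaluation. A minimal sketch of the idea (not the actual pandera source):

from typing import Generic, TypeVar, get_type_hints

import numpy as np

Dtype = TypeVar("Dtype")

class Series(Generic[Dtype]):
    """Subscriptable stand-in for pandas.Series."""

class SeriesContainer:
    s: Series[np.int32]

print(get_type_hints(SeriesContainer))  # evaluates cleanly: {'s': Series[numpy.int32]}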
@jeffzi FYI I'm slating this to be part of the 0.5.0 release, to go with dropping py3.5.
Hey - I'm very interested in this feature and willing to help! Is there any research left to do typing/annotation-wise, or any way I can join development?
great @amitripshtos! #258 is the first version of this functionality, which we'll be merging into the project very soon, just ironing out some kinks in CI. @jeffzi is there anything you need help with?

@amitripshtos I'm sure there will be more to discuss in terms of bugfixes/enhancements to the class-based/pydantic-style API, so we'd love feedback once it's released!
Thanks @amitripshtos 🎉 I agree with @cosmicBboy. Once the feature is released (and documented!), it will be easier for you to review and suggest improvements. That being said, I summarized the feature in this comment. Feel free to have a look and give early feedback :) At the moment, I think the annotations are in good shape.
@jeffzi Great job! Really amazing work :) As you can see, I took the input dataframe, added the new field, and returned it. How do you think we should overcome this one? Making the annotation and the Generic typing semantic only?
I was able to reproduce this behavior ^^ in Pycharm. I'm guessing Pycharm's type-checking implementation is different from mypy's.
@amitripshtos Thanks for reporting that issue. It's related to the concepts of covariance and invariance of generic types. If we have an InputSchema that subclasses OutputSchema, then a DataFrame[InputSchema] should be accepted wherever a DataFrame[OutputSchema] is expected. Therefore, we should declare the Schema TypeVar as covariant.

I've tested that Pycharm assumes covariance by default, and the above makes it happy. However, by default, mypy assumes that all user-defined generics are invariant. I did not mark the TypeVar as covariant here:

class DataFrame(Generic[Schema]):
    """Representation of pandas.DataFrame."""

^^ In that case, mypy throws an error.

Schema = TypeVar("Schema", bound="SchemaModel", covariant=True)  # type: ignore

^^ That fixes the error.
@jeffzi Thanks for your response. I changed the TypeVar declaration to:

Schema = TypeVar("Schema", bound="SchemaModel", covariant=True)

However, Pycharm still doesn't like it. Am I missing something? :)

EDIT: …

However, it does not make sense both logically and programmatically.
Not sure what the memory implications of this are, and it's a bit ugly, but wrapping the returned dataframe in pd.DataFrame satisfies the type checker:

import pandas as pd
from pandera import SchemaModel, check_types
from pandera.typing import DataFrame, Series
class InputSchema(SchemaModel):
    name: Series[str]
    age: Series[int]

class OutputSchema(SchemaModel):
    real_age: Series[float]
def prediction(people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people["real_age"] = people["age"] / 2
    return pd.DataFrame(people)

Since at this stage of development the pandera types are really syntactic sugar for run-time type checking using the typing syntax, as a last resort I wonder if there's a way to automatically disable static type checking for the pandera types; I anticipate more problems like this in various IDEs and their various type-checking implementations.
Sorry for the confusion. Actually, you don't need to modify the code of pandera: Pycharm assumes covariance by default. The code below should be accepted by Pycharm:

from pandera import SchemaModel
from pandera.typing import Series, DataFrame
class OutputSchema(SchemaModel):
    test: Series[float]

class InputSchema(OutputSchema):
    name: Series[str]
    age: Series[int]
@check_types
def prediction(people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people['real_age'] = people['age'] / 2
    return people

Since you add a column to people, I also tried reversing the declaration order and referencing the base class with a forward reference:

class InputSchema("OutputSchema"):
    name: Series[str]
    age: Series[int]

class OutputSchema(SchemaModel):
    test: Series[float]

This doesn't work. My guess would be that it's an issue with Pycharm's type checker. Actually, I also tested an example unrelated to pandera typing, and Pycharm completely ignores variance there as well.
Another really cool bug I found:

…

This code should fail because the output dataframe does not have the real_age field. The cool part is when I added a line:

…

It raised the right error. I think there is some kind of cache happening here that's causing this issue (not quite sure).

Edit: Because we inherit from InputSchema, which I guess already has a schema at runtime, we return the inherited schema. I think the solution is either to remove the condition that checks whether cls.__schema__ exists, or to somehow detect the inheritance.
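A sketch of the caching pattern being described (hypothetical names; build_schema stands in for however pandera turns annotations into a DataFrameSchema):

class SchemaModel:
    @classmethod
    def to_schema(cls):
        # BUG: attribute lookup falls through to the parent class, so
        # OutputSchema(InputSchema) finds InputSchema's cached schema
        # and never builds its own
        if getattr(cls, "__schema__", None) is None:
            cls.__schema__ = build_schema(cls)
        return cls.__schema__

    @classmethod
    def to_schema_fixed(cls):
        # fix: only trust a cache stored on this exact class
        if "__schema__" not in cls.__dict__:
            cls.__schema__ = build_schema(cls)
        return cls.__schema__

def build_schema(cls):
    ...  # hypothetical: collect the class's annotated fields into a DataFrameSchema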
I agree with the last @cosmicBboy comment. If we find a way to disable the typings, great; otherwise, the documentation should in my opinion recommend either:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)

class OutputSchema(InputSchema):
    real_age: Series[float]
@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people['real_age'] = people['age'] / 2
    return pd.DataFrame(people)

or:

class InputSchema(SchemaModel):
    name: Series[str] = Field(allow_duplicates=False)
    age: Series[int] = Field(ge=0)

class OutputSchema(InputSchema):
    real_age: Series[float]
@check_types()
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    results = pd.DataFrame(people)
    results['real_age'] = results['age'] / 2
    return results
@jeffzi @amitripshtos re: #253 (comment) I made PR #282 that adds a test to the testing suite to catch this bug. The change to the source code is the naive fix; @jeffzi feel free to make changes to the PR, getting rid of the …
I wouldn't encourage users to risk a copy of the DataFrame.
We can leverage TYPE_CHECKING:
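Something along these lines (a sketch, reusing the covariant Schema TypeVar from earlier):

from typing import TYPE_CHECKING, Generic, TypeVar

import pandas as pd

Schema = TypeVar("Schema", covariant=True)

if TYPE_CHECKING:
    # what static checkers see: a plain generic container, so the schema
    # parameter type-checks but pandas methods are not visible
    class DataFrame(Generic[Schema]):
        ...
else:
    # what actually runs: a real pandas.DataFrame subclass
    class DataFrame(pd.DataFrame, Generic[Schema]):
        ...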
Pycharm is happy but cannot auto-complete regular pandas.DataFrame methods anymore... Vscode + Pylance has no issue at all. We can test major IDEs (Pycharm, Vscode, etc.) and checkers (mypy, pyright, etc.), and from there provide guidelines for each of the tested tools.

Other solutions:
@check_types
def prediction(a: int, people: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    people["real_age"] = people["age"] / 2
    # noinspection PyTypeChecker
    return people
However, Pycharm does not recognize casting of generics. If I take an example from mypy's documentation:

from typing import cast, List

o: object = [1]
x = cast(List[int], o)

Pycharm flags the last line with a type warning.

tl;dr: Imho, let's add the guidelines to the documentation.
agreed. will add the guidelines part to the documentation task #281
haha, no regrets! we're going full steam ahead with this feature, I'm excited to try it out in my work :) thanks @amitripshtos, appreciate the early testing, please keep the bug reports coming as you find them!
Hi. First of all thanks for developing pandera :)
Is your feature request related to a problem? Please describe.
pandera can be very verbose for the common use case where we want to validate all inputs and outputs.
Describe the solution you'd like
I'd like to propose an alternative syntactic sugar inspired by the pydantic and fastapi libraries. This example (quoted at the top of the thread) is syntactically correct.

typechecked would be a new decorator that grabs both input and return type hints and validates them if they are a subclass of the DataSet type. I think this proposal is easier to read and would satisfy the majority of the use cases. A side benefit is that the IDE knows which columns are available and will offer auto-completion. The user can still fall back to check_input and check_output when needed.

Describe alternatives you've considered

An alternative would be a simpler decorator that checks both inputs and output, with keyword arguments for inputs and an out argument for output, as sketched below.
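A sketch of what that simpler decorator's usage could look like (check_io is a hypothetical name here; simple is the schema defined in the example above):

@pa.check_io(a=simple, b=simple, out=simple)
def transform_data(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    a["X"] = a["X"] + b["X"]
    return a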
Additional context
I can prepare a PR of a prototype if you think this feature makes sense.