Add nullable column when missing. #687
Comments
Similar in spirit to my other issue (#706), but distinct. Would love to see both of these implemented when someone gets time.
Hi @MikiGrit, thanks for articulating the feature request! Gonna ping @aodj here too, who created a very similar issue (#893).

In short: yes! I give my blessing to support this feature 😀. The use case is clear and will provide value to a lot of other folks using pandera.

A related issue is #502, which allows users to fill in default values in a column... this would take it to another level, filling in missing columns (potentially with a default value?)

Just a quick preamble: I've been doing a major overhaul of pandera to abstract out all the pandas-specific logic into its own set of modules/classes as part of #381, and I think this change is a good candidate for figuring out whether the next-gen schema abstraction is easy to extend. The working branch is here: https://github.com/unionai-oss/pandera/tree/core-schema

Solution Proposal

Add a new option to

In this first iteration of the feature, this option should only work with nullable columns, and will raise a SchemaError if the missing column is not nullable. This restriction should be lifted once users can specify a default value (#502).

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "col1": pa.Column(int),
    "col2": pa.Column(int, nullable=True),
})
data = pd.DataFrame({"col1": [1]})
validated_data = schema(data)
# validated_data should now have a "col2" column, which contains all null values.

Steps to Implement

The new pandera module structure consists of
@MikiGrit @aodj let me know if either of you have the capacity to implement this feature! I can help guide/answer any questions!
…1186)
* Add add_missing_columns DataFrame schema config per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
* Add checks to DataFrameSchema to throw a SchemaInitError if add_missing_columns is enabled and non-nullable columns without a default are added, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
* Revert "Add checks to DataFrameSchema to throw a SchemaInitError if add_missing_columns is enabled and non-nullable columns without a default are added, per enhancement #687" This reverts commit 2a0ef1c.
* Throw a SchemaError exception if add_missing_columns is enabled and missing non-nullable columns without a default are found, per enhancement #687 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
* Fix bug in default value fill where first column with a default value fills the entire dataframe, even in unrelated columns, per issue #1193 Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
* fix lint Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
* add column coercion test Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
* add documentation Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
---------
Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>
I would like to use pandera to validate columns that are missing in the dataframe but are nullable, so they can be safely added. Would it be possible? Perhaps with some new config key like coerce_columns=True?

The reason I'm trying to do this is that I'm parsing XML files and some fields may be missing. As the schema of the resulting dataframe is defined in only one place (a pandera SchemaModel), I would like to be able to dynamically read everything from the files (e.g. with {xml_value.attrib['name']: xml_value.text for xml_value in xml_values.findall('value')}) and later have the pandera validate check add the columns that are missing.
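To make the XML use case concrete, the following is a minimal standard-library sketch of the situation described: records are read dynamically with the dict comprehension from the issue, and any field the record lacks is filled with None. The XML document, the NULLABLE_COLUMNS list, and the helpers parse_record and fill_missing are all hypothetical stand-ins; in the requested feature, pandera would do the filling from the schema instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second record has no "phone" value,
# and neither record has an "email" value.
XML = """
<records>
  <record>
    <value name="id">1</value>
    <value name="phone">555-0100</value>
  </record>
  <record>
    <value name="id">2</value>
  </record>
</records>
"""

# Hypothetical stand-in for the nullable columns a schema would declare.
NULLABLE_COLUMNS = ["id", "phone", "email"]


def parse_record(xml_values):
    """Read every <value> child dynamically, as in the issue's comprehension."""
    return {
        xml_value.attrib["name"]: xml_value.text
        for xml_value in xml_values.findall("value")
    }


def fill_missing(row, columns):
    """Return a row with exactly the given columns; absent ones become None."""
    return {col: row.get(col) for col in columns}


root = ET.fromstring(XML)
rows = [
    fill_missing(parse_record(record), NULLABLE_COLUMNS)
    for record in root.findall("record")
]
```

Doing this by hand means the column list lives in two places, which is exactly the duplication the feature request is trying to avoid.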