Add nullable column when missing. #687

Open
MikiGrit opened this issue Nov 23, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@MikiGrit

I would like to use pandera to validate columns that are missing in the dataframe but are nullable, so they can be safely added. Would it be possible? Perhaps with some new config key like coerce_columns=True?

The reason is that I'm parsing XML files in which some fields may be missing. Since the schema of the resulting dataframe is defined in only one place (a pandera SchemaModel), I would like to read everything from the files dynamically (e.g. with {xml_value.attrib['name']: xml_value.text for xml_value in xml_values.findall('value')}) and have pandera add any missing columns later, during validation.
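
To illustrate the situation, here is a minimal standard-library sketch of that parsing step. The XML layout, the NULLABLE_COLUMNS list, and the helper names parse_records and fill_missing are all assumptions made for this example; the fill step is only a plain-Python stand-in for what the requested option would do during validation, not pandera behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second record has no "comment" field.
XML = """
<records>
  <record>
    <value name="id">1</value>
    <value name="comment">first</value>
  </record>
  <record>
    <value name="id">2</value>
  </record>
</records>
""".strip()

# Hypothetical list of the columns that the schema marks as nullable.
NULLABLE_COLUMNS = ["comment", "score"]


def parse_records(xml_text):
    """Read every <value> of every <record> into a dict, as in the issue text."""
    root = ET.fromstring(xml_text)
    return [
        {xml_value.attrib["name"]: xml_value.text
         for xml_value in record.findall("value")}
        for record in root.findall("record")
    ]


def fill_missing(rows, nullable_columns):
    """Give every row a None entry for each nullable column it lacks."""
    return [{**{col: None for col in nullable_columns}, **row} for row in rows]


rows = fill_missing(parse_records(XML), NULLABLE_COLUMNS)
for row in rows:
    print(row)
```

Without the fill step, the rows would have different keys from file to file, which is exactly what makes a fixed schema hard to validate against.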

@MikiGrit MikiGrit added the enhancement New feature or request label Nov 23, 2021
@benlindsay

Similar in spirit to my other issue (#706), but distinct. Would love to see both of these implemented when someone gets time.

@cosmicBboy
Collaborator

Hi @MikiGrit thanks for articulating the feature request! gonna ping @aodj here too, who created a very similar issue (#893).

In short: yes! I give my blessing to support this feature 😀. The use case is clear and will provide value to a lot of other folks using pandera. A related issue is #502, which allows users to fill in default values in a column... this would take it to another level, filling in missing columns (potentially with a default value?)

Just a quick preamble: I've been doing a major overhaul of pandera to abstract all the pandas-specific logic into its own set of modules/classes as part of #381, and I think this change is a good candidate for figuring out whether the next-gen schema abstraction is easy to extend.

The working branch is here: https://github.com/unionai-oss/pandera/tree/core-schema

Solution Proposal

Add a new option to DataFrameSchema (and SchemaModel.Config) called add_missing_columns, which adds missing columns if True and is False by default.

In this first iteration of the feature, this option should only work with nullable columns, and will raise a SchemaError if a missing column is not nullable. This restriction should be lifted once users can specify a default value (#502).

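import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "col1": pa.Column(int),
        "col2": pa.Column(int, nullable=True),
    },
    add_missing_columns=True,  # the new option proposed above (False by default)
)

data = pd.DataFrame({"col1": [1]})

validated_data = schema(data)
# validated_data should now have a "col2" column, which contains all null values.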

Steps to Implement

The new pandera module structure consists of core modules and backend modules. This functionality would live in the backend modules, which implement the actual validation logic. See here for the strict_filter_columns implementation, which is invoked in the backends.pandas.container.DataFrameSchemaBackend.validate method.

  1. Add the add_missing_columns option to the core.pandas.container.DataFrameSchema class.
  2. This option will then be available in DataFrameSchemaBackend.validate, which receives the schema object.
  3. A new DataFrameSchemaBackend method implements the add_missing_columns functionality and should be invoked in DataFrameSchemaBackend.validate.
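
As a rough illustration of what the new backend method in step 3 might do, here is a sketch that uses pandas only. ColumnSpec is a hypothetical stand-in for a schema column, the free function add_missing_columns is not pandera's actual method, and ValueError stands in for the SchemaError described in the proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict

import pandas as pd


@dataclass
class ColumnSpec:
    """Hypothetical stand-in for a schema column; not pandera's Column class."""
    dtype: Any
    nullable: bool = False


def add_missing_columns(df: pd.DataFrame, columns: Dict[str, ColumnSpec]) -> pd.DataFrame:
    """Return a copy of df in which every missing nullable column is all-null."""
    out = df.copy()
    for name, spec in columns.items():
        if name in out.columns:
            continue  # column is present, nothing to add
        if not spec.nullable:
            # The proposal raises SchemaError here; ValueError is a stand-in.
            raise ValueError(f"column '{name}' is missing and is not nullable")
        # Add the column filled with NaN, aligned to the existing index.
        out[name] = pd.Series(float("nan"), index=out.index)
    return out


spec = {
    "col1": ColumnSpec(int),
    "col2": ColumnSpec(int, nullable=True),
}
data = pd.DataFrame({"col1": [1]})
result = add_missing_columns(data, spec)
print(result)
```

The sketch copies the input rather than mutating it, so the caller's dataframe is left unchanged; whether the real backend should do the same is a design decision this example does not settle.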

@MikiGrit @aodj let me know if either of you have the capacity to implement this feature! I can help guide/answer any questions!

cosmicBboy added a commit that referenced this issue Jun 30, 2023
…1186)

* Add add_missing_columns DataFrame schema config per enhancement #687

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>

* Add checks to DataFrameSchema to throw a SchemaInitError if add_missing_columns is enabled and non-nullable columns without a default are added, per enhancement #687

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>

* Revert "Add checks to DataFrameSchema to throw a SchemaInitError if add_missing_columns is enabled and non-nullable columns without a default are added, per enhancement #687"

This reverts commit 2a0ef1c.

* Throw a SchemaError exception if add_missing_columns is enabled and missing non-nullable columns without a default are found, per enhancement #687

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>

* Fix bug in default value fill where first column with a default value fills the entire dataframe, even in unrelated columns, per issue #1193

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>

* fix lint

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* add column coercion test

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

* add documentation

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

---------

Signed-off-by: Derin Walters <derin.c.walters@rijjin.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

3 participants