SQL execution Exception executing validations, using SQL and Pandas data sources: "Duplicated field name in view schema: column_values.nonnull.unexpected_count" #10926

@bluinchiostro

Description

Bug description
SQL execution exception while executing validations with the SQL and Pandas data sources.

I created a File Data Context and connected it to SQL data, filesystem data, and Pandas DataFrames. I defined my expectation suite, created a validation definition, and ran it. While executing

validation_results = validation_definition.run() with the SQL and Pandas data sources, I'm seeing this error:

Calculating Metrics:  86%|████████▌ | 24/28 [00:01<00:00, 20.39it/s]An SQL execution Exception occurred.  OperationalError: "(psycopg2.OperationalError) Duplicated field name in view schema: column_values.nonnull.unexpected_count
DETAIL:  java.sql.SQLException: Duplicated field name in view schema: column_values.nonnull.unexpected_count

[SQL: SELECT sum(CASE WHEN (field1 IS NULL) THEN %(param_1)s ELSE %(param_2)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field2 IS NULL) THEN %(param_3)s ELSE %(param_4)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field3 IS NULL) THEN %(param_5)s ELSE %(param_6)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field4 IS NULL) THEN %(param_7)s ELSE %(param_8)s END) AS "column_values.nonnull.unexpected_count" 
FROM (SELECT * 
FROM (SELECT * from my_table_data) AS anon_1 
WHERE true) AS anon_1]
[parameters: {'param_1': 1, 'param_2': 0, 'param_3': 1, 'param_4': 0, 'param_5': 1, 'param_6': 0, 'param_7': 1, 'param_8': 0}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)".  Traceback: "Traceback (most recent call last):
  File "d:\Veronica\Anaconda\envs\gxenv\lib\site-packages\sqlalchemy\engine\base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "d:\Veronica\Anaconda\envs\gxenv\lib\site-packages\sqlalchemy\engine\default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: Duplicated field name in view schema: column_values.nonnull.unexpected_count
DETAIL:  java.sql.SQLException: Duplicated field name in view schema: column_values.nonnull.unexpected_count

The dataframe does not have duplicated field names.
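For context on where the duplicate comes from: the generated SQL above batches the nonnull metric for several columns into one SELECT and reuses the identical alias for each column, and a backend that exposes the result set through a view schema (the java.sql layer in the traceback) rejects the repeated name. A minimal sketch with a toy sqlite3 table (an assumed stand-in schema, not the real data) shows the repeated alias at the driver level:

```python
import sqlite3

# Toy stand-in for my_table_data (assumed schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table_data (field1 INTEGER, field2 INTEGER)")
conn.execute("INSERT INTO my_table_data VALUES (1, NULL)")

# Same shape as the generated query: every column's metric gets the SAME alias
cur = conn.execute(
    'SELECT '
    'sum(CASE WHEN field1 IS NULL THEN 1 ELSE 0 END)'
    ' AS "column_values.nonnull.unexpected_count", '
    'sum(CASE WHEN field2 IS NULL THEN 1 ELSE 0 END)'
    ' AS "column_values.nonnull.unexpected_count" '
    'FROM my_table_data'
)
names = [d[0] for d in cur.description]
print(names)  # the alias appears twice in the result schema
```

SQLite and PostgreSQL tolerate duplicate output names in a plain SELECT, which is why the same query can succeed elsewhere; a backend that materializes the result as a view schema cannot.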

To Reproduce
great_expectations.yml config:

config_version: 4.0
config_variables_file_path: uncommitted/config_variables.yml

plugins_directory: plugins/

stores:

  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  validation_definition_store:
    class_name: ValidationDefinitionStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: validation_definitions/
expectations_store_name: expectations_store
validation_results_store_name: validation_results_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
fluent_datasources:
 my_ds:
   type: sqlite
   id: 0717e05c-7846-40ca-8b77-834fa9f74d0b
   assets:
     my_data_asset:
       type: query
       id: d891b4e5-104c-481d-836c-aa8e4395691c
       batch_metadata: {}
       batch_definitions:
         FULL_TABLE:
           id: 4814ef8f-38d8-4f04-ad89-d35f8f2a0fc1
           partitioner:
       query: SELECT * from mytable
   connection_string: ${myconnectionstring}
 employee_department:
   type: pandas_filesystem
   id: fb484e5a-d57f-444b-897f-2af191d8fe16
   assets:
     employee_department:
       type: csv
       id: 5ce9d256-4b94-4e02-b2b9-135765006abb
   base_directory: D:\Veronica\PyCharmProjects\gx_test_135
 dq:
   type: sqlite
   id: acc48a7b-f5af-425c-9dfc-6d7685283c80
   assets:
     dq_asset:
       type: query
       id: 52c24fe5-1c44-4e60-a4f0-b356b7bb9c53
       batch_metadata: {}
       batch_definitions:
         FULL_TABLE:
           id: 51874da9-477a-4f79-92af-e8e472bd4c6a
           partitioner:
       query: SELECT * from asset_name = myasset
   connection_string: sqlite:///D:\Veronica\PyCharmProjects\ds-gxupgrade\dq.db
 dq_employee:
   type: pandas_filesystem
   id: ab12ba8f-61ea-45ef-a4b3-28ed985629c4
   assets:
     employee_department:
       type: csv
       id: 29da965b-7516-4f82-8ffe-ad4b2685cc12
       batch_definitions:
         employee_department:
           id: 54381107-2029-4eca-801c-dbd09fe73353
           partitioner:
             regex: employee_department.csv$
             param_names: []
   base_directory: D:\Veronica\PyCharmProjects\ds-gxupgrade
 my_data_pandas:
   type: pandas
   id: 2d65fc95-0040-4a5b-bfb9-433288e9918a
   assets:
     pandas_asset:
       type: dataframe
       id: 02b58697-1ca2-4b46-ad6b-b80bf6209ad2
       batch_metadata: {}
       batch_definitions:
         cr_pandas_batch:
           id: 9cd952be-a29d-4a90-a230-5ba015974584
           partitioner:
analytics_enabled: false
data_context_id: e10c5839-a0de-4edc-a051-0c43c2a83f71

My code:

from sqlalchemy import create_engine
import great_expectations as gx
import pandas as pd

context = gx.get_context()

engine = create_engine(context.config_variables['my_connection_string'])

# pd.read_sql_query already returns a DataFrame
df = pd.read_sql_query("select * from my_table_data", engine)

"""Retrieve the dataframe Batch Definition"""

data_source_name = "my_data_pandas"
data_asset_name = "pandas_asset"
batch_definition_name = "cr_pandas_batch"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)

batch_parameters = {"dataframe": df}


definition_name = "my_data_vd"
validation_definition = context.validation_definitions.get(definition_name)

"""Get the dataframe as a Batch"""

batch = batch_definition.get_batch(batch_parameters=batch_parameters)

result = validation_definition.run()
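To back up the claim that the dataframe itself has no duplicated field names, a quick check (with a toy frame standing in for the real extract) is:

```python
import pandas as pd

# Toy frame standing in for the real my_table_data extract (assumed columns)
df = pd.DataFrame({"field1": [1, None], "field2": [3, 4]})

# Columns that occur more than once in the frame
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)  # an empty list means no duplicated field names
```

Running the same check on the real dataframe before validation confirms the duplicate alias originates in the generated SQL, not in the data.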

Expected behavior
The full expectation suite should validate successfully; my use case requires the complete validation with the Pandas data source.

Environment:

  • Operating System: tested on Windows and Linux
  • Great Expectations Version: 1.3.5
  • Data Source: Pandas, SQL, filesystem
  • Cloud environment: none

Additional context
It works if the expectation suite contains only one expectation (with all three data sources); with multiple expectations in the suite it works only with the filesystem data source. My use case requires the complete validation of the dataset with the Pandas data source; the other data sources were added only for testing. I checked the requirements.txt in this repository and my environment is aligned with it.
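Since the failure only appears when several expectations are compiled into a single SELECT, one generic way to avoid the collision (a sketch of the idea, not the Great Expectations implementation) is to suffix repeated aliases before they reach the backend:

```python
def dedupe_aliases(aliases):
    """Append a numeric suffix to repeated SQL aliases so each is unique."""
    seen = {}
    out = []
    for alias in aliases:
        count = seen.get(alias, 0)
        seen[alias] = count + 1
        out.append(alias if count == 0 else f"{alias}_{count}")
    return out

metric = "column_values.nonnull.unexpected_count"
print(dedupe_aliases([metric] * 3))
# → ['column_values.nonnull.unexpected_count',
#    'column_values.nonnull.unexpected_count_1',
#    'column_values.nonnull.unexpected_count_2']
```

With unique aliases the batched query would satisfy the view-schema uniqueness check that the JDBC layer enforces.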
