Description
Bug description
An SQL execution exception occurs when executing validations with SQL and Pandas data sources.
I created a File Data Context and connected it to SQL data, filesystem data, and Pandas DataFrames. I defined my expectation suite, created a validation definition, and ran it. While executing
validation_results = validation_definition.run()
with the SQL and Pandas data sources, I get this error:
Calculating Metrics: 86%|████████▌ | 24/28 [00:01<00:00, 20.39it/s]An SQL execution Exception occurred. OperationalError: "(psycopg2.OperationalError) **Duplicated field name in view schema: column_values.nonnull.unexpected_count**
DETAIL: java.sql.SQLException: Duplicated field name in view schema: column_values.nonnull.unexpected_count
[SQL: SELECT sum(CASE WHEN (field1 IS NULL) THEN %(param_1)s ELSE %(param_2)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field2 IS NULL) THEN %(param_3)s ELSE %(param_4)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field3 IS NULL) THEN %(param_5)s ELSE %(param_6)s END) AS "column_values.nonnull.unexpected_count", sum(CASE WHEN (field4 IS NULL) THEN %(param_7)s ELSE %(param_8)s END) AS "column_values.nonnull.unexpected_count"
FROM (SELECT *
FROM (SELECT * from my_table_data) AS anon_1
WHERE true) AS anon_1]
[parameters: {'param_1': 1, 'param_2': 0, 'param_3': 1, 'param_4': 0, 'param_5': 1, 'param_6': 0, 'param_7': 1, 'param_8': 0}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)". Traceback: "Traceback (most recent call last):
File "d:\Veronica\Anaconda\envs\gxenv\lib\site-packages\sqlalchemy\engine\base.py", line 1900, in _execute_context
self.dialect.do_execute(
File "d:\Veronica\Anaconda\envs\gxenv\lib\site-packages\sqlalchemy\engine\default.py", line 736, in do_execute
cursor.execute(statement, parameters)
psycopg2.OperationalError: Duplicated field name in view schema: column_values.nonnull.unexpected_count
DETAIL: java.sql.SQLException: Duplicated field name in view schema: column_values.nonnull.unexpected_count
The dataframe does not have duplicated field names.
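For reference, the failing SQL aggregates NULL counts for four columns under one and the same alias, which matches a suite with four not-null expectations. A suite shaped like the following (column names taken from the error, suite name hypothetical) should be enough to hit the collision, since each ExpectColumnValuesToNotBeNull contributes its own column_values.nonnull.unexpected_count metric:

import great_expectations as gx
import great_expectations.expectations as gxe

context = gx.get_context()

# Hypothetical suite: four not-null expectations, one per column, mirroring the
# four identical "column_values.nonnull.unexpected_count" aliases in the failing SQL.
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
for column in ["field1", "field2", "field3", "field4"]:
    suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column=column))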
To Reproduce
great_expectations.yml config:
config_version: 4.0
config_variables_file_path: uncommitted/config_variables.yml
plugins_directory: plugins/
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  validation_definition_store:
    class_name: ValidationDefinitionStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: validation_definitions/
expectations_store_name: expectations_store
validation_results_store_name: validation_results_store
checkpoint_store_name: checkpoint_store
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
fluent_datasources:
  my_ds:
    type: sqlite
    id: 0717e05c-7846-40ca-8b77-834fa9f74d0b
    assets:
      my_data_asset:
        type: query
        id: d891b4e5-104c-481d-836c-aa8e4395691c
        batch_metadata: {}
        batch_definitions:
          FULL_TABLE:
            id: 4814ef8f-38d8-4f04-ad89-d35f8f2a0fc1
            partitioner:
        query: SELECT * from mytable
    connection_string: ${myconnectionstring}
  employee_department:
    type: pandas_filesystem
    id: fb484e5a-d57f-444b-897f-2af191d8fe16
    assets:
      employee_department:
        type: csv
        id: 5ce9d256-4b94-4e02-b2b9-135765006abb
    base_directory: D:\Veronica\PyCharmProjects\gx_test_135
  dq:
    type: sqlite
    id: acc48a7b-f5af-425c-9dfc-6d7685283c80
    assets:
      dq_asset:
        type: query
        id: 52c24fe5-1c44-4e60-a4f0-b356b7bb9c53
        batch_metadata: {}
        batch_definitions:
          FULL_TABLE:
            id: 51874da9-477a-4f79-92af-e8e472bd4c6a
            partitioner:
        query: SELECT * from asset_name = myasset
    connection_string: sqlite:///D:\Veronica\PyCharmProjects\ds-gxupgrade\dq.db
  dq_employee:
    type: pandas_filesystem
    id: ab12ba8f-61ea-45ef-a4b3-28ed985629c4
    assets:
      employee_department:
        type: csv
        id: 29da965b-7516-4f82-8ffe-ad4b2685cc12
        batch_definitions:
          employee_department:
            id: 54381107-2029-4eca-801c-dbd09fe73353
            partitioner:
              regex: employee_department.csv$
              param_names: []
    base_directory: D:\Veronica\PyCharmProjects\ds-gxupgrade
  my_data_pandas:
    type: pandas
    id: 2d65fc95-0040-4a5b-bfb9-433288e9918a
    assets:
      pandas_asset:
        type: dataframe
        id: 02b58697-1ca2-4b46-ad6b-b80bf6209ad2
        batch_metadata: {}
        batch_definitions:
          cr_pandas_batch:
            id: 9cd952be-a29d-4a90-a230-5ba015974584
            partitioner:
analytics_enabled: false
data_context_id: e10c5839-a0de-4edc-a051-0c43c2a83f71
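For context, the fluent datasources above correspond to calls along these lines (a sketch reconstructed from the YAML, not the exact code that was used; names match the config):

import great_expectations as gx

context = gx.get_context()

# SQL datasource with a query asset and a whole-table batch definition.
ds = context.data_sources.add_sqlite(
    name="my_ds", connection_string="${myconnectionstring}"
)
asset = ds.add_query_asset(name="my_data_asset", query="SELECT * from mytable")
asset.add_batch_definition_whole_table(name="FULL_TABLE")

# Pandas datasource with a dataframe asset and a whole-dataframe batch definition.
pandas_ds = context.data_sources.add_pandas(name="my_data_pandas")
df_asset = pandas_ds.add_dataframe_asset(name="pandas_asset")
df_asset.add_batch_definition_whole_dataframe(name="cr_pandas_batch")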
My code:
from sqlalchemy import create_engine
import great_expectations as gx
import pandas as pd

context = gx.get_context()

# Load the source table into a Pandas dataframe
# (read_sql_query already returns a DataFrame).
engine = create_engine(context.config_variables['my_connection_string'])
df = pd.read_sql_query("select * from my_table_data", engine)

# Retrieve the dataframe Batch Definition
data_source_name = "my_data_pandas"
data_asset_name = "pandas_asset"
batch_definition_name = "cr_pandas_batch"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)
batch_parameters = {"dataframe": df}

definition_name = "my_data_vd"
validation_definition = context.validation_definitions.get(definition_name)

# Get the dataframe as a Batch and run the validation
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
result = validation_definition.run()
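As an aside, when the validation definition is bound to the dataframe asset, GX 1.x takes the batch parameters on the run call itself; a minimal sketch, assuming the same names as above:

# Pass the dataframe to the run call when the validation definition targets
# the "pandas_asset" dataframe asset (names assumed from the config above).
result = validation_definition.run(batch_parameters={"dataframe": df})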
Expected behavior
The validation should complete successfully; my use case requires the complete validation with the Pandas data source.
Environment:
- Operating System: tested on Windows and Linux
- Great Expectations Version: 1.3.5
- Data Source: Pandas, SQL, filesystem
- Cloud environment: none
Additional context
It works if the expectation suite contains only one expectation (with all three data sources); with multiple expectations in the suite, it works only with the filesystem data source. My use case requires the complete validation of the dataset with the Pandas data source; the other data sources were added only for testing. I checked the requirements.txt in this repository and my environment is aligned with it.
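Since a single-expectation suite validates fine, a temporary workaround might be to run each expectation through its own one-expectation validation definition and collect the results; a rough sketch, assuming the suite, batch definition, and dataframe from the code above (suite and part names are hypothetical):

import great_expectations as gx

# Workaround sketch: one expectation per suite, so each validation issues a
# single metric query and the "column_values.nonnull.unexpected_count" alias
# cannot collide with itself.
suite = context.suites.get("my_suite")  # suite name assumed
results = []
for i, expectation in enumerate(suite.expectations):
    single = context.suites.add(gx.ExpectationSuite(name=f"my_suite_part_{i}"))
    # Re-adding the expectation to a fresh suite; in practice it may need to
    # be copied first so it does not keep the original suite's id.
    single.add_expectation(expectation)
    vd = context.validation_definitions.add(
        gx.ValidationDefinition(
            name=f"my_data_vd_part_{i}",
            data=batch_definition,
            suite=single,
        )
    )
    results.append(vd.run(batch_parameters={"dataframe": df}))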