
📝 Table of Contents

Great Expectations - Open Source Data Quality Tool

Great Expectations fills a void that exists in our data pipelines. As we transition from Big Data to Good Data, teams have begun to realize the importance of good data. But the definition of "good" has also evolved over time. We want metrics that define the health of our pipeline and beyond.

Concepts in Great Expectations

  1. Data Context
    1. Data Sources
    2. Data Connectors
  2. Stores (see the config sketch after this list)
    1. Expectations Store
      • Expectation suites are stored here.
      • Backend: Azure Storage
    2. Validations Store
      • Validation results are stored here. A validation result is the output produced when an expectation suite is run against a batch of data.
      • Backend: Azure Storage
    3. Evaluation Parameter Store
      • Yet to figure out.
      • I want to compare today's row count with yesterday's row count for a specific batch. How do I do that?
    4. Profile Store
      • There is still work happening on this one.
      • What does this mean?
    5. Metrics Store
      • Metrics extracted from validation results are stored here.
      • Backend: Postgres
      • Only some metrics are being inserted.
      • This feature is still evolving.
    6. Checkpoint Store
      • Checkpoint definitions are stored here.
      • Backend: Azure Storage
  3. Data Docs Sites
    • Backend: Azure Storage
    • The static HTML site is built here.
  4. Checkpoints
    • Backend: Azure Storage
    • The complete YAML files are persisted.
    • Email alerts can be defined.
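To make the store layout above concrete, here is a minimal sketch of an in-code Data Context configuration (GE 0.13-style), assuming the Azure and Postgres backends described above. The container name, prefixes, connection-string variable, and database credentials are all placeholders, not values from the actual setup.

```python
# Minimal sketch: a Data Context configured in code, with expectations,
# validations, checkpoints, and Data Docs in Azure Blob Storage and a
# metrics store in Postgres. All names and credentials are placeholders.
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig

# Shared Azure Blob Storage backend settings (hypothetical container).
azure_backend = {
    "class_name": "TupleAzureBlobStoreBackend",
    "container": "ge-artifacts",
    "connection_string": "${AZURE_STORAGE_CONNECTION_STRING}",
}

project_config = DataContextConfig(
    stores={
        "expectations_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {**azure_backend, "prefix": "expectations"},
        },
        "validations_store": {
            "class_name": "ValidationsStore",
            "store_backend": {**azure_backend, "prefix": "validations"},
        },
        "checkpoint_store": {
            "class_name": "CheckpointStore",
            "store_backend": {**azure_backend, "prefix": "checkpoints"},
        },
        "evaluation_parameter_store": {
            "class_name": "EvaluationParameterStore",
        },
        "metric_store": {
            "class_name": "MetricStore",
            "store_backend": {
                "class_name": "DatabaseStoreBackend",
                "credentials": {
                    "drivername": "postgresql",
                    "host": "db-host",  # hypothetical
                    "port": "5432",
                    "username": "ge_user",  # hypothetical
                    "password": "${GE_DB_PASSWORD}",
                    "database": "metrics",  # hypothetical
                },
            },
        },
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    checkpoint_store_name="checkpoint_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "az_site": {
            "class_name": "SiteBuilder",
            "store_backend": {**azure_backend, "prefix": "data_docs"},
            "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
        }
    },
)

context = BaseDataContext(project_config=project_config)
```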

Installation

Installation was a cakewalk. I tried this on the Databricks Community Edition.
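For reference, installing it in a Databricks notebook is a single cell:

```python
# Install the PyPI package onto the cluster from a notebook cell.
%pip install great-expectations
```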

Notebooks reference

  1. Set up the cluster by installing the required libraries: Setup cluster
  2. Define the BaseDataContext for GE: Setup Data Context
  3. Define expectations and run tests: Data Quality (see the sketch below)
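As a flavor of that last notebook, here is a minimal sketch using the SparkDFDataset wrapper, which works well in Databricks. The table path and column name are hypothetical, and `spark` is the session Databricks provides.

```python
# Minimal sketch: wrap a Spark DataFrame and run a couple of expectations.
from great_expectations.dataset import SparkDFDataset

df = spark.read.format("delta").load("/mnt/data/orders")  # hypothetical path
gdf = SparkDFDataset(df)

gdf.expect_column_values_to_not_be_null("order_id")  # hypothetical column
gdf.expect_table_row_count_to_be_between(min_value=1, max_value=None)

results = gdf.validate()  # runs all expectations recorded on the wrapper
print(results.success)
```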

Data Docs

Please have a look at the Data Docs that GE is able to build: Data Docs
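After validations run, rebuilding the static site from the context is a one-liner:

```python
# Rebuild the static Data Docs site from the stores configured on the context.
context.build_data_docs()
```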

Advantages (from my perspective)

  1. The open source version is complete in terms of defining data quality checks, executing them, and persisting both the expectations and the results. Finally, the Data Docs make them presentable.
  2. Helps in building a static website for the data docs, which can be shared across the company/team.
  3. Can easily translate results to a Spark DataFrame for persisting them in a database (see the sketch after this list).
  4. Can build a wrapper around the configs to make onboarding easier.
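A hypothetical sketch of point 3, continuing from the `results` object in the earlier validation sketch; the target table name is made up.

```python
# Flatten each expectation's result into JSON, load it as a Spark DataFrame,
# and append it to a (hypothetical) database table.
import json

rows = [json.dumps(r.to_json_dict()) for r in results.results]
results_df = spark.read.json(sc.parallelize(rows))

results_df.write.mode("append").saveAsTable("dq.validation_results")
```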

Open Questions or ToDo:

  • How to define and leverage Evaluation Parameter Store?
  • What is SqlAlchemyQueryStore?
  • What is HtmlSiteStore?
  • What does the error `Unrecognized urn_type in ge_urn: must be 'stores' to use a metric store.` mean?
  • How to define a database backend for the different stores?
  • Can we filter out rows which fail an expectation and ingest only the valid rows?
  • How to define Data Sources like S3 storage, Azure Blob Storage, Snowflake, and Postgres? (Add an example)

✍️ Authors

🎉 Acknowledgements

  • Great expectations with Azure Backend