A brief tutorial on Great Expectations, a Python tool providing batteries-included data validation. It includes tooling for testing, profiling and documenting your data, and integrates with many backends such as pandas dataframes, Apache Spark, SQL databases, data warehousing solutions such as Snowflake, and cloud storage offerings (S3, Azure Blob Storage, GCS). This tutorial covers the main concepts you'll need to know to use Great Expectations, gently walking you through writing and running your first expectation suite.
If anything is incomplete or unclear, don't hesitate to open an issue!
If you'd just like to read along, open tutorial_great_expectations.ipynb in the repository and you're good to go! We made sure all important output is available online.
If you'd like to run the tutorial without running anything on your own machine, you can open it in Google Colab.
If you have Docker installed, you can pull our container to run the tutorial:
docker pull dataroots/tutorial-great-expectations && docker run -it --rm -p 8888:8888 dataroots/tutorial-great-expectations
Alternatively, clone this repository and build the container yourself:
docker build . -t tutorial-great-expectations && docker run -it --rm -p 8888:8888 tutorial-great-expectations
Next, copy the URL on the last line of the output into your favorite web browser and navigate to the tutorial_great_expectations notebook.
Enjoy the ride!
To run the tutorial on your own machine, we recommend using a virtual environment.
- Clone the repository
- Install the dependencies:
pip install -r requirements.txt
- Run
jupyter notebook
in the root directory, then navigate to the tutorial_great_expectations notebook.
If you see AttributeError: module 'great_expectations' has no attribute data_context, you probably do not have Great Expectations installed. Make sure that it is installed and restart your kernel to fix this.
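To check for the error above before opening the notebook, a small sanity check along these lines may help. It is only a sketch: the helper name module_has_attribute is hypothetical (it is not part of Great Expectations), and it simply tests whether the imported module exposes the attribute named in the error message.

```python
import importlib


def module_has_attribute(module_name: str, attribute: str) -> bool:
    """Return True if the module can be imported and exposes the attribute.

    Hypothetical helper for illustration only; not part of Great Expectations.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed in the current environment.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # "data_context" is the attribute named in the AttributeError above.
    if module_has_attribute("great_expectations", "data_context"):
        print("Great Expectations looks correctly installed.")
    else:
        print("Great Expectations is missing or incomplete; "
              "try: pip install -r requirements.txt")
```

Run it with the same interpreter (and virtual environment) that Jupyter uses, since a package installed in a different environment will not be visible to the notebook kernel.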
Avocado dataset provided by the Hass Avocado Board, https://hassavocadoboard.com/volume-data-projections/ .