From 4a4ecdf93caefc466ad0e44bec5d8a575ce346c4 Mon Sep 17 00:00:00 2001
From: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Date: Wed, 17 Apr 2024 14:20:34 +0100
Subject: [PATCH] Document best practices for writing tests for nodes and pipelines (#3782)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Update kedro-catalog-0.19.json (#3724)
* Update kedro-catalog-0.19.json
Signed-off-by: Anthony De Bortoli
* Update set_up_vscode.md
Signed-off-by: Anthony De Bortoli
---------
Signed-off-by: Anthony De Bortoli
Signed-off-by: Ahdra Merali

* Update project tests directory structure in docs
Signed-off-by: Ahdra Merali

* Add docs on writing tests
Signed-off-by: Ahdra Merali

* Drop dependency on toposort in favour of built-in graphlib (#3728)
* Replace toposort with graphlib (built-in from Python 3.9)
Signed-off-by: Ivan Danov
* Create toposort groups only when needed
Signed-off-by: Ivan Danov
* Update RELEASE.md and graphlib version constraints
Signed-off-by: Ivan Danov
* Remove mypy-toposort
Signed-off-by: Ivan Danov
* Ensure that the suggest resume test has no node ordering requirement
Signed-off-by: Ivan Danov
* Ensure stable toposorting by grouping and ungrouping the result
Signed-off-by: Ivan Danov
---------
Signed-off-by: Ivan Danov
Signed-off-by: Ahdra Merali

* Optimise pipeline addition and creation (#3730)
* Create toposort groups only when needed
* Ensure that the suggest resume test has no node ordering requirement
* Ensure stable toposorting by grouping and ungrouping the result
* Delay toposorting until pipeline.nodes is used
* Avoid using .nodes when topological order or new copy is unneeded
* Copy the nodes only if tags are provided
* Remove unnecessary condition in self.nodes
Signed-off-by: Ivan Danov
Signed-off-by: Ahdra Merali

* Expand robots.txt for Kedro-Viz and Kedro-Datasets docs (#3729)
* Add project to robots.txt
Signed-off-by: Dmitry Sorokin
* Add EOF
Signed-off-by: Dmitry Sorokin
---------
Signed-off-by: Dmitry Sorokin
Co-authored-by: Juan Luis Cano Rodríguez
Signed-off-by: Ahdra Merali

* Kedro need more uv (#3740)
* Kedro need more uv
Signed-off-by: Nok
* remove docker
Signed-off-by: Nok
---------
Signed-off-by: Nok
Signed-off-by: Ahdra Merali

* Resolve all path in Kedro (#3742)
* Kedro need more uv
Signed-off-by: Nok
* remove docker
Signed-off-by: Nok
* fix broken type hint and resolve project path
Signed-off-by: Nok Lam Chan
* fix type hint
Signed-off-by: Nok Lam Chan
* remove duplicate logic
Signed-off-by: Nok Lam Chan
* adding nok.py is definitely an accident
Signed-off-by: Nok Lam Chan
* fix test
Signed-off-by: Nok Lam Chan
* remove print
Signed-off-by: Nok Lam Chan
* add test
Signed-off-by: Nok
---------
Signed-off-by: Nok
Signed-off-by: Nok Lam Chan
Signed-off-by: Ahdra Merali

* Remove settings of rate limits and retries (#3769)
* double linkcheck limits
Signed-off-by: Nok Lam Chan
* fix ratelimit
Signed-off-by: Nok
---------
Signed-off-by: Nok Lam Chan
Signed-off-by: Nok
Signed-off-by: Ahdra Merali

* Improve resume suggestions (#3719)
* Improve suggestions to resume a failed pipeline
- if dataset (or param) is persistent & shared, don't keep looking for ancestors
- only look for ancestors producing impersistent inputs
- minimize number of suggested nodes (= shorter message for the same pipeline)
- testable logic, tests cases outside of scenarios for sequential runner
- Use _EPHEMERAL attribute
- Move tests to separate file
- Docstring updates
---------
Signed-off-by: Ondrej Zacha
Co-authored-by: Nok Lam Chan
Signed-off-by: Ahdra Merali

* Build docs fix (#3773)
* Ignored forbidden url
Signed-off-by: Elena Khaustova
* Returned linkscheck retries
Signed-off-by: Elena Khaustova
* Removed odd comment
Signed-off-by: Elena Khaustova
---------
Signed-off-by: Elena Khaustova
Signed-off-by: Ahdra Merali

* Clarify docs around custom resolvers (#3759)
* Updated custom resolver docs section
Signed-off-by: Elena Khaustova
* Updated advanced configuration section for consistency
Signed-off-by: Elena Khaustova
* Updated RELEASE.md
Signed-off-by: Elena Khaustova
* Updated RELEASE.md
Signed-off-by: Elena Khaustova
* Test linkcheck_workers decrease
Signed-off-by: Elena Khaustova
* Increased the By default, the linkcheck_rate_limit_timeout to default
Signed-off-by: Elena Khaustova
* Returned old docs build settings
Signed-off-by: Elena Khaustova
* Fixed typo
Signed-off-by: Elena Khaustova
* Ignore forbidden url
Signed-off-by: Elena Khaustova
* Returned linkcheck retries
Signed-off-by: Elena Khaustova
---------
Signed-off-by: Elena Khaustova
Signed-off-by: Ahdra Merali

* Add mlruns to gitignore to avoid pushing mlflow local runs to github (#3765)
* Add mlruns to gitignore to avoid pushing mlflow local runs to github
Signed-off-by: Yolan Honoré-Rougé
* update release.md
Signed-off-by: Yolan Honoré-Rougé
---------
Signed-off-by: Yolan Honoré-Rougé
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Nok Lam Chan
Signed-off-by: Ahdra Merali

* Update the dependencies page in the docs (#3772)
* Update the dependencies page
Signed-off-by: Ankita Katiyar
* Update docs/source/kedro_project_setup/dependencies.md
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Jo Stichbury
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
* Fix lint
Signed-off-by: Ankita Katiyar
* Move the last line to notes
Signed-off-by: Ankita Katiyar
---------
Signed-off-by: Ankita Katiyar
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Jo Stichbury
Signed-off-by: Ahdra Merali

* Change pipeline test location to project root/tests (#3731)
* Change pipeline test location to project root/tests
Signed-off-by: lrcouto
* Fix some test_pipeline tests
Signed-off-by: lrcouto
* Change delete pipeline to account for new structure
Signed-off-by: lrcouto
* Fix some tests
Signed-off-by: lrcouto
* Change tests path on micropkg
Signed-off-by: lrcouto
* Fix remaining tests
Signed-off-by: lrcouto
* Add changes to release notes
Signed-off-by: lrcouto
* Update file structure on micropackaging doc page
Signed-off-by: lrcouto
---------
Signed-off-by: lrcouto
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Signed-off-by: Ahdra Merali

* Add an option for kedro new to skip telemetry (#3701)
* First draft for telemetry consent flag on kedro new
Signed-off-by: lrcouto
* Add functioning --telemetry option to kedro new
Signed-off-by: lrcouto
* Update tests to acknowledge new flag
Signed-off-by: lrcouto
* Add tests for kedro new --telemetry flag
Signed-off-by: lrcouto
* Add changes to documentation and release notes
Signed-off-by: lrcouto
* Minor change to docs
Signed-off-by: lrcouto
* Lint
Signed-off-by: lrcouto
* Remove outdated comment and correct type hint
Signed-off-by: lrcouto
* Update docs/source/get_started/new_project.md
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
* Lint
Signed-off-by: lrcouto
* Minor change on release note
Signed-off-by: lrcouto
---------
Signed-off-by: lrcouto
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Ahdra Merali

* Update documentation for OmegaConfigLoader (#3778)
* Update documentation for OmegaConfigLoader
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
* Update RELEASE.md
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
* Update RELEASE.md
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
* Update ignore-names.txt
Signed-off-by: Juan Luis Cano Rodríguez
---------
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
Signed-off-by: Juan Luis Cano Rodríguez
Co-authored-by: Juan Luis Cano Rodríguez
Signed-off-by: Ahdra Merali

* Fix path
Signed-off-by: Ahdra Merali
* Lint
Signed-off-by: Ahdra Merali
* Add changes to RELEASE.md
Signed-off-by: Ahdra Merali
* Address comments from code review
Signed-off-by: Ahdra Merali
* Empty
Signed-off-by: Juan Luis Cano Rodríguez
Signed-off-by: Ahdra Merali
* Remove unneeded imports
Signed-off-by: Ahdra Merali
* Change recommendation from pytest config to editable install
Signed-off-by: Ahdra Merali
* Add negative testing example
Signed-off-by: Ahdra Merali
* Replace Dict with dict
Signed-off-by: Ahdra Merali
* Remove test classes
Signed-off-by: Ahdra Merali
* Change the assert step for the integration test
Signed-off-by: Ahdra Merali

* Fix error handling for OmegaConfigLoader (#3784)
* Update omegaconf_config.py
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
* Update RELEASE.md
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
* add a more complicated test case
Signed-off-by: Nok
---------
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
Signed-off-by: Nok
Co-authored-by: Nok
Signed-off-by: Ahdra Merali

* Add Simon Brugman to TSC (#3780)
Signed-off-by: Juan Luis Cano Rodríguez
Signed-off-by: Ahdra Merali

* Update technical_steering_committee.md (#3796)
Signed-off-by: Marcin Zabłocki
Signed-off-by: Ahdra Merali

* Remove jmespath dependency (#3797)
Signed-off-by: Merel Theisen
Signed-off-by: Ahdra Merali

* Update spaceflights tutorial and starter requirements for kedro-datasets optional dependencies (#3664)
* Update spaceflights tutorial and starter requirements
Signed-off-by: lrcouto
* fix e2e tests
Signed-off-by: lrcouto
* Fix e2e tests by distinguishing `kedro-datasets` dependency for different python versions (#3802)
Signed-off-by: Merel Theisen
* Update docs/source/tutorial/tutorial_template.md
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
---------
Signed-off-by: lrcouto
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Signed-off-by: Merel Theisen
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Ahdra Merali

* Consider Vale's suggestions
Signed-off-by: Ahdra Merali
* Hide test in details
Signed-off-by: Ahdra Merali
* Quick fix
Signed-off-by: Ahdra Merali
* Typo (and wording changes)
Signed-off-by: Ahdra Merali

* Update robots.txt (#3803)
Signed-off-by: Ankita Katiyar
Co-authored-by: Juan Luis Cano Rodríguez
Signed-off-by: Ahdra Merali

* Add a test for transcoding loops of 1 or more nodes (#3810)
Signed-off-by: Ivan Danov
Signed-off-by: Ahdra Merali

* Ensure no nodes can depend on themselves even when transcoding is used (#3812)
* Factor out transcoding helpers into a private module
Signed-off-by: Ivan Danov
* Ensure node input/output validation doesn't allow transcoded self-loops
Signed-off-by: Ivan Danov
* Updated release note to avoid github warning
Signed-off-by: Elena Khaustova
---------
Signed-off-by: Ivan Danov
Signed-off-by: Elena Khaustova
Co-authored-by: Elena Khaustova
Signed-off-by: Ahdra Merali

* Update UUID telemetry docs (#3805)
Signed-off-by: Dmitry Sorokin
Signed-off-by: Ahdra Merali

* Change path to starters test (#3816)
Signed-off-by: lrcouto
Signed-off-by: Ahdra Merali

* Move changes in RELEASE.md to docs section
Signed-off-by: Ahdra Merali
* Change formatting
Signed-off-by: Ahdra Merali
* Revert "Change formatting"
This reverts commit 9582a2282184fe3746bd1f888a99074bc4be9e68.
Signed-off-by: Ahdra Merali
* Apply changes from code review
Signed-off-by: Ahdra Merali
* Add explanation on why cleanup isn't needed
Signed-off-by: Ahdra Merali
* Change assert on successful pipeline to check logs
Signed-off-by: Ahdra Merali
* Update description of integration test under pipeline slicing
Signed-off-by: Ahdra Merali
* Missing formatting
Signed-off-by: Ahdra Merali
* Update tests directory structure
Signed-off-by: Ahdra Merali

---------

Signed-off-by: Anthony De Bortoli
Signed-off-by: Ahdra Merali
Signed-off-by: Ivan Danov
Signed-off-by: Dmitry Sorokin
Signed-off-by: Nok
Signed-off-by: Nok Lam Chan
Signed-off-by: Ondrej Zacha
Signed-off-by: Elena Khaustova
Signed-off-by: Yolan Honoré-Rougé
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Ankita Katiyar
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: lrcouto
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
Signed-off-by: Juan Luis Cano Rodríguez
Signed-off-by: Marcin Zabłocki
Signed-off-by: Merel Theisen
Signed-off-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Anthony De Bortoli
Co-authored-by: Ivan Danov
Co-authored-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Juan Luis Cano Rodríguez
Co-authored-by: Nok Lam Chan
Co-authored-by: Ondrej Zacha
Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Co-authored-by: Yolan Honoré-Rougé <29451317+Galileo-Galilei@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Jo Stichbury
Co-authored-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Co-authored-by: Puneet Saini <99470400+puneeter@users.noreply.github.com>
Co-authored-by: Marcin Zabłocki
Co-authored-by: Elena Khaustova
---
 RELEASE.md                                    |   5 +-
 docs/source/development/automated_testing.md  |  32 +-
 docs/source/tutorial/add_another_pipeline.md  |   4 +-
 docs/source/tutorial/spaceflights_tutorial.md |   1 +
 docs/source/tutorial/test_a_project.md        | 446 ++++++++++++++++++
 5 files changed, 474 insertions(+), 14 deletions(-)
 create mode 100644 docs/source/tutorial/test_a_project.md

diff --git a/RELEASE.md b/RELEASE.md
index fe4bc45e46..0db60ca6f9 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -6,7 +6,7 @@
 * Cookiecutter errors are shown in short format without the `--verbose` flag.
 * Kedro commands now work from any subdirectory within a Kedro project.
 * Kedro CLI now provides a better error message when project commands are run outside of a project i.e. `kedro run`
-* Adds the `--telemetry` flag to `kedro new`, allowing the user to register consent to have user analytics collected at the same time as the project is created.
+* Added the `--telemetry` flag to `kedro new`, allowing the user to register consent to have user analytics collected at the same time as the project is created.
 * Dropped the dependency on `toposort` in favour of the built-in `graphlib` module.
 * Improved the performance of `Pipeline` object creation and summing.
 * Improved suggestions to resume failed pipeline runs.
@@ -24,7 +24,8 @@
 * Methods `_is_project` and `_find_kedro_project` have been moved to `kedro.utils`. We recommend not using private methods in your code, but if you do, please update your code to use the new location.
 
 ## Documentation changes
-* Add missing description for `merge_strategy` argument in OmegaConfigLoader.
+* Added missing description for `merge_strategy` argument in OmegaConfigLoader.
+* Added documentation on best practices for testing nodes and pipelines.
 
 ## Community contributions
 Many thanks to the following Kedroids for contributing PRs to this release:

diff --git a/docs/source/development/automated_testing.md b/docs/source/development/automated_testing.md
index fcca80abfa..ed3efe3287 100644
--- a/docs/source/development/automated_testing.md
+++ b/docs/source/development/automated_testing.md
@@ -19,6 +19,17 @@ There are many testing frameworks available for Python. One of the most popular
 
 Let's look at how you can start working with `pytest` in your Kedro project.
 
+### Prerequisite: Install your Kedro project
+
+Before getting started with `pytest`, it is important to ensure you have installed your project locally. This allows you to test different parts of your project by importing them into your test files.
+
+To install your project, navigate to your project root and run the following command:
+
+```bash
+pip install -e .
+```
+
+>**NOTE**: The option `-e` installs an editable version of your project, allowing you to make changes to the project files without needing to re-install them each time.
 ### Install `pytest`
 
 Install `pytest` as you would install other packages with `pip`, making sure your [project's virtual environment is active](../get_started/install.md#create-a-virtual-environment-for-your-kedro-project).
@@ -29,15 +40,15 @@
 pip install pytest
 ```
 
 ### Create a `/tests` directory
 
-Now that `pytest` is installed, you will need a place to put your tests. Create a `/tests` folder in the `/src` directory of your project.
+Now that `pytest` is installed, you will need a place to put your tests. Create a `/tests` folder in the root directory of your project.
 
 ```bash
-mkdir src/tests
+mkdir tests
 ```
 
 ### Test directory structure
 
-The subdirectories in your project's `/tests` directory should mirror the directory structure of your project's `/src/` directory. All files in the `/tests` folder should follow the `test_*.py` naming convention. See an example `/src` folder below.
+The subdirectories in your project's `/tests` directory should mirror the directory structure of your project's `/src/` directory. All files in the `/tests` folder should follow the `test_*.py` naming convention. See an example `/tests` folder below.
 
 ```
 src
@@ -49,12 +60,12 @@
 │   ...
 └───spaceflights
 │   └───pipelines
 │       └───dataprocessing
 │       │   ...
 │       │   nodes.py
 │       │   ...
 │
-└───tests
-│   └───pipelines
-│       └───dataprocessing
-│       │   ...
-│       │   test_nodes.py
-│       │   ...
+tests
+└───pipelines
+│   └───dataprocessing
+│   │   ...
+│   │   test_nodes.py
+│   │   ...
 ```
 
 ### Create an example test
@@ -96,6 +107,7 @@ Tests should be named as descriptively as possible, especially if you are workin
 
 You can read more about the [basics of using `pytest` on the getting started page](https://docs.pytest.org/en/7.1.x/getting-started.html). For help writing your own tests and using all of the features of `pytest`, see the [project documentation](https://docs.pytest.org/).
 
+
 ### Run your tests
 
 To run your tests, run `pytest` from within your project's root directory.
@@ -112,7 +124,7 @@ If you created the example test in the previous section, you should see the foll
 ...
 collected 1 item
 
-src/tests/test_run.py .                                                  [100%]
+tests/test_run.py .                                                      [100%]
 
 ============================== 1 passed in 0.38s ===============================
 ```

diff --git a/docs/source/tutorial/add_another_pipeline.md b/docs/source/tutorial/add_another_pipeline.md
index 96db9c913a..cebb658594 100644
--- a/docs/source/tutorial/add_another_pipeline.md
+++ b/docs/source/tutorial/add_another_pipeline.md
@@ -29,7 +29,7 @@ First, take a look at the functions for the data science nodes in `src/spaceflig
 ```python
 import logging
-from typing import Dict, Tuple
+from typing import Any, Tuple
 
 import pandas as pd
 from sklearn.linear_model import LinearRegression
@@ -37,7 +37,7 @@
 from sklearn.metrics import r2_score
 from sklearn.model_selection import train_test_split
 
 
-def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
+def split_data(data: pd.DataFrame, parameters: dict[str, Any]) -> Tuple:
     """Splits data into features and targets training and test sets.
 
     Args:

diff --git a/docs/source/tutorial/spaceflights_tutorial.md b/docs/source/tutorial/spaceflights_tutorial.md
index 69d257ab5f..f927fe67e9 100644
--- a/docs/source/tutorial/spaceflights_tutorial.md
+++ b/docs/source/tutorial/spaceflights_tutorial.md
@@ -15,6 +15,7 @@ tutorial_template
 set_up_data
 create_a_pipeline
 add_another_pipeline
+test_a_project
 package_a_project
 spaceflights_tutorial_faqs
 ```

diff --git a/docs/source/tutorial/test_a_project.md b/docs/source/tutorial/test_a_project.md
new file mode 100644
index 0000000000..ffa75d5f55
--- /dev/null
+++ b/docs/source/tutorial/test_a_project.md
@@ -0,0 +1,446 @@
+# Test a Kedro project
+
+It is important to test our Kedro projects to validate and verify that our nodes and pipelines behave as we expect them to. In this section, we look at some example tests for the spaceflights project.
+
+This section explains the following:
+
+* How to test a Kedro node
+* How to test a Kedro pipeline
+* Testing best practices
+
+
+This section does not cover:
+
+* Automating your tests - instead read our [automated testing documentation](../development/automated_testing.md).
+* More advanced features of testing, including [mocking](https://realpython.com/python-mock-library/#what-is-mocking) and [parameterising tests](https://docs.pytest.org/en/7.1.x/example/parametrize.html).
+
+
+## Writing tests for Kedro nodes: Unit testing
+
+Kedro expects node functions to be [pure functions](https://realpython.com/python-functional-programming/#what-is-functional-programming); a pure function is one whose output follows solely from its inputs, without any observable side effects. Testing these functions checks that a node will behave as expected - for a given set of input values, a node will produce the expected output. These tests are referred to as unit tests.
+
+Let us explore what this looks like in practice. Consider the node function `split_data` defined in the data science pipeline:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+def split_data(data: pd.DataFrame, parameters: dict[str, Any]) -> Tuple:
+    """Splits data into features and targets training and test sets.
+
+    Args:
+        data: Data containing features and target.
+        parameters: Parameters defined in parameters_data_science.yml.
+    Returns:
+        Split data.
+    """
+    X = data[parameters["features"]]
+    y = data["price"]
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
+    )
+    return X_train, X_test, y_train, y_test
+```
+
+</details>
+
+The function takes a pandas `DataFrame` and a dictionary of parameters as input, and splits the input data into four different data objects as per the parameters provided. We recommend following [pytest's anatomy of a test](https://docs.pytest.org/en/7.1.x/explanation/anatomy.html#anatomy-of-a-test), which breaks a test down into four steps: arrange, act, assert, and cleanup. For this specific function, these steps will be:
+
+1. Arrange: Prepare the inputs `data` and `parameters`.
+2. Act: Make a call to `split_data` and capture the outputs with `X_train`, `X_test`, `y_train`, and `y_test`.
+3. Assert: Ensure that the lengths of the outputs match the expected lengths.
+
+The cleanup step becomes necessary in a test when any of the previous steps make modifications that may influence other tests - e.g. by modifying a file used as input for several tests. This is not the case for the example tests below, and so the cleanup step is omitted.
+
+Remember to import the function being tested and any necessary modules at the top of the file.
+
+When we put these steps together, we have the following test:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+# NOTE: This example test is yet to be refactored.
+# A complete version is available under the testing best practices section.
+
+import pandas as pd
+from spaceflights.pipelines.data_science.nodes import split_data
+
+
+def test_split_data():
+    # Arrange
+    dummy_data = pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            "price": [120, 290, 30],
+        }
+    )
+
+    dummy_parameters = {
+        "model_options": {
+            "test_size": 0.2,
+            "random_state": 3,
+            "features": ["engines", "passenger_capacity", "crew"],
+        }
+    }
+
+    # Act
+    X_train, X_test, y_train, y_test = split_data(dummy_data, dummy_parameters["model_options"])
+
+    # Assert
+    assert len(X_train) == 2
+    assert len(y_train) == 2
+    assert len(X_test) == 1
+    assert len(y_test) == 1
+```
+
+</details>
+
+
+This test is an example of positive testing - it checks that a valid input produces the expected output. The inverse, testing that an invalid input is appropriately rejected, is called negative testing and is equally important.
+
+Using the same steps as above, we can write the following test to validate that an error is raised when price data is not available:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+# NOTE: This example test is yet to be refactored.
+# A complete version is available under the testing best practices section.
+
+import pandas as pd
+import pytest
+
+from spaceflights.pipelines.data_science.nodes import split_data
+
+
+def test_split_data_missing_price():
+    # Arrange
+    dummy_data = pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            # Note the missing price data
+        }
+    )
+
+    dummy_parameters = {
+        "model_options": {
+            "test_size": 0.2,
+            "random_state": 3,
+            "features": ["engines", "passenger_capacity", "crew"],
+        }
+    }
+
+    with pytest.raises(KeyError) as e_info:
+        # Act
+        X_train, X_test, y_train, y_test = split_data(dummy_data, dummy_parameters["model_options"])
+
+    # Assert
+    assert "price" in str(e_info.value)  # checks that the error is about the missing price data
+```
+
+</details>
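+
+If a test does need a cleanup step - for instance, because it writes a file that other tests might read - pytest fixtures can take care of teardown. The sketch below is a hypothetical illustration and not part of the spaceflights test suite: the CSV round trip is an assumed example, while `tmp_path` is a standard built-in pytest fixture that provides a fresh temporary directory per test.
+
+```python
+import pandas as pd
+
+
+def test_data_survives_csv_roundtrip(tmp_path):
+    # Arrange: write input data to a temporary file. Because tmp_path is a
+    # per-test temporary directory that pytest cleans up automatically,
+    # no explicit cleanup step is needed in this test.
+    data_file = tmp_path / "model_input.csv"
+    pd.DataFrame({"engines": [1, 2, 3], "price": [120, 290, 30]}).to_csv(data_file, index=False)
+
+    # Act
+    reloaded = pd.read_csv(data_file)
+
+    # Assert
+    assert len(reloaded) == 3
+```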
+
+## Writing tests for Kedro pipelines: Integration testing
+
+Writing tests for each node ensures that each node behaves as expected when run individually. However, we must also consider how the nodes in a pipeline interact with each other - this is called integration testing. Integration testing combines individual units as a group and checks whether they communicate, share data, and work together as expected. Let us look at this in practice.
+
+Consider the data science pipeline as a whole:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+from kedro.pipeline import Pipeline, node, pipeline
+from .nodes import evaluate_model, split_data, train_model
+
+
+def create_pipeline(**kwargs) -> Pipeline:
+    return pipeline(
+        [
+            node(
+                func=split_data,
+                inputs=["model_input_table", "params:model_options"],
+                outputs=["X_train", "X_test", "y_train", "y_test"],
+                name="split_data_node",
+            ),
+            node(
+                func=train_model,
+                inputs=["X_train", "y_train"],
+                outputs="regressor",
+                name="train_model_node",
+            ),
+            node(
+                func=evaluate_model,
+                inputs=["regressor", "X_test", "y_test"],
+                outputs=None,
+                name="evaluate_model_node",
+            ),
+        ]
+    )
+```
+
+</details>
+
+The pipeline takes a pandas `DataFrame` and a dictionary of parameters as input, splits the data in accordance with the parameters, and uses it to train and evaluate a regression model. With an integration test, we can validate that this sequence of nodes runs as expected.
+
+From earlier in this tutorial we know that a successful pipeline run concludes with the message `Pipeline execution completed successfully.` being logged. To validate this is logged in our tests, we make use of pytest's [`caplog`](https://docs.pytest.org/en/7.1.x/how-to/logging.html#caplog-fixture) fixture to capture the logs generated during the execution.
+
+As we did with our unit tests, we break this down into several steps:
+
+1. Arrange: Prepare the runner and its inputs `pipeline` and `catalog`, and any additional test setup.
+2. Act: Run the pipeline.
+3. Assert: Ensure a successful run message was logged.
+
+When we put this together, we get the following test:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+# NOTE: This example test is yet to be refactored.
+# A complete version is available under the testing best practices section.
+
+import logging
+import pandas as pd
+from kedro.io import DataCatalog
+from kedro.runner import SequentialRunner
+from spaceflights.pipelines.data_science import create_pipeline as create_ds_pipeline
+
+
+def test_data_science_pipeline(caplog):  # Note: caplog is passed as an argument
+    # Arrange pipeline
+    pipeline = create_ds_pipeline()
+
+    # Arrange data catalog
+    catalog = DataCatalog()
+
+    dummy_data = pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            "price": [120, 290, 30],
+        }
+    )
+
+    dummy_parameters = {
+        "model_options": {
+            "test_size": 0.2,
+            "random_state": 3,
+            "features": ["engines", "passenger_capacity", "crew"],
+        }
+    }
+
+    catalog.add_feed_dict(
+        {
+            "model_input_table": dummy_data,
+            "params:model_options": dummy_parameters["model_options"],
+        }
+    )
+
+    # Arrange the log testing setup
+    caplog.set_level(logging.DEBUG, logger="kedro")  # Ensure all logs produced by Kedro are captured
+    successful_run_msg = "Pipeline execution completed successfully."
+
+    # Act
+    SequentialRunner().run(pipeline, catalog)
+
+    # Assert
+    assert successful_run_msg in caplog.text
+```
+
+</details>
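+
+Checking the logs is one way to assert that the run succeeded. As an alternative sketch (an assumption based on the runner's behaviour rather than part of this tutorial): `SequentialRunner().run()` returns a dictionary of any pipeline outputs that are not registered in the catalog, and since `evaluate_model_node` produces no outputs here, an empty dictionary would also signal a completed run.
+
+```python
+# Hypothetical alternative assertion, reusing the pipeline and catalog above.
+output = SequentialRunner().run(pipeline, catalog)
+assert output == {}  # the final node has no outputs, so nothing is returned
+```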
+
+## Testing best practices
+
+### Where to write your tests
+
+We recommend creating a `tests` directory within the root directory of your project. The structure should mirror the directory structure of `/src/spaceflights`:
+
+```
+src
+│   ...
+└───spaceflights
+│   └───pipelines
+│       └───data_science
+│       │   __init__.py
+│       │   nodes.py
+│       │   pipeline.py
+│
+tests
+│   ...
+└───pipelines
+│   └───data_science
+│       │   test_data_science_pipeline.py
+```
+
+
+### Using fixtures
+
+In our tests, we can see that `dummy_data` and `dummy_parameters` have been defined three times with (mostly) the same values. Instead, we can define these outside of our tests as [pytest fixtures](https://docs.pytest.org/en/6.2.x/fixture.html#fixture):
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+import pandas as pd
+import pytest
+
+
+@pytest.fixture
+def dummy_data():
+    return pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            "price": [120, 290, 30],
+        }
+    )
+
+
+@pytest.fixture
+def dummy_parameters():
+    parameters = {
+        "model_options": {
+            "test_size": 0.2,
+            "random_state": 3,
+            "features": ["engines", "passenger_capacity", "crew"],
+        }
+    }
+    return parameters
+```
+
+</details>
+
+We can then access these fixtures in a test by declaring them as arguments:
+
+```python
+def test_split_data(dummy_data, dummy_parameters):
+    ...
+```
+
+### Pipeline slicing
+
+In the test `test_data_science_pipeline` we test that the data science pipeline, as currently defined, can be run successfully. However, as pipelines are not static, this test is not robust. Instead, we should be specific about how we define the pipeline to be tested; we do this by using [pipeline slicing](../nodes_and_pipelines/slice_a_pipeline.md#slice-a-pipeline-by-running-specified-nodes) to specify the pipeline's start and end:
+
+```python
+def test_data_science_pipeline(caplog):
+    # Arrange pipeline
+    pipeline = create_ds_pipeline().from_nodes("split_data_node").to_nodes("evaluate_model_node")
+    ...
+```
+
+This ensures that the test will still perform as designed, even with the addition of more nodes to the pipeline.
+
+
+After incorporating these testing practices, our test file `test_data_science_pipeline.py` becomes:
+
+<details>
+<summary>Click to expand</summary>
+
+```python
+# tests/pipelines/test_data_science_pipeline.py
+
+import logging
+import pandas as pd
+import pytest
+
+from kedro.io import DataCatalog
+from kedro.runner import SequentialRunner
+from spaceflights.pipelines.data_science import create_pipeline as create_ds_pipeline
+from spaceflights.pipelines.data_science.nodes import split_data
+
+
+@pytest.fixture
+def dummy_data():
+    return pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            "price": [120, 290, 30],
+        }
+    )
+
+
+@pytest.fixture
+def dummy_parameters():
+    parameters = {
+        "model_options": {
+            "test_size": 0.2,
+            "random_state": 3,
+            "features": ["engines", "passenger_capacity", "crew"],
+        }
+    }
+    return parameters
+
+
+def test_split_data(dummy_data, dummy_parameters):
+    X_train, X_test, y_train, y_test = split_data(
+        dummy_data, dummy_parameters["model_options"]
+    )
+    assert len(X_train) == 2
+    assert len(y_train) == 2
+    assert len(X_test) == 1
+    assert len(y_test) == 1
+
+
+def test_split_data_missing_price(dummy_data, dummy_parameters):
+    dummy_data_missing_price = dummy_data.drop(columns="price")
+    with pytest.raises(KeyError) as e_info:
+        X_train, X_test, y_train, y_test = split_data(dummy_data_missing_price, dummy_parameters["model_options"])
+
+    assert "price" in str(e_info.value)
+
+
+def test_data_science_pipeline(caplog, dummy_data, dummy_parameters):
+    pipeline = (
+        create_ds_pipeline()
+        .from_nodes("split_data_node")
+        .to_nodes("evaluate_model_node")
+    )
+    catalog = DataCatalog()
+    catalog.add_feed_dict(
+        {
+            "model_input_table": dummy_data,
+            "params:model_options": dummy_parameters["model_options"],
+        }
+    )
+
+    caplog.set_level(logging.DEBUG, logger="kedro")
+    successful_run_msg = "Pipeline execution completed successfully."
+
+    SequentialRunner().run(pipeline, catalog)
+
+    assert successful_run_msg in caplog.text
+```
+
+</details>
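+
+If these fixtures are needed across several test files, they can be moved into a `conftest.py` file in the `tests` directory; pytest discovers fixtures defined there automatically, with no import required in the test modules. A minimal sketch, assuming the same fixtures as above:
+
+```python
+# tests/conftest.py (hypothetical) - fixtures defined here are available to
+# every test module under the tests directory without an explicit import.
+import pandas as pd
+import pytest
+
+
+@pytest.fixture
+def dummy_data():
+    return pd.DataFrame(
+        {
+            "engines": [1, 2, 3],
+            "crew": [4, 5, 6],
+            "passenger_capacity": [5, 6, 7],
+            "price": [120, 290, 30],
+        }
+    )
+```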
+
+## Run your tests
+
+First, confirm that your project has been installed locally. This can be done by navigating to the project root and running the following command:
+
+```bash
+pip install -e .
+```
+
+This step allows pytest to accurately resolve the import statements in your test files.
+
+>**NOTE**: The option `-e` installs an editable version of your project, allowing you to make changes to the project files without needing to re-install them each time.
+
+Ensure you have `pytest` installed. Please see our [automated testing documentation](../development/automated_testing.md) for more information on getting set up with pytest.
+
+To run your tests, run `pytest` from within your project's root directory.
+
+```bash
+cd <project_root>
+pytest tests/pipelines/test_data_science_pipeline.py
+```
+
+You should see the following output in your shell.
+
+```
+============================= test session starts ==============================
+...
+collected 3 items
+
+tests/pipelines/test_data_science_pipeline.py ...                        [100%]
+
+============================== 3 passed in 4.38s ===============================
+```
+
+This output indicates that all tests ran successfully in the file `tests/pipelines/test_data_science_pipeline.py`.
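+
+To see how much of your project's code the tests exercise, you can optionally use the third-party [`pytest-cov`](https://pytest-cov.readthedocs.io/) plugin. This is an addition beyond this tutorial and assumes the plugin is installed separately:
+
+```bash
+pip install pytest-cov
+pytest --cov=spaceflights tests/
+```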