Tests categories #1822
Conversation
tests/README.md
* **Functional tests:** These tests make sure that the components of the project not only run, but also that their function is correct. For example, we want to test that an ML model evaluation of RMSE gives a positive number (a sketch follows this list). These tests can be run asynchronously in the nightly builds and can take hours. In these tests, we want to use real data.
* **Integration tests:** We want to make sure that the interaction between different components is correct. For example, the interaction between data ingestion pipelines and the compute where the model is trained, or between the compute and a database. These tests can be of variable length: if they are fast, we can add them to the PR gate; otherwise, we add them to the nightly builds. For this type of test, synthetic and real data can be used.
* **Smoke tests:** The smoke tests are gates to the slow tests in the nightly builds, meant to detect quick errors. If we are running a test with a large dataset that takes 4h, we want to create a faster version of it (maybe with a small percentage of the dataset or with 1 epoch) to ensure that it runs end-to-end without obvious failures. Smoke tests can run sequentially with functional or integration tests in the nightly builds, and should be fast, ideally less than 20min. They use the same type of data as their longer counterparts.
* **Performance tests:** The performance tests measure the computation time or memory footprint of a piece of code and make sure that these are bounded within some limits. For this type of test, synthetic data can be used.
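As a concrete illustration of the first category, here is a minimal sketch of a functional test, assuming pytest; the `nightly` marker, the `rmse` helper, and the toy values are assumptions for illustration, not part of the repo's actual test suite:

```python
import math

import numpy as np
import pytest


def rmse(y_true, y_pred):
    """Root mean squared error of a set of predictions."""
    return math.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


@pytest.mark.nightly  # hypothetical marker routing this test to the nightly builds
def test_rmse_is_positive():
    # With real data these would come from an actual model evaluation;
    # toy values keep the sketch self-contained.
    y_true = [3.0, 2.5, 4.0]
    y_pred = [2.8, 2.9, 4.2]
    assert rmse(y_true, y_pred) > 0   # the functional property under test
    assert rmse(y_true, y_true) == 0  # perfect predictions give zero error
```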
Is there also going to be some sort of champion/challenger test to make sure the model is performing better than another model in prod?
This is an interesting point of view, @AbeOmor. Would this be in the realm of regression tests, where we want to measure whether different versions of a library produce different values, or are you thinking of benchmarking different models on the same dataset, as in https://github.com/microsoft/recommenders/blob/main/examples/06_benchmarks/movielens.ipynb?
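If it lands in the realm of regression tests, a hedged sketch of what such a champion/challenger check could look like (all names and values below are hypothetical placeholders):

```python
# Champion/challenger as a regression-style test: the candidate model must
# match or beat the metric recorded for the model currently in prod.
CHAMPION_RMSE = 0.95  # hypothetical metric recorded for the prod (champion) model


def evaluate_challenger():
    """Placeholder: score the candidate model on the same held-out data
    that produced the champion's recorded metric."""
    return 0.93  # hypothetical challenger metric


def test_challenger_not_worse_than_champion():
    tolerance = 0.01  # small slack so metric noise does not flake the build
    assert evaluate_challenger() <= CHAMPION_RMSE + tolerance  # lower RMSE is better
```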
The nightly build tests are executed asynchronously and can take longer. Here we include the smoke and integration tests; their objective is not only to make sure that there are no errors, but also to make sure that the machine learning solutions are doing what we expect.
* **Data validation tests:** In the data validation tests, we ensure that the schema of the input and output data of each function in the pipeline matches the desired prespecified schema (see the sketch after this list). They should be fast and can be added to the PR gate.
* **Unit tests:** In the unit tests we just make sure the Python utilities and notebooks run correctly. Unit tests are fast, ideally less than 5min, and are run in every pull request. They belong to the PR gate. For this type of test, synthetic data can be used.
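A minimal sketch of a data validation test, assuming pandas; `load_ratings` and the column names are hypothetical stand-ins for a real pipeline function and its prespecified schema:

```python
import pandas as pd


def load_ratings():
    # Placeholder for a real ingestion function; synthetic data keeps the
    # test fast enough for the PR gate.
    return pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [4.0, 3.5]})


def test_ratings_schema():
    # The prespecified schema: column names, order, and dtypes.
    expected = {"userID": "int64", "itemID": "int64", "rating": "float64"}
    df = load_ratings()
    assert list(df.columns) == list(expected)
    assert {col: str(dtype) for col, dtype in df.dtypes.items()} == expected
```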
One can argue that unit tests are the tests at the lowest (atomic) level of the code, used to isolate issues in specific parts of the code. With this definition, running notebooks (e2e) doesn't fall under unit testing. But I am not sure where to put 'test notebooks'.
Yes, good thought. I was also struggling with whether to separate the notebook unit tests or put them together with the library tests.
Looking at Wikipedia:
> In computer programming, unit testing is a software testing method by which individual units of source code—sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures—are tested to determine whether they are fit for use.
Based on this definition, if we consider a notebook as an individual unit, or as a unit with multiple modules, it could be argued that notebooks fit into unit tests. A different situation would be if, instead of just checking that the notebook runs, we also check that it returns the right information; that would be in the realm of functional tests.
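For what it's worth, a minimal sketch of the "notebook as a unit" case, assuming papermill is available to execute a notebook end-to-end; the notebook path and parameter are hypothetical:

```python
import papermill as pm


def test_notebook_runs(tmp_path):
    # Executing the notebook end-to-end only checks that it runs without
    # errors; asserting on values it produces would move the test into
    # the functional category discussed above.
    pm.execute_notebook(
        "examples/my_notebook.ipynb",               # hypothetical notebook path
        str(tmp_path / "output.ipynb"),             # executed copy written here
        parameters={"MOVIELENS_DATA_SIZE": "100k"}  # hypothetical parameter override
    )
```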
I would also want to add the following tests or checks as part of the PR / scheduled builds
@setuc really good points. We mentioned security in the RAI tests, but I'm not sure whether it would be worth creating a separate category; in fact, https://en.wikipedia.org/wiki/Software_testing lists security testing as a category of its own. Load testing is also a really good point. We will incorporate it in the text.
Feedback from Erich Gamma: Given that you already have a decent amount of tests in the Recommenders repo, I suggest that you start to reflect on what the pain points are with your current tests. For example,
Quote from the link that Erich sent:
Description
Added some of the work that @simonzhaoms and I developed for other projects.
Related Issues
Checklist:
This PR is opened against the `staging` branch and not the `main` branch.