Tests categories #1822
Conversation
tests/README.md
* **Functional tests:** These tests make sure that the components of the project not only run, but also that their function is correct. For example, we want to test that an ML model evaluation of RMSE gives a positive number (a sketch follows this list). These tests can be run asynchronously in the nightly builds and can take hours. In these tests, we want to use real data.
* **Integration tests:** We want to make sure that the interaction between different components is correct. For example, the interaction between data ingestion pipelines and the compute where the model is trained, or between the compute and a database. These tests can be of variable length: if they are fast, we can add them to the PR gate; otherwise, we add them to the nightly builds. For this type of test, synthetic and real data can be used.
* **Smoke tests:** The smoke tests are gates to the slow tests in the nightly builds, meant to detect quick errors. If we are running a test with a large dataset that takes 4h, we want to create a faster version of it (maybe with a small percentage of the dataset or with 1 epoch) to ensure that it runs end-to-end without obvious failures. Smoke tests can run sequentially with functional or integration tests in the nightly builds, and should be fast, ideally less than 20min. They use the same type of data as their longer counterparts.
* **Performance tests:** The performance tests measure the computation time or memory footprint of a piece of code and make sure that these are bounded within some limits. For this type of test, synthetic data can be used.
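As a concrete illustration of the first category, here is a minimal sketch of a functional test, assuming pytest; the `nightly` marker, the `rmse` helper, and the toy values are assumptions for illustration, not part of the repo's actual test suite:

```python
import math

import numpy as np
import pytest


def rmse(y_true, y_pred):
    """Root mean squared error of a set of predictions."""
    return math.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


@pytest.mark.nightly  # hypothetical marker routing this test to the nightly builds
def test_rmse_is_positive():
    # With real data these would come from an actual model evaluation;
    # toy values keep the sketch self-contained.
    y_true = [3.0, 2.5, 4.0]
    y_pred = [2.8, 2.9, 4.2]
    assert rmse(y_true, y_pred) > 0   # the functional property under test
    assert rmse(y_true, y_true) == 0  # perfect predictions give zero error
```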
Is there also going to be some sort of champion/challenger test to make sure the model is performing better than another model in prod?
This is an interesting point of view, @AbeOmor. Would this be in the realm of regression tests, where we want to measure whether different versions of a library produce different values, or are you thinking of benchmarking different models on the same dataset, as in https://github.com/microsoft/recommenders/blob/main/examples/06_benchmarks/movielens.ipynb?
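If it lands in the realm of regression tests, a hedged sketch of what such a champion/challenger check could look like (all names and values below are hypothetical placeholders):

```python
# Champion/challenger as a regression-style test: the candidate model must
# match or beat the metric recorded for the model currently in prod.
CHAMPION_RMSE = 0.95  # hypothetical metric recorded for the prod (champion) model


def evaluate_challenger():
    """Placeholder: score the candidate model on the same held-out data
    that produced the champion's recorded metric."""
    return 0.93  # hypothetical challenger metric


def test_challenger_not_worse_than_champion():
    tolerance = 0.01  # small slack so metric noise does not flake the build
    assert evaluate_challenger() <= CHAMPION_RMSE + tolerance  # lower RMSE is better
```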
The nightly build tests are executed asynchronously and can take longer. Here we include the smoke and integration tests; their objective is not only to make sure that there are no errors, but also to make sure that the machine learning solutions are doing what we expect.
* **Data validation tests:** In the data validation tests, we ensure that the schema of the input and output data of each function in the pipeline matches the desired prespecified schema (see the sketch after this list). They should be fast and can be added to the PR gate.
* **Unit tests:** In the unit tests we just make sure the Python utilities and notebooks run correctly. Unit tests are fast, ideally less than 5min, and are run in every pull request. They belong to the PR gate. For this type of test, synthetic data can be used.
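A minimal sketch of a data validation test, assuming pandas; `load_ratings` and the column names are hypothetical stand-ins for a real pipeline function and its prespecified schema:

```python
import pandas as pd


def load_ratings():
    # Placeholder for a real ingestion function; synthetic data keeps the
    # test fast enough for the PR gate.
    return pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [4.0, 3.5]})


def test_ratings_schema():
    # The prespecified schema: column names, order, and dtypes.
    expected = {"userID": "int64", "itemID": "int64", "rating": "float64"}
    df = load_ratings()
    assert list(df.columns) == list(expected)
    assert {col: str(dtype) for col, dtype in df.dtypes.items()} == expected
```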
One can argue that unit tests are the tests at the lowest (atomic) level of the code, used to isolate issues in specific parts of the code. With this definition, running notebooks (e2e) doesn't fall under unit testing. But I am not sure where to put 'test notebooks'.
Yes, good thought. I was also struggling with whether to separate the notebook unit tests or put them together with the library tests.
Looking at Wikipedia:
> In computer programming, unit testing is a software testing method by which individual units of source code—sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures—are tested to determine whether they are fit for use.
Based on this definition, if we consider a notebook as an individual unit, or as a unit with multiple modules, it could be argued that notebooks fit into unit tests. A different situation would be if, instead of just checking that the notebook runs, we also check that it returns the right information; that would be in the realm of functional tests.
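For what it's worth, a minimal sketch of the "notebook as a unit" case, assuming papermill is available to execute a notebook end-to-end; the notebook path and parameter are hypothetical:

```python
import papermill as pm


def test_notebook_runs(tmp_path):
    # Executing the notebook end-to-end only checks that it runs without
    # errors; asserting on values it produces would move the test into
    # the functional category discussed above.
    pm.execute_notebook(
        "examples/my_notebook.ipynb",               # hypothetical notebook path
        str(tmp_path / "output.ipynb"),             # executed copy written here
        parameters={"MOVIELENS_DATA_SIZE": "100k"}  # hypothetical parameter override
    )
```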
I would also want to add the following tests or checks as part of the PR / scheduled builds
@setuc really good points. We mentioned security in the RAI tests, but I'm not sure whether it would be worth creating a separate category; in fact, https://en.wikipedia.org/wiki/Software_testing lists security testing as a category of its own. Load testing is also a really good point. We will incorporate it in the text.
Feedback from Erich Gamma: Given that you already have a decent amount of tests in the Recommenders repo, I suggest that you start to reflect on what the pain points are with your current tests. For example,
Quote from the link that Erich sent:
Description
Added some of the work that @simonzhaoms and I developed for other projects.
Related Issues
Checklist:
This PR is opened against the `staging` branch and not the `main` branch.