[ci]
Simplify tests, add CI, patch paraphrase_mining_embeddings
#2350
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hello!
Pull Request overview
paraphrase_mining_embeddings
such thati < j
is always true withi
andj
as paraphrase indices.Details
Tests
Before CI can be applied, we need to simplify the tests. In essence, this involved first verifying that the there is no regression on
master
by running the current (large) tests. Then, I decreased the number of testing samples to 100 across all tests, and the number of training samples down to 100-500, depending on the test setup.I then updated the expected scores across all pretrained models. Ideally, I think it would be best to test less of the pretrained models, as it allows us to actually cache the downloaded models, but alas, I think this should be fine too: the GitHub CI environment can usually download at ~250MB/s.
In a later PR I intend to improve the test suite in various ways. For example:
Test time seems to be between 6 and 15 minutes. The test time might improve further when #2345 is merged, in case we currently download unnecessary model files (e.g.
model.safetensors
andpytorch_model.bin
).CI
I've added a simple continuous integration setup. CI is extremely important for large OS projects, and it should help simplify PRs etc. a ton. The CI caches Python dependencies, but not HF models. The reason is that the test suite simply uses too many large models that they cannot all be cached across all OS-es.
After v2.3.0 I intend to deprecate Python 3.6 and 3.7. I would have added Python 3.6 to the CI, but it did not work: The Ubuntu 22.04 GitHub CI image doesn't have Python 3.6 anymore, and for Windows it isn't possible to install a recent enough
tokenizers
version from scratch.Although 3.12 has been out for quite some time, the
torch
support is still lacking, so that's why it was not included in the CI.Patch
paraphrase_mining_embeddings
When running the CI, I learned that for Python 3.10 and 3.11, on Linux only, the
paraphrase_mining_embeddings
function could return(score, i, j)
triples wherei > j
. This does not occur in other versions. I applied a simple patch to ensure thati < j
always holds, allowing the matching test to pass.Run the (fast) tests with
pytest
, and the slow tests withpytest -m slow
.