Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc. fixes for Pytorch QA examples: #16958

Merged
merged 1 commit into from
Apr 27, 2022

Conversation

searchivarius
Copy link
Contributor

@searchivarius searchivarius commented Apr 27, 2022

Thank you for the great library! This fixes a number of issues with Pytorch QA examples. All numbers are either the same or went up. However, there are still some issues, which I wasn't able to fix (in one example). Please, see the notes and benchmark results below.

What does this PR do?

  1. Fixes evaluation errors popping up when you train/eval on squad v2 (one was newly encountered and one that was previously reported Running SQuAD 1.0 sample command raises IndexError Running SQuAD 1.0 sample command raises IndexError #15401 but not completely fixed).
  2. Removes boolean arguments that don't use store_true. Please, don't use these: ANY non-empty string is being converted to True in this case. This is clearly an undesiredbehavior, which creates a LOT of confusion.
  3. All no-trainer test scripts are now saving metric values in the same way (with the right prefix eval_), which is consistent with the trainer-based versions.
  4. Adds forgotten model.eval() in the no-trainer versions. This improved some results, but not everything (see the discussion in the end). Please, see the F1 scores and the discussion below.
  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case. This is a reduced PR as discussed here.
  • You make sure to update the documentation with your changes? I believe examples aren't covered by the documentation
  • Did you write any new necessary tests? I trained squad and squad v2 models and compared results (see the discussion below), but I am not sure if running more QA tests automatically will be feasible. Do note that the existing "unit-test" is very crude and does not permit detecting small regressions in model quality.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Perhaps, this can be of most interest for @sgugger, who reviewed a prior version of this PR.

Comparing old and new performance + some potential issues

Some remaining issues:

  1. Despite the fixes & improvements, there's still a discrepancy between no-trainer and original version for SQuAD v2 or the beam-search version.
  2. In particular, for SQuAD v2 and the beam-search variant without trainer, both old and new numbers look very wrong to me.

Please note that to be able to run SQuAD v2 tests, I had to apply utils_qa.py fixes to the old code as well. Otherwise, it would have just failed:

The metric is F1, the exact scores have the same pattern:

previous new
squad v1 88.4 88.4
squad v1 (no trainer) 86.7 88.5
squad v2 N/A 75.2
squad v2 (no trainer) N/A 77.1
squad v1 (beam search) 92.1 92.1
squad v1 (beam search no trainer) 90.2 91.0
squad v2 (beam search) 83.2 83.2
squad v2 (beam search no trainer) 4.9 50.1

1. Fixes evaluation errors popping up when you train/eval on squad v2 (one was newly encountered and one that was previously reported Running SQuAD 1.0 sample command raises IndexError huggingface#15401 but not completely fixed).
2. Removes boolean arguments that don't use store_true. Please, don't use these: *ANY non-empty string is being converted to True in this case and this clearly is not the desired behavior (and it creates a LOT of confusion).
3. All no-trainer test scripts are now saving metric values in the same way (with the right prefix eval_), which is consistent with the trainer-based versions.
4. Adds forgotten model.eval() in the no-trainer versions. This improved some results, but not everything (see the discussion in the end). Please, see the F1 scores and the discussion below.
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Apr 27, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for all the fixes!
cc @patil-suraj (Flax) and @gante @Rocketknight1 (TF) for information (it shouldn't break any of your examples, it's just propagating a fix :-) )

@sgugger sgugger merged commit c82e017 into huggingface:main Apr 27, 2022
chamidullinr pushed a commit to chamidullinr/transformers that referenced this pull request Apr 28, 2022
1. Fixes evaluation errors popping up when you train/eval on squad v2 (one was newly encountered and one that was previously reported Running SQuAD 1.0 sample command raises IndexError huggingface#15401 but not completely fixed).
2. Removes boolean arguments that don't use store_true. Please, don't use these: *ANY non-empty string is being converted to True in this case and this clearly is not the desired behavior (and it creates a LOT of confusion).
3. All no-trainer test scripts are now saving metric values in the same way (with the right prefix eval_), which is consistent with the trainer-based versions.
4. Adds forgotten model.eval() in the no-trainer versions. This improved some results, but not everything (see the discussion in the end). Please, see the F1 scores and the discussion below.
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
1. Fixes evaluation errors popping up when you train/eval on squad v2 (one was newly encountered and one that was previously reported Running SQuAD 1.0 sample command raises IndexError huggingface#15401 but not completely fixed).
2. Removes boolean arguments that don't use store_true. Please, don't use these: *ANY non-empty string is being converted to True in this case and this clearly is not the desired behavior (and it creates a LOT of confusion).
3. All no-trainer test scripts are now saving metric values in the same way (with the right prefix eval_), which is consistent with the trainer-based versions.
4. Adds forgotten model.eval() in the no-trainer versions. This improved some results, but not everything (see the discussion in the end). Please, see the F1 scores and the discussion below.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants