
Fix low recall when limit_val_batches is set #1298

Open

vickysharma-prog wants to merge 6 commits into weecology:main from vickysharma-prog:fix-limit-batches-recall-1232

Conversation

@vickysharma-prog

Description

When limit_val_batches is set (e.g., 0.1 for 10%), evaluation loads the full ground truth CSV but predictions only cover the limited images. This makes recall look very low because of "missing" predictions for images that were never processed.

Added a check in __evaluate__ that trims ground_df based on the limit_val_batches value, keeping ceil(limit_val_batches * n_images) images as suggested in the issue. A rough sketch of the change is below.
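This is roughly the shape of the change (a minimal sketch; variable names and the exact placement inside __evaluate__ are illustrative):

```python
import math

# Sketch only: keep ground truth for the first ceil(limit_val_batches * n_images)
# images, so recall is computed only against images that were actually validated.
if isinstance(limit_val_batches, float) and 0 < limit_val_batches < 1.0:
    image_paths = ground_df["image_path"].unique()
    n_keep = math.ceil(limit_val_batches * len(image_paths))
    ground_df = ground_df[ground_df["image_path"].isin(image_paths[:n_keep])]
```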

Also added a test case to verify the fix.

Related Issue(s)

Fixes #1232

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):
Used for initial research and understanding the codebase structure

Collaborator

@jveitchmichaelis left a comment

Thanks for your contribution; here are some comments:

  • Please scope the PR to only the issue (remove the .gitignore changes; we could include those in a separate submission).
  • Please test with non-deprecated eval calls. Use m.create_trainer() with limit batches set as an argument and then call m.trainer.validate(m) or m.trainer.fit(m). You may need to set the validation interval to 1. This would better reflect a training scenario (a sketch follows after this list).
  • As above, this code should work with .validate(); __evaluate__ is not called during training.
  • The code in main is a little defensive. The trainer is created on init, so it is almost impossible to reach this function without self.trainer existing.
  • The test case does not adequately check the behavior. For example, you have asserted non-negative recall, but that does not prove the recall accurately reflects the limited dataframe.
  • Please remove the LLM-inserted "fixes" comment and issue number from main (L1023); this is unnecessary.
  • Is this code correct in multi-GPU environments? I don't think my suggestion in the issue is correct in this case.
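Here's roughly what I mean for the second point above (a minimal sketch; the config keys and paths are illustrative and may differ by version, and I'm assuming create_trainer() forwards keyword arguments to the Lightning Trainer):

```python
from deepforest import main

m = main.deepforest()
# Hypothetical validation data; point these at the fixtures the tests already use.
m.config["validation"]["csv_file"] = "annotations.csv"
m.config["validation"]["root_dir"] = "images/"
m.config["validation"]["val_accuracy_interval"] = 1  # key name is an assumption

# limit_val_batches is passed through to the Lightning Trainer.
m.create_trainer(limit_val_batches=0.1)
results = m.trainer.validate(m)
```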

It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on, which is more reliable in multi-GPU.
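Inside the metric, that would look something like this (attribute names are illustrative, not the actual implementation):

```python
# Sketch only: restrict the ground truth to images the metric actually saw,
# which holds regardless of how Lightning limited or sharded the batches.
seen = set(self.image_paths)  # image paths accumulated during update()
ground_df = ground_df[ground_df["image_path"].isin(seen)]
```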

@vickysharma-prog
Author

Thanks for the detailed feedback @jveitchmichaelis! Really appreciate the thorough review.
Quick acknowledgments:

  1. ✅ Will remove .gitignore changes
  2. ✅ Will remove the "fixes #1232" (Evaluation reports spuriously low recall if limit_batches is set) comment from the code
  3. ✅ Will make the code less defensive

Regarding the architectural suggestion:

> It's probably best for this logic to go in the RecallPrecision metric. You can filter ground_df by the image_paths that the metric was called on which is more reliable in multi-GPU.

This makes sense - filtering at the metric level based on the image_paths that were actually predicted would be more reliable than calculating a cutoff from limit_val_batches. Could you point me to where the RecallPrecision metric is defined so I can refactor the fix there?
For the test - I'll update it to use create_trainer() with limit_val_batches and call trainer.validate() instead of the deprecated __evaluate__ method.
Let me know if I'm understanding correctly!

@vickysharma-prog
Author

Pushed the updates:

  • Removed .gitignore changes
  • Updated comment in main.py
  • Updated test to use create_trainer() + trainer.validate()

Note: the ReadTheDocs build seems to be failing on dependency install (uv sync --extra docs) - this appears unrelated to my changes. Let me know if I need to do anything on my end.

Still working on understanding where the RecallPrecision metric lives for the architectural refactor you suggested.

@vickysharma-prog
Author

Just pushed another commit - removed the defensive checks (hasattr and getattr), since the trainer always exists.

Current changes:

  • ✅ Removed .gitignore changes
  • ✅ Removed issue reference from code comment
  • ✅ Made code less defensive
  • ✅ Updated test to use create_trainer() + trainer.validate()

Regarding the RecallPrecision metric refactor - I searched the codebase and found iou_metric and mAP_metric (from torchmetrics), but couldn't find a custom RecallPrecision metric, so I'm still working out where to apply the refactor you suggested.

(The ReadTheDocs failure seems to be a dependency issue unrelated to my changes)

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Feb 5, 2026

> Still working on understanding the RecallPrecision metric location for the architectural refactor you suggested.

Please update your main branch + rebase this one

@vickysharma-prog force-pushed the fix-limit-batches-recall-1232 branch from 3d832f8 to a4d9fe9 on February 5, 2026 at 14:08
@vickysharma-prog
Author

Rebased on the latest main; all checks are passing now!
Let me know if there's anything else to address.

@vickysharma-prog
Author

I found the RecallPrecision logic in metrics.py.
From what I can see, the filtering could live in the metric itself by restricting ground_df to the image_paths actually seen by the metric before calling __evaluate_wrapper__. I was thinking this would happen in compute(), but I wanted to double-check that this aligns with the intended flow (vs. doing it earlier in update()).
Does this sound like the right place to apply the fix?

@jveitchmichaelis
Collaborator

  • Please review your test case. Hint: can you demonstrate this would fail before your fix? (A rough sketch of what such a test could check follows below.)
  • You cannot handle this in update(), as the underlying issue happens when __evaluate_wrapper__ is called on the full ground truth dataframe.
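For example, something along these lines (the fixture, metric key, and tolerance are assumptions, not a drop-in test):

```python
import pytest

def test_limit_val_batches_recall(m):  # m: a configured deepforest model fixture
    # Baseline: validate on the full dataset.
    m.create_trainer(limit_val_batches=1.0)
    full = m.trainer.validate(m)[0]

    # Limited run: only half of the validation batches are seen.
    m.create_trainer(limit_val_batches=0.5)
    limited = m.trainer.validate(m)[0]

    # Before the fix, recall in the limited run drops sharply because unseen
    # images count as missed detections; after the fix it stays comparable.
    assert limited["box_recall"] == pytest.approx(full["box_recall"], abs=0.1)
```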

@vickysharma-prog
Author

pre-commit.ci autofix

@vickysharma-prog
Author

Moved the fix to RecallPrecision.compute() in metrics.py as suggested - it now filters ground_df to include only images that were actually predicted, before calling __evaluate_wrapper__.
Removed the old fix from main.py and updated the test.
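Roughly, the change looks like this (a sketch, not the exact diff - the attribute names and the wrapper's keyword arguments are assumptions based on this thread):

```python
import pandas as pd

# Sketch of RecallPrecision.compute() in metrics.py.
def compute(self):
    predictions = pd.concat(self.predictions)  # per-batch predictions from update()

    # Keep only ground truth for images that were actually predicted, so
    # limit_val_batches no longer turns unseen images into missed detections.
    seen = predictions["image_path"].unique()
    ground_df = self.ground_df[self.ground_df["image_path"].isin(seen)]

    return __evaluate_wrapper__(
        predictions=predictions,
        ground_df=ground_df,
        iou_threshold=self.iou_threshold,
    )
```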
