
Stepwise LR scheduler #20211

Open · wants to merge 38 commits into base: master

Conversation

@01AbhiSingh (Contributor) commented Aug 18, 2024

What does this PR do?

Fixes #17544

Hi @awaelchli, can you please verify the changes I made? If they are correct, I will also take up and correct any failing tests.
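
For context, a minimal sketch (module and hyperparameters are illustrative, assuming the behavior discussed in #17544, not code from this PR) of a step-interval scheduler whose frequency does not divide the number of batches per epoch, so stepping must carry across epoch boundaries:

import torch
import torch.nn.functional as F
from lightning.pytorch import LightningModule


class StepwiseSchedulerModule(LightningModule):
    """Illustrative module: scheduler steps every 5 optimizer steps."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # step per optimizer step, not per epoch
                "frequency": 5,  # with e.g. 7 batches/epoch, this crosses epoch boundaries
            },
        }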

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:

Reviewer checklist
- [ ] Is this pull request ready for review? (if not, please submit in draft mode)
- [ ] Check that all items from **Before submitting** are resolved
- [ ] Make sure the title is self-explanatory and the description concisely explains the PR
- [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20211.org.readthedocs.build/en/20211/

@github-actions bot added the pl (Generic label for PyTorch Lightning package) label Aug 18, 2024
codecov bot commented Aug 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (ea59e40) to head (337c1c2).

❗ The number of reports uploaded differs between BASE (ea59e40) and HEAD (337c1c2).

HEAD has 102 fewer uploads than BASE:

| Flag | BASE (ea59e40) | HEAD (337c1c2) |
| --- | --- | --- |
| cpu | 48 | 24 |
| lightning_fabric | 7 | 0 |
| pytest | 26 | 0 |
| python3.9 | 12 | 6 |
| lightning | 37 | 18 |
| python3.10 | 6 | 3 |
| python3.11 | 12 | 6 |
| python3.12.7 | 18 | 9 |
| gpu | 2 | 0 |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20211     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      264      -3     
  Lines       23380    23325     -55     
=========================================
- Hits        20481    18366   -2115     
- Misses       2899     4959   +2060     

@01AbhiSingh (Contributor, Author)

Hi @Borda, do I need to make any changes to the PR?

@lantiga (Collaborator) commented Oct 7, 2024

This looks good, thank you for the contribution @01AbhiSingh

Ideally we could add a test to verify the behavior described in #17544. The current test suite doesn't detect this change, which is usually a sign of insufficient coverage. Would you be willing to contribute such a test?

@01AbhiSingh (Contributor, Author)

Yes, sure let me look into it.

@01AbhiSingh (Contributor, Author)

Hi @lantiga, do you want a new test written from scratch, or should I make the changes in a preexisting file? All the tests currently pass, so I can't tell which existing test would need to change; if the changes belong in a preexisting file, it would be very helpful if you could point out the test I need to modify.

@lantiga (Collaborator) commented Nov 12, 2024

Hey @01AbhiSingh, sorry for the wait.

You can take inspiration from:

def test_lr_scheduler_epoch_step_frequency(mocked_sched, check_val_every_n_epoch, tmp_path):

and add a new test where scheduling goes across epoch boundaries. Maybe @falckt can help too?
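
As a hedged sketch, such a cross-epoch test could look like this (StepwiseSchedulerModule is the illustrative module sketched earlier in this thread, not repository code; the exact count assumes recent PyTorch, whose scheduler constructor itself calls step() once, and the fixed stepping behavior):

from unittest import mock

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import Trainer


def test_lr_scheduler_step_across_epoch_boundaries(tmp_path):
    # 21 samples with batch_size=3 -> 7 optimizer steps per epoch
    x, y = torch.randn(21, 32), torch.randn(21, 2)
    train_loader = DataLoader(TensorDataset(x, y), batch_size=3)
    model = StepwiseSchedulerModule()

    with mock.patch("torch.optim.lr_scheduler.StepLR.step") as mocked_sched:
        trainer = Trainer(
            default_root_dir=tmp_path,
            max_epochs=2,
            logger=False,
            enable_progress_bar=False,
        )
        trainer.fit(model, train_loader)

    # 14 optimizer steps with frequency=5 -> the scheduler fires at global
    # steps 5 and 10, i.e. once inside epoch 0 and once past the epoch
    # boundary in epoch 1, plus the one call StepLR.__init__ makes itself
    assert mocked_sched.call_count == 2 + 1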

@01AbhiSingh (Contributor, Author)

Done, please check.

@lantiga (Collaborator) commented Dec 11, 2024

Hey @01AbhiSingh can you import LightningModule here?

https://github.com/Lightning-AI/pytorch-lightning/pull/20211/files#diff-3c3f104dbdd06271c9e6e6d4fdf61398458148412401dd55a9bac1e9b5f913a8R19

Change:

from lightning.pytorch import Trainer

to

from lightning.pytorch import Trainer, LightningModule

this should fix the failing test

@01AbhiSingh (Contributor, Author)

Yeah, my bad. I forgot to add it even after seeing it. Done, please check.

@01AbhiSingh (Contributor, Author)

https://github.com/Lightning-AI/pytorch-lightning/actions/runs/12291356552/job/34299991507?pr=20211#:~:text=FAILED%20utilities/test_data.py%3A%3Atest_update_dataloader_typerror_custom_exception%20%2D%20AssertionError%3A%20Regex%20pattern%20did%20not%20match.

This is the test that is currently failing.

def train_dataloader(self):
    # Create a simple dataset for testing: 21 samples -> 7 batches of size 3
    x = torch.randn(21, 32)
    y = torch.randn(21, 2)
    return DataLoader(TensorDataset(x, y), batch_size=3)

Should I add this and try to run the test again?

@lantiga (Collaborator) commented Dec 12, 2024

Go for it : )

You can also run this kind of test locally with `pytest tests/tests_pytorch/<test_file>.py::<name_of_test>` to make things quicker on your end. This test in particular can be run on any machine (and you can use Lightning Studios for free if you want to run on GPUs, of course).

@01AbhiSingh (Contributor, Author) commented Dec 12, 2024

> Go for it : )
>
> You can also run this kind of test locally with `pytest tests/tests_pytorch/<test_file>.py::<name_of_test>` to make things quicker on your end. This test in particular can be run on any machine (and you can use Lightning Studios for free if you want to run on GPUs, of course).

I actually tried to run the test locally with the method you suggested, but this error keeps showing up: `ERROR: file or directory not found: tests/tests_pytorch/test_optimizers.py`. Anyway, I am trying to solve this problem in my local env.

Edit: I've solved this problem and will update the PR only once it's running perfectly in my local environment. Thanks :)

Another edit 😝: updated the PR, please check.

@01AbhiSingh (Contributor, Author)

The test passes in my local environment but not in the CI for the PR in the repo.

@mergify bot added the `has conflicts` label Feb 3, 2025
@01AbhiSingh (Contributor, Author)

I think this time it is all done. Can you please check once? @lantiga

@lantiga (Collaborator) left a review comment

Looks good, added a couple of comments

trainer.fit(model)

# Debug print statements
print(f"Mocked scheduler step calls: {mocked_sched.call_count}")
@lantiga (Collaborator):

Please remove the debug statements; I'd convert them to asserts that compare the values with the expected ones.
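
For example, a possible conversion (a sketch, keeping the `expected_steps` value the test already computes):

# Instead of printing the call count, assert it against the expected value
assert mocked_sched.call_count == expected_steps, (
    f"expected {expected_steps} scheduler step calls, got {mocked_sched.call_count}"
)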

def training_step(self, batch, batch_idx):
    # Add print statement to track batch index and global step
    if hasattr(self, 'trainer'):
        print(f"Batch idx: {batch_idx}, Global step: {self.trainer.global_step}")
@lantiga (Collaborator):

Print statements in tests are not super helpful; just use asserts so the test will break if we don't get the expected value here.
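
A hedged sketch of that conversion (assumes automatic optimization with one optimizer step per batch and the 7-batches-per-epoch dataloader used in this test; `F` is `torch.nn.functional` and `self.layer` is from the illustrative module above):

def training_step(self, batch, batch_idx):
    # global_step counts completed optimizer steps, so before this batch's
    # optimizer step it should equal current_epoch * batches_per_epoch + batch_idx
    assert self.trainer.global_step == self.current_epoch * 7 + batch_idx
    x, y = batch
    return F.mse_loss(self.layer(x), y)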


# Assert that the scheduler was called the expected number of times
# Allow for a small difference due to environment or rounding discrepancies
assert abs(mocked_sched.call_count - expected_steps) <= 1, (
@lantiga (Collaborator):

I'm not sure why there should be rounding discrepancies. Shouldn't this be fully deterministic?
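
For reference, with the numbers used in this thread the expectation is fully deterministic (a sketch; any extra call from the scheduler constructor would be a fixed offset, not a rounding error):

batches_per_epoch, max_epochs, frequency = 7, 2, 5
total_optimizer_steps = batches_per_epoch * max_epochs  # 14
expected_steps = total_optimizer_steps // frequency  # fires at steps 5 and 10 -> 2
assert expected_steps == 2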

@01AbhiSingh (Contributor, Author):

Actually, the test was passing in my local environment but not in the CI/CD pipeline for some reason, and I forgot to change it afterwards. Let me correct it ASAP.

@mergify bot removed the `has conflicts` label Feb 3, 2025
Labels
pl (Generic label for PyTorch Lightning package) · waiting on author (Waiting on user action, correction, or update)
3 participants