
[fix] make sure neptune saves the checkpoint #2900

Merged: 3 commits, Mar 27, 2023

Conversation

kshitij12345
Contributor

Neptune logger doesn't actually save the checkpoint.

Simple reproducer:

import torch
import tempfile
import neptune
import time


class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8192, 8192)

    def forward(self, x):
        return self.linear(x)


model = Net()

run = neptune.init_run()

start = time.time()
with tempfile.NamedTemporaryFile() as tmp:
    torch.save(model.state_dict(), tmp.file)
    run['model'].upload(tmp.name)  # Fails: upload is asynchronous, and the temp file is deleted as soon as the `with` block exits.

print(time.time() - start)


start = time.time()
with tempfile.NamedTemporaryFile() as tmp:
    torch.save(model.state_dict(), tmp.file)
    # Uploads in chunks.
    run['model_stream'].upload(neptune.types.File.from_stream(tmp.file))  # Works!

print(time.time() - start)
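
The failure mode can also be illustrated without a Neptune account. The following is a minimal sketch, not Neptune's actual implementation: `upload_by_path_later` and `upload_from_stream` are hypothetical stand-ins, assuming that a path-based upload reads the file at some later point on a background thread, while a stream-based upload copies the bytes at call time.

```python
import tempfile
import threading


def upload_by_path_later(path, gate, results):
    """Hypothetical async uploader: a worker opens `path` only after `gate` is set."""

    def worker():
        gate.wait()  # simulates the upload queue being processed later
        try:
            with open(path, "rb") as f:
                results["path"] = f.read()
        except FileNotFoundError:
            results["path"] = None

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def upload_from_stream(stream, results):
    """Hypothetical stream uploader: copies the bytes while the file still exists."""
    stream.seek(0)
    results["stream"] = stream.read()


results = {}
gate = threading.Event()
payload = b"checkpoint-bytes"

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(payload)
    tmp.flush()
    thread = upload_by_path_later(tmp.name, gate, results)
    upload_from_stream(tmp.file, results)

# The temporary file has been deleted by now; only then does the worker run.
gate.set()
thread.join()

print(results["path"])    # None
print(results["stream"])  # b'checkpoint-bytes'
```

The path-based variant finds nothing to read because the file is removed when the `with` block exits, whereas the stream-based variant has already captured the content.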

Description:

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions bot added the module: contrib Contrib module label Mar 23, 2023
Collaborator

@vfdev-5 vfdev-5 left a comment

Thanks @kshitij12345 !

Can we add a unit test for that somehow?

@kshitij12345
Contributor Author

> Can we add a unit test for that somehow?

I think this can only be verified by actually executing a real run and checking that the checkpoint was uploaded correctly. Right now, there is a mock test that verifies that upload was called, but that doesn't guarantee that the upload was successful.
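
To make that limitation concrete, here is a minimal sketch, assuming a hypothetical handler `save_checkpoint_by_path` (not the actual ignite code) and a `MagicMock` in place of a Neptune run. The mock assertion passes even though the file handed to `upload` no longer exists afterwards.

```python
import os
import tempfile
from unittest.mock import MagicMock


def save_checkpoint_by_path(run, payload):
    """Hypothetical handler: writes a temp file and passes its path to upload."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        run["model"].upload(tmp.name)
        return tmp.name


run = MagicMock()  # stands in for a Neptune run object
path = save_checkpoint_by_path(run, b"checkpoint-bytes")

# The mock-based check succeeds...
run["model"].upload.assert_called_once_with(path)

# ...even though the file is already gone, so a deferred upload would have nothing to read.
print(os.path.exists(path))  # False
```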

@vfdev-5
Collaborator

vfdev-5 commented Mar 24, 2023

> > Can we add a unit test for that somehow?
>
> I think this can only be verified by actually executing a real run and checking that the checkpoint was uploaded correctly. Right now, there is a mock test that verifies that upload was called, but that doesn't guarantee that the upload was successful.

I see. I think this is enough: as long as we use the Neptune API correctly, it is up to Neptune to check that the API works correctly and uploads artifacts to the server.

@kshitij12345 kshitij12345 marked this pull request as ready for review March 24, 2023 12:01
@vfdev-5
Collaborator

vfdev-5 commented Mar 24, 2023

@kshitij12345 Kshiteej, can you please run the following to fix the code formatting issue:

bash tests/run_code_style.sh install
bash tests/run_code_style.sh fmt

@kshitij12345
Contributor Author

Ah, sorry. I missed that. Updated now. Thanks!

@vfdev-5 vfdev-5 merged commit b48825b into pytorch:master Mar 27, 2023