Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFS-safe Dependency Manager #3861

Merged
merged 75 commits into from
Apr 4, 2022
Merged

NFS-safe Dependency Manager #3861

merged 75 commits into from
Apr 4, 2022

Conversation

teetone
Copy link
Collaborator

@teetone teetone commented Oct 26, 2021

Reasons for making this change

Use flufl.lock to create an NFS-safe Dependency Manager.

Related issues

Resolves #3710

Checklist

  • I've added a screenshot of the changes, if this is a frontend change
  • I've added and/or updated tests, if this is a backend change
  • I've run the pre-commit.sh script
  • I've updated docs, if needed

@teetone teetone changed the title NFS locking for dependency manager NFS-safe Dependency Manager Nov 23, 2021
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
@@ -53,6 +98,9 @@ class DependencyManager(StateTransitioner, BaseDependencyManager):
# the data format of how we store this)
MAX_SERIALIZED_LEN = 60000

# If it has been this long since a worker has downloaded anything, another worker will take over downloading.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do additional testing:

  • run bundles on workers A, B, C that all share cache
  • force kill worker A while it's downloading a dependency
  • see how long it takes the dependency to be re-downloaded by B or C

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you do this yet @teetone ?

codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
codalab/worker/worker_run_state.py Show resolved Hide resolved
codalab/worker/worker_run_state.py Show resolved Hide resolved
codalab/worker/worker_run_state.py Outdated Show resolved Hide resolved
codalab/worker_manager/slurm_batch_worker_manager.py Outdated Show resolved Hide resolved
docs/Worker-Managers.md Show resolved Hide resolved
codalab/worker/dependency_manager.py Outdated Show resolved Hide resolved
@@ -53,6 +98,9 @@ class DependencyManager(StateTransitioner, BaseDependencyManager):
# the data format of how we store this)
MAX_SERIALIZED_LEN = 60000

# If it has been this long since a worker has downloaded anything, another worker will take over downloading.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you do this yet @teetone ?

codalab/worker/dependency_manager.py Show resolved Hide resolved
codalab/worker/worker_run_state.py Show resolved Hide resolved
codalab/worker/worker_run_state.py Show resolved Hide resolved
@epicfaace
Copy link
Member

@teetone Let me know once this is ready to merge

@mergify mergify bot merged commit 2e5479d into master Apr 4, 2022
@mergify mergify bot deleted the flufl branch April 4, 2022 22:03
@epicfaace epicfaace mentioned this pull request Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slurm: different workers on the same VM don't share the same cache
2 participants