-
Notifications
You must be signed in to change notification settings - Fork 42
[WIP] A minimal implementation of staleness control #382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @yanxi-chen, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a crucial staleness control mechanism to the queue buffer system. Its primary purpose is to mitigate issues arising from a mismatch in processing speeds between the trainer and explorer, which can lead to an accumulation of stale samples and potential performance degradation. By tracking the latest model version and filtering out excessively old samples, this change ensures that the training process consistently receives fresh and relevant data, thereby enhancing stability and model performance. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a mechanism for staleness control in the queue-based storage, which is a valuable addition to prevent training on outdated samples. The implementation adds a staleness_limit config and tracks the max_model_version to filter out stale experiences. My review focuses on potential efficiency improvements. I've pointed out a micro-optimization in updating max_model_version and a more significant performance concern related to how stale samples are handled with priority queues that have sample reuse enabled, which could lead to inefficiencies and timeouts.
| if (self.staleness_limit is not None) and (self.staleness_limit > 0): | ||
| exp_list = [ | ||
| exp | ||
| for exp in exp_list | ||
| if exp.info["model_version"] | ||
| >= self.max_model_version - self.staleness_limit | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation for staleness control has a significant inefficiency when used with a priority queue that has sample reuse enabled (reuse_cooldown_time is not None). As noted in the PR description, stale samples are filtered out after they have been retrieved from the queue, but AsyncPriorityQueue will have already re-queued them. This means stale samples are never truly purged from the buffer, and the queue might fill up with them. This can lead to get_batch repeatedly fetching and discarding stale samples, potentially causing timeouts and degrading performance.
A better approach would be to prevent re-queuing of stale samples. This logic could be moved into AsyncPriorityQueue so that it can check for staleness before re-queuing an item. This would likely require QueueStorage to provide the max_model_version to the queue instance whenever it's updated. While this is a WIP, this is a critical design point to address for the feature to be robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, need to update AsyncPriorityQueue to account for staleness control.
| for exp in exp_list: | ||
| if exp.info["model_version"] > self.max_model_version: | ||
| self.max_model_version = exp.info["model_version"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop to update max_model_version can be made more efficient. Currently, it might perform multiple assignments to self.max_model_version within a single batch. A more efficient approach is to find the maximum version within the batch first, and then update self.max_model_version only once if needed.
| for exp in exp_list: | |
| if exp.info["model_version"] > self.max_model_version: | |
| self.max_model_version = exp.info["model_version"] | |
| if exp_list: | |
| max_version_in_batch = max(exp.info["model_version"] for exp in exp_list) | |
| if max_version_in_batch > self.max_model_version: | |
| self.max_model_version = max_version_in_batch |
Description
As the title says.
Motivation:
Issues could happen with current queue buffer when
sync_style = dynamic_by_explorer:A minimal implementation of staleness control:
max_model_versioninQueueStorage(which covers FIFO queue and priority queue), namely the largest model version that has been seen so far. This property is updated every time new samples are put into the storage.Caveat:
For priority queue, even if a sample has excessive staleness, it might still be put back into the queue, which is controlled by
AsyncPriorityQueue. This might not affect correctness of the staleness control mechanism, but could be inefficient.TODO:
AsyncPriorityQueueto account for staleness control.SampleStrategyinstead. Need to decide if this is a better option.Checklist
Please check the following items before code is ready to be reviewed.