fix: remove the use of BufferedIterator
#4737
Conversation
Pull Request Overview
This PR refactors the data loading mechanism by removing BufferedIterator and replacing it with a simpler cycle_iterator to streamline iterator behavior during training. Key changes include:
- Removal of the BackgroundConsumer and BufferedIterator classes in dataloader.py.
- Replacement of BufferedIterator with cycle_iterator in training.py, along with corresponding function renaming.
- Minor format and comment updates for clarity regarding shuffling in distributed settings.
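The replacement described above can be sketched in isolation. The following is a hypothetical standalone version of the helper, not the exact code merged in the PR, but it shows the core idea: `next()` on the result never raises `StopIteration`, so the training loop needs no restart logic.

```python
from collections.abc import Iterable, Iterator

def cycle_iterator(iterable: Iterable) -> Iterator:
    """Yield items from `iterable` forever, restarting at the end.

    Each pass requests a fresh iterator, so an exhausted epoch simply
    rolls over into the next one instead of raising StopIteration.
    """
    while True:
        yield from iterable

# Taking five items from a two-element iterable wraps around:
it = cycle_iterator([1, 2])
print([next(it) for _ in range(5)])  # → [1, 2, 1, 2, 1]
```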
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deepmd/pt/utils/dataloader.py | Removed BufferedIterator and its background thread consumer; adjusted shuffle logic. |
| deepmd/pt/train/training.py | Updated data loader function to use cycle_iterator and replaced BufferedIterator usage. |
Comments suppressed due to low confidence (1)
deepmd/pt/train/training.py:1075

- Now that BufferedIterator is replaced with cycle_iterator, please confirm that using `next(iterator)` without a stop condition is intentional and that the infinite cycling behavior is well documented for future maintainers.

```python
batch_data = next(iterator)
```
📝 Walkthrough

The changes replace the previous buffered iterator mechanism for data loading with a new infinite cycling iterator. The new approach simplifies data fetching in training by continuously yielding batches without raising `StopIteration`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer
    participant DataLoader
    loop Training Loop
        Trainer->>DataLoader: next(batch) via cycle_iterator
        DataLoader-->>Trainer: batch
        Trainer->>Trainer: process batch
    end
```
Actionable comments posted: 1
♻️ Duplicate comments (1)

deepmd/pt/train/training.py (1)

162-174: Docstring added – implementation looks good

The new `cycle_iterator` does exactly what is required (an infinite stream without buffering the whole iterable as `itertools.cycle` would). The docstring added in this revision addresses the documentation gap raised in the previous review.
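The distinction this comment draws — a `while`/`yield from` loop re-iterates the source, while `itertools.cycle` caches every item after the first pass — can be demonstrated with a small sketch. `CountingIterable` is a hypothetical helper introduced here only to count how many fresh iterators each approach requests:

```python
import itertools

class CountingIterable:
    """Iterable that records how many fresh iterators are requested."""
    def __init__(self, data):
        self.data = data
        self.iter_calls = 0
    def __iter__(self):
        self.iter_calls += 1
        return iter(self.data)

def cycle_iterator(iterable):
    while True:
        yield from iterable

src = CountingIterable([1, 2, 3])
it = cycle_iterator(src)
for _ in range(6):      # two full passes over the data
    next(it)
print(src.iter_calls)   # → 2: each pass requests a fresh iterator

src2 = CountingIterable([1, 2, 3])
it2 = itertools.cycle(src2)
for _ in range(6):
    next(it2)
print(src2.iter_calls)  # → 1: itertools.cycle replays a saved copy of the items
```

For a DataLoader this matters: re-iterating picks up new shuffling each epoch, whereas a cached copy would replay the first epoch's batches forever.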
🧹 Nitpick comments (3)

deepmd/pt/utils/dataloader.py (1)

169-172: Clarify and simplify the `shuffle` expression

The current multi-line expression is hard to read and quite easy to misinterpret because the comment splits the boolean logic:

```python
shuffle=(
    not (dist.is_available() and dist.is_initialized())
)  # distributed sampler will do the shuffling by default
and shuffle,
```

- Moving the comment out of the expression and writing the condition in a single line improves readability.
- The intent ("disable explicit shuffle when a distributed sampler is present") can be conveyed more clearly.

```diff
-            shuffle=(
-                not (dist.is_available() and dist.is_initialized())
-            )  # distributed sampler will do the shuffling by default
-            and shuffle,
+            # Disable explicit shuffling when a distributed sampler is used;
+            # otherwise respect the user-supplied `shuffle` flag
+            shuffle=shuffle and not (dist.is_available() and dist.is_initialized()),
```

No functional change – just safer, cleaner code.
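The boolean condition in the suggested diff can be checked in isolation. `effective_shuffle` is a hypothetical name used only for this sketch; the real code inlines the expression in the `DataLoader` call:

```python
def effective_shuffle(shuffle: bool, distributed: bool) -> bool:
    """Disable explicit shuffling when a distributed sampler is active;
    otherwise respect the user-supplied `shuffle` flag."""
    return shuffle and not distributed

# The DistributedSampler does its own shuffling, so the loader must not:
print(effective_shuffle(shuffle=True, distributed=True))    # → False
print(effective_shuffle(shuffle=True, distributed=False))   # → True
print(effective_shuffle(shuffle=False, distributed=False))  # → False
```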
deepmd/pt/train/training.py (2)

194-206: Consider turning on `persistent_workers` for higher worker counts

`cycle_iterator` repeatedly calls `iter(dataloader)`. With `num_workers > 0`, a new `_MultiProcessingDataLoaderIter` is returned each time, but the worker pool is re-used only if `persistent_workers=True`. Without this flag PyTorch tears down and re-spawns worker processes at every cycle, which can noticeably hurt performance for small batch sizes.

```diff
         _dataloader = DataLoader(
             _data,
             sampler=_sampler,
             batch_size=None,
             num_workers=NUM_WORKERS if dist.is_available() else 0,
             drop_last=False,
             collate_fn=lambda batch: batch,
             pin_memory=True,
+            persistent_workers=NUM_WORKERS > 0,
         )
```

Semantics stay identical; only CPU/GPU utilisation improves.
1080-1083: Minor style – one-liner ternary is clearer

Ruff's SIM108 hint is reasonable here:

```diff
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
```

Purely aesthetic – feel free to ignore if you prefer the current style.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- deepmd/pt/train/training.py (4 hunks)
- deepmd/pt/utils/dataloader.py (1 hunks)
Actionable comments posted: 0
🔭 Outside diff range comments (1)

deepmd/pt/train/training.py (1)

176-196: ⚠️ Potential issue

`with torch.device("cpu")` is not a context manager & `collate_fn` now breaks downstream code

Two issues introduced in this block will raise runtime errors or silently change behaviour:

- `torch.device` returns a plain `torch.device` object; it does not implement `__enter__`/`__exit__`. Using it in a `with` statement will raise `AttributeError: __enter__`.
- `collate_fn=lambda batch: batch` makes the DataLoader yield a list of length 1 instead of the expected dict. A few lines below (`batch_data.keys()`) will then crash with `AttributeError: 'list' object has no attribute 'keys'`.

Fix both issues:

```diff
     _dataloader = DataLoader(
         _data,
         sampler=_sampler,
         batch_size=None,
@@
-        collate_fn=lambda batch: batch,  # prevent extra conversion
+        # Return the single element so that `get_data` receives a dict
+        collate_fn=lambda batch: batch[0],
         pin_memory=True,
     )
-    with torch.device("cpu"):
-        _data_iter = cycle_iterator(_dataloader)
+    # Iterator lives on CPU; no context manager is required here
+    _data_iter = cycle_iterator(_dataloader)
     return _dataloader, _data_iter
```

Without this patch training will fail the first time the loader is iterated.
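The list-versus-dict breakage this comment flags can be reproduced with plain Python, no torch required. The sample dict and collate names below are illustrative, assuming (as the comment does) that `collate_fn` receives a one-element list:

```python
# A stand-in for the sample dicts the DeepMD loader yields (keys illustrative)
sample = {"coord": [0.0, 1.0], "energy": -1.5}
batch = [sample]  # the one-element list collate_fn receives in this scenario

def identity_collate(b):
    """The reviewed code: returns the list unchanged."""
    return b

def unwrap_collate(b):
    """The suggested fix: return the single element so callers get a dict."""
    return b[0]

print(type(identity_collate(batch)))    # a list – .keys() on it raises AttributeError
print(unwrap_collate(batch) is sample)  # → True: downstream batch_data.keys() works
```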
🧹 Nitpick comments (2)

deepmd/pt/train/training.py (2)

162-174: Docstring is great – consider adding minimal type hints for clarity

The helper is small and clear, but because it is now a central piece of the new data-loading strategy it may be worth giving readers a precise idea of what goes in and what comes out:

```diff
-def cycle_iterator(iterable: Iterable):
+from typing import Iterator, TypeVar
+T = TypeVar("T")
+
+def cycle_iterator(iterable: Iterable[T]) -> Iterator[T]:
```

Totally optional, yet this 3-line change makes mypy/IDE autocompletion happier and documents intent in code instead of prose.
1080-1089: Minor style: can be flattened & protect against empty batches

A ternary keeps the intent short (Ruff hint SIM108):

```diff
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
```

Additionally, consider an early guard:

```python
batch_data = next(iterator, None)
if batch_data is None:  # iterator exhausted unexpectedly
    return {}, {}, {}
```

Not critical (the infinite cycling iterator should avoid exhaustion) but gives a graceful exit if something upstream changes.
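The two-argument form of `next` shown in this guard returns the default instead of raising `StopIteration`. A minimal sketch — `get_batch` is a hypothetical wrapper name, not the method in the PR:

```python
def get_batch(iterator):
    """Fetch the next batch; fall back to empty dicts if the iterator is exhausted."""
    batch_data = next(iterator, None)
    if batch_data is None:  # iterator exhausted unexpectedly
        return {}, {}, {}
    return batch_data

print(get_batch(iter([])))           # → ({}, {}, {})
print(get_batch(iter([{"x": 1}])))   # → {'x': 1}
```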
Actionable comments posted: 0
🧹 Nitpick comments (1)

deepmd/pt/train/training.py (1)

1083-1091: Simplify the iterator selection with a ternary operator.

The code can be more concise by using a ternary operator for the iterator selection.

```diff
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
```

The handling of a `None` iterator and the multi-task scenario is well implemented and should correctly handle the case when no validation iterator is present.
🔇 Additional comments (4)

deepmd/pt/train/training.py (4)

5-7: Import correctly added for the Iterable type.

The addition of the `Iterable` import from `collections.abc` properly supports the type annotation in the new `cycle_iterator` function.

164-177: Good implementation of cycle_iterator with a complete docstring.

The newly added `cycle_iterator` function provides a clean and simple way to create an infinite iterator that cycles through the given iterable. The implementation using `yield from` is elegant and efficient. The function has proper documentation including argument types and return values.

This is a good replacement for the previous `BufferedIterator` class that simplifies the data loading mechanism and avoids the problematic behavior during garbage collection and multiprocessing.

179-198: Function properly renamed and refactored to use cycle_iterator.

The function has been correctly renamed from `get_dataloader_and_buffer` to `get_dataloader_and_iter` to reflect the change in implementation. The returned iterator is now created using the new `cycle_iterator` function instead of the previous `BufferedIterator`.

200-223: Variable names consistently updated to reflect the implementation change.

All relevant variable names have been updated from references to "buffer" to references to "iter", maintaining consistency throughout the code.
Actionable comments posted: 0
🧹 Nitpick comments (2)

source/tests/pt/model/test_saveload_se_e2_a.py (1)

75-78: Consider importing cycle_iterator instead of defining it locally

The `cycle_iterator` function is well-implemented as an infinite generator, but it's also defined in deepmd/pt/train/training.py with proper typing and documentation. Consider importing it from there to avoid code duplication and ensure consistent behavior across the codebase.

```diff
-def cycle_iterator(iterable):
-    while True:
-        yield from iterable
+from deepmd.pt.train.training import cycle_iterator
```

source/tests/pt/model/test_saveload_dpa1.py (1)

75-78: Consider importing cycle_iterator instead of defining it locally

The `cycle_iterator` function is implemented identically in both test files. To follow the DRY principle (Don't Repeat Yourself), consider importing it from deepmd/pt/train/training.py where it's already defined with proper typing and documentation.

```diff
-def cycle_iterator(iterable):
-    while True:
-        yield from iterable
+from deepmd.pt.train.training import cycle_iterator
```
🔇 Additional comments (4)

source/tests/pt/model/test_saveload_se_e2_a.py (2)

80-80: LGTM! Proper replacement of BufferedIterator with cycle_iterator

This change correctly implements the new infinite cycling iterator pattern, which aligns with the PR objective to fix issues related to garbage collection with PyTorch DataLoader workers.

112-112: LGTM! Simplified data retrieval

The simplified `get_data` method now directly calls `next` on the infinite iterator without needing to handle `StopIteration` exceptions or reinitialize the iterator. This is cleaner and more maintainable.

source/tests/pt/model/test_saveload_dpa1.py (2)

80-80: LGTM! Proper replacement of BufferedIterator with cycle_iterator

This change correctly implements the new infinite cycling iterator approach, which aligns with the PR objective to fix garbage collection issues with PyTorch DataLoader workers.

118-118: LGTM! Simplified data retrieval

The method now directly calls `next` on the infinite iterator without handling `StopIteration` exceptions. This simplification makes the code cleaner and aligns with the new data loading mechanism.
Actionable comments posted: 0
🧹 Nitpick comments (2)

deepmd/pt/train/training.py (2)

164-177: Well-implemented infinite cycling iterator with proper docstring.

This new `cycle_iterator` function elegantly replaces the previous `BufferedIterator` with a simpler approach that continuously cycles through the dataloader. The docstring is comprehensive, and the use of `with torch.device("cpu"):` helps avoid potential GPU memory issues during iteration.

Two minor docstring format improvements:

- The "Yields" section uses dashes while "Args" doesn't – consider standardizing
- The return description could specify it's an infinite iterator

```diff
 def cycle_iterator(iterable: Iterable):
     """
     Produces an infinite iterator by repeatedly cycling through the given iterable.

     Args:
         iterable (Iterable): The iterable to cycle through.

-    Yields
-    ------
-    Any: The next item from the iterable, cycling back to the beginning when the end is reached.
+    Yields:
+        Any: The next item from the iterable, cycling back to the beginning when the end is reached.
     """
     with torch.device("cpu"):
         while True:
             yield from iterable
```

1084-1092: Simplified data fetching with proper null checks.

The `get_data` method has been simplified by directly using the infinite cycling iterator without needing to handle `StopIteration` exceptions. The null check for `iterator is None` ensures the method gracefully handles the case where there's no iterator.

As identified by the static analyzer, you could further simplify this code with a ternary operator.

```diff
 def get_data(self, is_train=True, task_key="Default"):
-    if is_train:
-        iterator = self.training_data
-    else:
-        iterator = self.validation_data
+    iterator = self.training_data if is_train else self.validation_data
     if self.multi_task:
         iterator = iterator[task_key]
     if iterator is None:
```
🔇 Additional comments (3)

deepmd/pt/train/training.py (3)

5-7: Appropriate import for the new iterator implementation.

The addition of `Iterable` from `collections.abc` is a proper type hint for the new cycling iterator functionality.

180-199: Function renamed to accurately reflect its new behavior.

The function has been appropriately renamed from `get_dataloader_and_buffer` to `get_dataloader_and_iter` to reflect that it now returns an infinite cycling iterator instead of a buffered iterator. The implementation correctly uses the new `cycle_iterator` function.

208-209: Consistent application of the new cycling iterator pattern.

The validation data iterator is now created using the same cycling iterator pattern as the training data, ensuring consistency in the implementation.

Also applies to: 217-217
Actionable comments posted: 0
🧹 Nitpick comments (1)

deepmd/pt/train/training.py (1)

1084-1094: Significantly simplified data retrieval logic.

The `get_data` method has been greatly simplified by removing the need to handle `StopIteration` exceptions and iterator recreation. This addresses the core issue with garbage collection during multiprocessing.

Consider using a ternary operator for better readability:

```diff
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
```
🔇 Additional comments (7)

deepmd/pt/train/training.py (7)

5-7: Addition of Iterable type hint import is appropriate.

This import supports the type annotation in the new `cycle_iterator` function, ensuring proper static type checking.

164-179: Well-documented cycle_iterator implementation.

The `cycle_iterator` function elegantly solves the issue with `BufferedIterator` by providing a simpler mechanism that continuously cycles through data without raising `StopIteration`. The docstring is clear and the implementation is concise.

The `with torch.device("cpu")` context is a good defensive measure to ensure iterator creation happens on CPU, which may help prevent GPU memory issues during initialization.

181-182: Function name updated to reflect new implementation.

The function name change from `get_dataloader_and_buffer` to `get_dataloader_and_iter` accurately reflects its new implementation that returns a cycling iterator instead of a buffered iterator.

197-197: Good addition of the pin_memory_device parameter.

Explicitly specifying the device for pinned memory improves clarity and ensures consistency with the intended device usage.

199-200: Replaced BufferedIterator with cycle_iterator.

This change is the core of the fix, eliminating the problematic `BufferedIterator` implementation that was causing garbage collection issues, in favour of a simpler solution.

209-210: Consistently updated validation data handling.

The validation data iterator creation now uses the same simplified approach as the training data, maintaining consistency throughout the code.

Also applies to: 218-218

1090-1093: Good defensive check for None iterator.

This code properly handles the case when no validation data is provided, gracefully returning empty dictionaries instead of raising an exception. This addresses a potential issue identified in previous reviews.
Actionable comments posted: 0
🧹 Nitpick comments (1)

deepmd/pt/train/training.py (1)

1082-1086: Consider using a ternary operator for iterator selection.

The if-else block could be simplified.

```diff
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
```
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Analyze (python)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test C++ (false)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test C++ (true)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (8)
deepmd/pt/train/training.py (8)
5-7: Proper import for iterator type hinting. Good addition of the `Iterable` import from `collections.abc` for proper type hinting in the new cycling iterator implementation.
164-177: Well-documented `cycle_iterator` implementation. The new `cycle_iterator` function is a clean, well-documented replacement for the previous `BufferedIterator`. It provides an infinite yielding pattern that avoids `StopIteration` exceptions, addressing the root cause of the garbage-collection issues mentioned in the PR objectives.
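As a rough illustration, a minimal pure-Python sketch of such a cycling iterator (the function name matches the review; the body is an assumption, not the PR's exact code, which additionally pins iterator creation to the CPU device):

```python
from collections.abc import Iterable, Iterator


def cycle_iterator(iterable: Iterable) -> Iterator:
    """Yield items from `iterable` indefinitely, starting a fresh pass
    each time the underlying iterator is exhausted, so callers never
    observe StopIteration. Note: an empty iterable would make this
    loop spin forever."""
    while True:
        yield from iterable


# Example: an endless stream over a 3-element "dataset".
it = cycle_iterator([1, 2, 3])
print([next(it) for _ in range(5)])  # -> [1, 2, 3, 1, 2]
```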
179-180: Function renamed to better reflect its purpose. Renaming `get_dataloader_and_buffer` to `get_dataloader_and_iter` accurately reflects the implementation change from buffered iterators to cycling iterators.
196-197: Simplified iteration mechanism. Replacing `BufferedIterator` with the simpler `cycle_iterator` eliminates the complex background prefetching thread, addressing the garbage-collection issues when using PyTorch DataLoader with multiple threads and processes.
199-201: Consistent naming convention with implementation changes. Variable names updated from `buffer` to `iter` consistently reflect the change in approach.
205-209: Consistent naming in validation data handling. Updates to variable names maintain consistency throughout the codebase.
214-216: Proper handling for missing validation data. Correctly initializes `validation_data_iter` as `None` when validation data is not provided.
1087-1091: Robust handling of `None` iterators. The code correctly checks whether the iterator is `None` before attempting to get the next batch, preventing the `TypeError` that was previously raised when no validation data was available.
This is a significant improvement that makes the code more robust and addresses an issue mentioned in the past review comments.
🧰 Tools
🪛 Ruff (0.8.2)
1091-1091: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)
1083-1092: Simplified data fetching with proper null checking. The `get_data` method has been significantly simplified by removing the previous exception-handling pattern. The addition of the null check on line 1090 ensures graceful handling when no validation data exists. The conditional block could be simplified using a ternary operator:
```diff
- if is_train:
-     iterator = self.training_data
- else:
-     iterator = self.validation_data
+ iterator = self.training_data if is_train else self.validation_data
```

🧰 Tools
🪛 Ruff (0.8.2)
1084-1087: Use ternary operator
`iterator = self.training_data if is_train else self.validation_data` instead of `if`-`else` block. Replace the `if`-`else` block with `iterator = self.training_data if is_train else self.validation_data` (SIM108)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `deepmd/pt/train/training.py` (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py
1084-1087: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block
Replace if-else-block with iterator = self.training_data if is_train else self.validation_data
(SIM108)
🔇 Additional comments (4)
deepmd/pt/train/training.py (4)
5-7: Import change for new `cycle_iterator` implementation. The addition of the `Iterable` type from `collections.abc` is appropriate for the type hinting in the new cycling iterator function.
164-179: Well-implemented infinite cycling iterator with proper documentation. The new `cycle_iterator` function elegantly replaces the previous `BufferedIterator` implementation, creating an infinite iterator that never raises `StopIteration`. This addresses the garbage-collection issues mentioned in the PR objective.

Good practices used:

- Clear docstring explaining functionality and return values
- Using `with torch.device("cpu")` to ensure iterator creation happens on CPU
- Using `yield from` for efficient delegation to the underlying iterator
181-199: DataLoader configuration enhancement. The renamed function now properly reflects its purpose, and the addition of `pin_memory=True` is a good optimization for GPU training that can speed up data transfer.
201-224: Variable naming consistency for iterator pattern. The variable names have been updated to match the new implementation pattern, consistently using `*_iter` instead of the previous buffer references.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##            devel    #4737   +/-   ##
=======================================
  Coverage   84.69%   84.70%
=======================================
  Files         697      697
  Lines       67473    67425     -48
  Branches     3541     3540      -1
=======================================
- Hits        57145    57111     -34
+ Misses       9196     9183     -13
+ Partials     1132     1131      -1
```

☔ View full report in Codecov by Sentry.
This pull request refactors the data loading mechanism in the training module by replacing the `BufferedIterator` class with a simpler `cycle_iterator` function.

I was able to reproduce this error on a system of Ni6Fe10 provided by @iProzd. The error log shows something strange happening during garbage collection. I've tried the following approaches:

- `NUM_WORKERS=0`: worked
- calling `gc.collect()` when a DataLoader finishes its epoch: worked, but ~10% slower

Error log

The problem couples PyTorch tensors with Python threads and pipes, which makes the root cause hard to locate.
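The per-epoch `gc.collect()` workaround mentioned above can be sketched roughly as follows (the loop structure is a hypothetical illustration of the experiment, not code from the PR):

```python
import gc


def train_one_run(dataloader, num_epochs):
    """Minimal training loop showing an explicit per-epoch collection."""
    steps = 0
    for _ in range(num_epochs):
        for _batch in dataloader:
            steps += 1  # stand-in for the forward/backward pass
        # Forcing a collection after each epoch avoided the crash in the
        # author's tests, at the cost of roughly 10% training speed.
        gc.collect()
    return steps


print(train_one_run([1, 2, 3], num_epochs=2))  # -> 6
```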
I tested this PR on single-task training and multi-task training, and the training speed (in s/1000 steps) is almost the same:
Fix #4586