Skip to content

Conversation

@caic99
Copy link
Member

@caic99 caic99 commented May 13, 2025

This pull request refactors the data loading mechanism in the training module by replacing the BufferedIterator class with a simpler cycle_iterator function.

I was able to reproduce this error on a system of Ni6Fe10 provided by @iProzd . The error log shows there is something strange when running garbage collection. I've tried the following approaches:

  • Not using GPU for training, but CPU: failed
  • Set NUM_WORKERS=0: worked
  • Manually run garbage collection gc.collect() when a DataLoader finishes its epoch: worked, but ~10% slower
Error log

Fatal Python error: Aborted

Thread 0x00007f68b9289700 (most recent call first):
  File "/root/miniconda3/lib/python3.10/threading.py", line 320 in wait
  File "/root/miniconda3/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  File "/root/miniconda3/lib/python3.10/threading.py", line 953 in run
  File "/root/miniconda3/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/root/miniconda3/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f6c595d7280 (most recent call first):
  Garbage-collecting
  File "/root/miniconda3/lib/python3.10/ast.py", line 99 in _convert
  File "/root/miniconda3/lib/python3.10/ast.py", line 110 in literal_eval
  File "/root/miniconda3/lib/python3.10/site-packages/numpy/lib/utils.py", line 1078 in safe_eval
  File "/root/miniconda3/lib/python3.10/site-packages/numpy/lib/format.py", line 623 in _read_array_header
  File "/root/miniconda3/lib/python3.10/site-packages/numpy/lib/format.py", line 784 in read_array
  File "/root/miniconda3/lib/python3.10/site-packages/numpy/lib/npyio.py", line 456 in load
  File "/aisi/cc/deepmd-kit/deepmd/utils/path.py", line 187 in load_numpy
  File "/aisi/cc/deepmd-kit/deepmd/utils/data.py", line 634 in _load_data
  File "/aisi/cc/deepmd-kit/deepmd/utils/data.py", line 526 in _load_set
  File "/aisi/cc/deepmd-kit/deepmd/utils/data.py", line 251 in get_item_torch
  File "/aisi/cc/deepmd-kit/deepmd/pt/utils/dataset.py", line 39 in __getitem__
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52 in <listcomp>
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52 in fetch
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 789 in _next_data
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 733 in __next__
  File "/aisi/cc/deepmd-kit/deepmd/pt/utils/dataloader.py", line 204 in __getitem__
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54 in fetch
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 349 in _worker_loop
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_fork.py", line 71 in _launch
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_fork.py", line 19 in __init__
  File "/root/miniconda3/lib/python3.10/multiprocessing/context.py", line 281 in _Popen
  File "/root/miniconda3/lib/python3.10/multiprocessing/context.py", line 224 in _Popen
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 121 in start
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1171 in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 424 in _get_iterator
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 493 in __iter__
  File "/aisi/cc/deepmd-kit/deepmd/pt/train/training.py", line 1075 in get_data
  File "/aisi/cc/deepmd-kit/deepmd/pt/train/training.py", line 689 in step
  File "/aisi/cc/deepmd-kit/deepmd/pt/train/training.py", line 960 in run
  File "/aisi/cc/deepmd-kit/deepmd/pt/entrypoints/main.py", line 361 in train
  File "/aisi/cc/deepmd-kit/deepmd/pt/entrypoints/main.py", line 530 in main
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
  File "/aisi/cc/deepmd-kit/deepmd/main.py", line 930 in main
  File "/root/miniconda3/bin/dp", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, yaml._yaml, scipy._lib._ccallback_c, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ellip_harm_2, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl (total: 113)
Traceback (most recent call last):
  File "/root/miniconda3/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/aisi/cc/deepmd-kit/deepmd/main.py", line 930, in main
    deepmd_main(args)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/aisi/cc/deepmd-kit/deepmd/pt/entrypoints/main.py", line 530, in main
    train(
  File "/aisi/cc/deepmd-kit/deepmd/pt/entrypoints/main.py", line 361, in train
    trainer.run()
  File "/aisi/cc/deepmd-kit/deepmd/pt/train/training.py", line 960, in run
    step(step_id)
  File "/aisi/cc/deepmd-kit/deepmd/pt/train/training.py", line 705, in step
    loss.backward()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
    _engine_run_backward(
  File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 434731) is killed by signal: Aborted. 

The problem couples with pytorch tensor and python threads and pipes, which is hard to locate the root cause.

I tested this PR on single-task training and multi-task training, and the training speed (in s/1000 steps) is almost the same:

data before after
omat 235.52 235.89
multi-task pretraining 290.63 290.00

Fix #4586

Summary by CodeRabbit

  • Refactor
    • Enhanced data loading with an infinite cycling iterator for uninterrupted batch retrieval during training and validation.
    • Removed background prefetching and threading to simplify data loading utilities.
  • Style
    • Added a clarifying comment about shuffling behavior when distributed sampling is active.

Copilot AI review requested due to automatic review settings May 13, 2025 03:35
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the data loading mechanism by removing BufferedIterator and replacing it with a simpler cycle_iterator to streamline iterator behavior during training. Key changes include:

  • Removal of the BackgroundConsumer and BufferedIterator classes in dataloader.py.
  • Replacement of BufferedIterator with cycle_iterator in training.py, along with corresponding function renaming.
  • Minor format and comment updates for clarity regarding shuffling in distributed settings.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
deepmd/pt/utils/dataloader.py Removed BufferedIterator and its background thread consumer; adjusted shuffle logic.
deepmd/pt/train/training.py Updated data loader function to use cycle_iterator and replaced BufferedIterator usage.
Comments suppressed due to low confidence (1)

deepmd/pt/train/training.py:1075

  • Now that BufferedIterator is replaced with cycle_iterator, please confirm that using next(iterator) without a stop condition is intentional and that the infinite cycling behavior is well documented for future maintainers.
batch_data = next(iterator)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Chun Cai <amoycaic@gmail.com>
@caic99 caic99 requested review from iProzd and njzjz May 13, 2025 03:36
@coderabbitai
Copy link
Contributor

coderabbitai bot commented May 13, 2025

📝 Walkthrough

"""

Walkthrough

The changes replace the previous buffered iterator mechanism for data loading with a new infinite cycling iterator. The new approach simplifies data fetching in training by continuously yielding batches without raising StopIteration, eliminating explicit iterator recreation and background threading. The buffered iterator and its supporting classes are removed from the codebase.

Changes

File(s) Change Summary
deepmd/pt/train/training.py Replaced buffered iterator with infinite cycling iterator for data loading; updated data fetching logic in Trainer; removed all references to BufferedIterator.
deepmd/pt/utils/dataloader.py Removed BufferedIterator and related background prefetching/threading logic; removed BackgroundConsumer class; added comment clarifying shuffling behavior; updated WeightedRandomSampler to use explicit torch.Generator.
source/tests/pt/model/test_saveload_dpa1.py, source/tests/pt/model/test_saveload_se_e2_a.py Removed BufferedIterator usage in tests; replaced with infinite cycling iterator; simplified data fetching logic by removing exception handling and iterator recreation.

Sequence Diagram(s)

sequenceDiagram
    participant Trainer
    participant DataLoader

    loop Training Loop
        Trainer->>DataLoader: next(batch) via cycle_iterator
        DataLoader-->>Trainer: batch
        Trainer->>Trainer: process batch
    end
Loading

Assessment against linked issues

Objective Addressed Explanation
Fix nan rmse values and CUDA initialization error during training of polar model with PyTorch backend (#4586)

Suggested reviewers

  • wanghan-iapcm
  • njzjz
    """

Tip

⚡️ Faster reviews with caching
  • CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.

Enjoy the performance boost—your workflow just got faster.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6b4529 and 42dcf4a.

📒 Files selected for processing (1)
  • deepmd/pt/utils/dataloader.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/utils/dataloader.py
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test C++ (true)
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
deepmd/pt/train/training.py (1)

162-174: Docstring added – implementation looks good

The new cycle_iterator does exactly what is required (an infinite stream without buffering the whole iterable as itertools.cycle would). The docstring added in this revision addresses the documentation gap raised in the previous review.

🧹 Nitpick comments (3)
deepmd/pt/utils/dataloader.py (1)

169-172: Clarify and simplify the shuffle expression

The current multi-line expression is hard to read and quite easy to mis-interpret because the comment splits the boolean logic:

shuffle=(
    not (dist.is_available() and dist.is_initialized())
)  # ...
and shuffle,
  1. Moving the comment out of the expression and writing the condition in a single line improves readability.
  2. The intent ( “disable explicit shuffle when a distributed sampler is present” ) can be conveyed more clearly.
-                shuffle=(
-                    not (dist.is_available() and dist.is_initialized())
-                )  # distributed sampler will do the shuffling by default
-                and shuffle,
+                # Disable explicit shuffling when a distributed sampler is used;
+                # otherwise respect the user-supplied `shuffle` flag
+                shuffle=shuffle and not (dist.is_available() and dist.is_initialized()),

No functional change – just safer, cleaner code.

deepmd/pt/train/training.py (2)

194-206: Consider turning on persistent_workers for higher worker counts

cycle_iterator repeatedly calls iter(dataloader).
With num_workers > 0, a new _MultiProcessingDataLoaderIter is returned each time, but the worker pool is re-used only if persistent_workers=True. Without this flag PyTorch tears down and re-spawns worker processes at every cycle, which can noticeably hurt performance for small batch sizes.

-                _dataloader = DataLoader(
+                _dataloader = DataLoader(
                     _data,
                     sampler=_sampler,
                     batch_size=None,
                     num_workers=NUM_WORKERS if dist.is_available() else 0,
                     drop_last=False,
                     collate_fn=lambda batch: batch,
                     pin_memory=True,
+                    persistent_workers=NUM_WORKERS > 0,
                 )

Semantics stay identical; only CPU/GPU utilisation improves.


1080-1083: Minor style – one-liner ternary is clearer

Ruff’s SIM108 hint is reasonable here:

-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data

Purely aesthetic – feel free to ignore if you prefer the current style.

🧰 Tools
🪛 Ruff (0.8.2)

1080-1083: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a31b886 and 3417fe3.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py (4 hunks)
  • deepmd/pt/utils/dataloader.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1080-1083: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
deepmd/pt/train/training.py (1)

176-196: ⚠️ Potential issue

with torch.device("cpu") is not a context manager & collate_fn now breaks downstream code

Two issues introduced in this block will raise runtime errors or silently change behaviour:

  1. torch.device returns a plain torch.device object; it does not implement __enter__/__exit__. Using it in a with statement will raise AttributeError: __enter__.

  2. collate_fn=lambda batch: batch makes the DataLoader yield a list of length-1 instead of the expected dict. A few lines below (batch_data.keys()) will then crash with
    AttributeError: 'list' object has no attribute 'keys'.

Fix both issues:

                 _dataloader = DataLoader(
                     _data,
                     sampler=_sampler,
                     batch_size=None,
@@
-                    collate_fn=lambda batch: batch,  # prevent extra conversion
+                    # Return the single element so that `get_data` receives a dict
+                    collate_fn=lambda batch: batch[0],
                     pin_memory=True,
                 )
-                with torch.device("cpu"):
-                    _data_iter = cycle_iterator(_dataloader)
+                # Iterator lives on CPU; no context manager is required here
+                _data_iter = cycle_iterator(_dataloader)
                 return _dataloader, _data_iter

Without this patch training will fail the first time the loader is iterated.

🧹 Nitpick comments (2)
deepmd/pt/train/training.py (2)

162-174: Docstring is great – consider adding minimal type hints for clarity

The helper is small and clear, but because it is now a central piece of the new data-loading strategy it may be worth giving readers a precise idea of what goes in and what comes out:

-        def cycle_iterator(iterable: Iterable):
+        from typing import Iterator, TypeVar
+        T = TypeVar("T")
+
+        def cycle_iterator(iterable: Iterable[T]) -> Iterator[T]:

Totally optional, yet this 3-line change makes mypy/IDE autocompletion happier and documents intent in code instead of prose.


1080-1089: Minor style: can be flattened & protect against empty batches

A ternary keeps the intent short (ruff hint SIM108):

-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data

Additionally, consider an early guard:

batch_data = next(iterator, None)
if batch_data is None:   # iterator exhausted unexpectedly
    return {}, {}, {}

Not critical (the infinite cycling iterator should avoid exhaustion) but gives a graceful exit if something upstream changes.

🧰 Tools
🪛 Ruff (0.8.2)

1080-1083: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)


1089-1089: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3417fe3 and c224e80.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
deepmd/pt/train/training.py (1)
deepmd/pd/train/training.py (1)
  • get_data_loader (156-201)
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1080-1083: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build C++ (cpu, cpu)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1083-1091: Simplify the iterator selection with a ternary operator.

The code can be more concise by using a ternary operator for the iterator selection.

-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data

The handling of None iterator and multi-task scenario is well implemented and should correctly handle the case when no validation iterator is present.

🧰 Tools
🪛 Ruff (0.8.2)

1083-1086: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c224e80 and 6c65bba.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1083-1086: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (4)
deepmd/pt/train/training.py (4)

5-7: Import correctly added for the Iterable type.

The addition of the Iterable import from collections.abc properly supports the type annotation in the new cycle_iterator function.


164-177: Good implementation of the cycle_iterator with complete docstring.

The newly added cycle_iterator function provides a clean and simple way to create an infinite iterator that cycles through the given iterable. The implementation using yield from is elegant and efficient. The function has proper documentation including argument types and return values.

This is a good replacement for the previous BufferedIterator class that simplifies the data loading mechanism and avoids the problematic behavior during garbage collection and multiprocessing.


179-198: Function properly renamed and refactored to use cycle_iterator.

The function has been correctly renamed from get_dataloader_and_buffer to get_dataloader_and_iter to reflect the change in implementation. The returned iterator is now created using the new cycle_iterator function instead of the previous BufferedIterator.


200-223: Variable names consistently updated to reflect the implementation change.

All relevant variable names have been updated from references to "buffer" to references to "iter", maintaining consistency throughout the code.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
source/tests/pt/model/test_saveload_se_e2_a.py (1)

75-78: Consider importing cycle_iterator instead of defining it locally

The cycle_iterator function is well-implemented as an infinite generator, but it's also defined in deepmd/pt/train/training.py with proper typing and documentation. Consider importing it from there to avoid code duplication and ensure consistent behavior across the codebase.

- def cycle_iterator(iterable):
-     while True:
-         yield from iterable

+ from deepmd.pt.train.training import cycle_iterator
source/tests/pt/model/test_saveload_dpa1.py (1)

75-78: Consider importing cycle_iterator instead of defining it locally

The cycle_iterator function is implemented identically in both test files. To follow the DRY principle (Don't Repeat Yourself), consider importing it from deepmd/pt/train/training.py where it's already defined with proper typing and documentation.

- def cycle_iterator(iterable):
-     while True:
-         yield from iterable

+ from deepmd.pt.train.training import cycle_iterator
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c65bba and ae84b7e.

📒 Files selected for processing (2)
  • source/tests/pt/model/test_saveload_dpa1.py (2 hunks)
  • source/tests/pt/model/test_saveload_se_e2_a.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
source/tests/pt/model/test_saveload_se_e2_a.py (2)
source/tests/pt/model/test_saveload_dpa1.py (1)
  • cycle_iterator (75-77)
deepmd/pt/train/training.py (1)
  • cycle_iterator (164-176)
source/tests/pt/model/test_saveload_dpa1.py (2)
source/tests/pt/model/test_saveload_se_e2_a.py (1)
  • cycle_iterator (75-77)
deepmd/pt/train/training.py (1)
  • cycle_iterator (164-176)
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (4)
source/tests/pt/model/test_saveload_se_e2_a.py (2)

80-80: LGTM! Proper replacement of BufferedIterator with cycle_iterator

This change correctly implements the new infinite cycling iterator pattern, which aligns with the PR objective to fix issues related to garbage collection with PyTorch DataLoader workers.


112-112: LGTM! Simplified data retrieval

The simplified get_data method now directly calls next on the infinite iterator without needing to handle StopIteration exceptions or reinitialize the iterator. This is cleaner and more maintainable.

source/tests/pt/model/test_saveload_dpa1.py (2)

80-80: LGTM! Proper replacement of BufferedIterator with cycle_iterator

This change correctly implements the new infinite cycling iterator approach, which aligns with the PR objective to fix garbage collection issues with PyTorch DataLoader workers.


118-118: LGTM! Simplified data retrieval

The method now directly calls next on the infinite iterator without handling StopIteration exceptions. This simplification makes the code cleaner and aligns with the new data loading mechanism.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/pt/train/training.py (2)

164-177: Well-implemented infinite cycling iterator with proper docstring.

This new cycle_iterator function elegantly replaces the previous BufferedIterator with a simpler approach that continuously cycles through the dataloader. The docstring is comprehensive and the use of with torch.device("cpu"): helps avoid potential GPU memory issues during iteration.

Two minor docstring format improvements:

  1. The "Yields" section uses dashes while "Args" doesn't - consider standardizing
  2. The return description could specify it's an infinite iterator
         def cycle_iterator(iterable: Iterable):
             """
             Produces an infinite iterator by repeatedly cycling through the given iterable.
             
             Args:
                 iterable (Iterable): The iterable to cycle through.
             
-            Yields
-            ------
-            Any: The next item from the iterable, cycling back to the beginning when the end is reached.
+            Yields:
+                Any: The next item from the iterable, cycling back to the beginning when the end is reached.
             """
             with torch.device("cpu"):
                 while True:
                     yield from iterable

1084-1092: Simplified data fetching with proper null checks.

The get_data method has been simplified by directly using the infinite cycling iterator without needing to handle StopIteration exceptions. The null check for iterator is None ensures the method gracefully handles the case where there's no iterator.

As identified by the static analyzer, you could further simplify this code with a ternary operator.

-    def get_data(self, is_train=True, task_key="Default"):
-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+    def get_data(self, is_train=True, task_key="Default"):
+        iterator = self.training_data if is_train else self.validation_data
         if self.multi_task:
             iterator = iterator[task_key]
         if iterator is None:
🧰 Tools
🪛 Ruff (0.8.2)

1084-1087: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae84b7e and 45e90ea.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1084-1087: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (python)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (3)
deepmd/pt/train/training.py (3)

5-7: Appropriate import for the new iterator implementation.

The addition of Iterable from collections.abc is a proper type hint for the new cycling iterator functionality.


180-199: Function renamed to accurately reflect its new behavior.

The function has been appropriately renamed from get_dataloader_and_buffer to get_dataloader_and_iter to reflect that it now returns an infinite cycling iterator instead of a buffered iterator. The implementation correctly uses the new cycle_iterator function.


208-209: Consistent application of the new cycling iterator pattern.

The validation data iterator is now created using the same cycling iterator pattern as the training data, ensuring consistency in the implementation.

Also applies to: 217-217

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1084-1094: Significantly simplified data retrieval logic.

The get_data method has been greatly simplified by removing the need to handle StopIteration exceptions and iterator recreation. This addresses the core issue with garbage collection during multiprocessing.

Consider using a ternary operator for better readability:

-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
🧰 Tools
🪛 Ruff (0.8.2)

1085-1088: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)


1094-1094: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45e90ea and 171c510.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1085-1088: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (7)
deepmd/pt/train/training.py (7)

5-7: Addition of Iterable type hint import is appropriate.

This import supports the type annotation in the new cycle_iterator function, ensuring proper static type checking.


164-179: Well-implemented infinite cycling iterator with clear documentation.

The cycle_iterator function elegantly solves the issue with BufferedIterator by providing a simpler mechanism that continuously cycles through data without raising StopIteration. The docstring is clear and the implementation is concise.

The with torch.device("cpu") context is a good defensive measure to ensure iterator creation happens on CPU, which may help prevent GPU memory issues during initialization.


181-182: Function name updated to reflect new implementation.

The function name change from get_dataloader_and_buffer to get_dataloader_and_iter accurately reflects its new implementation that returns a cycling iterator instead of a buffered iterator.


197-197: Good addition of pin_memory_device parameter.

Explicitly specifying the device for pinned memory improves clarity and ensures consistency with the intended device usage.


199-200: Replaced BufferedIterator with cycle_iterator.

This change is the core of the fix, eliminating the problematic BufferedIterator implementation that was causing garbage collection issues with a simpler solution.


209-210: Consistently updated validation data handling.

The validation data iterator creation now uses the same simplified approach as the training data, maintaining consistency throughout the code.

Also applies to: 218-218


1090-1093: Good defensive check for None iterator.

This code properly handles the case when no validation data is provided, gracefully returning empty dictionaries instead of raising an exception. This fix addresses a potential issue identified in previous reviews.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1082-1086: Consider using a ternary operator for iterator selection.

The if-else block could be simplified.

-        if is_train:
-            iterator = self.training_data
-        else:
-            iterator = self.validation_data
+        iterator = self.training_data if is_train else self.validation_data
🧰 Tools
🪛 Ruff (0.8.2)

1082-1085: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 171c510 and 5d5c8e1.

📒 Files selected for processing (2)
  • deepmd/pt/train/training.py (4 hunks)
  • deepmd/pt/utils/dataloader.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/utils/dataloader.py
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1082-1085: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (8)
deepmd/pt/train/training.py (8)

5-7: Proper import for iterator type hinting.

Good addition of the Iterable import from collections.abc for proper type hinting in the new cycling iterator implementation.


164-177: Well-documented cycle_iterator implementation.

The new cycle_iterator function is a clean, well-documented replacement for the previous BufferedIterator. It provides an infinite yielding pattern that avoids StopIteration exceptions, addressing the root cause of the garbage collection issues mentioned in the PR objectives.


179-180: Function renamed to better reflect its purpose.

Renaming from get_dataloader_and_buffer to get_dataloader_and_iter accurately reflects the implementation change from buffered iterators to cycling iterators.


196-197: Simplified iteration mechanism.

Replacing BufferedIterator with the simpler cycle_iterator eliminates the complex background prefetching thread, addressing the garbage collection issues when using PyTorch DataLoader with multiple threads and processes.


199-201: Consistent naming convention with implementation changes.

Variable names updated from buffer to iter consistently reflect the change in approach.


205-209: Consistent naming in validation data handling.

Updates to variable names maintain consistency throughout the codebase.


214-216: Proper handling for missing validation data.

Correctly initializes validation_data_iter as None when validation data is not provided.


1087-1091: Robust handling of None iterators.

The code correctly checks if the iterator is None before attempting to get the next batch, preventing the TypeError that was previously raised when no validation data was available.

This is a significant improvement that makes the code more robust and addresses an issue mentioned in the past review comments.

🧰 Tools
🪛 Ruff (0.8.2)

1091-1091: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/pt/train/training.py (1)

1083-1092: Simplified data fetching with proper null checking

The get_data method has been significantly simplified by removing the previous exception handling pattern. The addition of the null check on line 1090 ensures graceful handling when no validation data exists.

The conditional block could be simplified using a ternary operator:

-    if is_train:
-        iterator = self.training_data
-    else:
-        iterator = self.validation_data
+    iterator = self.training_data if is_train else self.validation_data
🧰 Tools
🪛 Ruff (0.8.2)

1084-1087: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d5c8e1 and d93c948.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/train/training.py

1084-1087: Use ternary operator iterator = self.training_data if is_train else self.validation_data instead of if-else-block

Replace if-else-block with iterator = self.training_data if is_train else self.validation_data

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C library (2.14, >=2.5.0rc0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (cpu, cpu)
🔇 Additional comments (4)
deepmd/pt/train/training.py (4)

5-7: Import change for new cycle_iterator implementation

The addition of the Iterable type from collections.abc is appropriate for the type hinting in the new cycling iterator function.


164-179: Well-implemented infinite cycling iterator with proper documentation

The new cycle_iterator function elegantly replaces the previous BufferedIterator implementation, creating an infinite iterator that never raises StopIteration. This addresses the garbage collection issues mentioned in the PR objective.

Good practices used:

  • Clear docstring explaining functionality and return values
  • Using with torch.device("cpu") to ensure iterator creation happens on CPU
  • Using yield from for efficient delegation to the underlying iterator

181-199: DataLoader configuration enhancement

The renamed function now properly reflects its purpose, and the addition of pin_memory=True is a good optimization for GPU training that can speed up data transfer.


201-224: Variable naming consistency for iterator pattern

The variable names have been updated to match the new implementation pattern, consistently using *_iter instead of the previous buffer references.

@iProzd iProzd added this pull request to the merge queue May 19, 2025
Merged via the queue into deepmodeling:devel with commit 30b762e May 19, 2025
53 checks passed
@codecov
Copy link

codecov bot commented May 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.70%. Comparing base (a31b886) to head (42dcf4a).
⚠️ Report is 87 commits behind head on devel.

Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #4737   +/-   ##
=======================================
  Coverage   84.69%   84.70%           
=======================================
  Files         697      697           
  Lines       67473    67425   -48     
  Branches     3541     3540    -1     
=======================================
- Hits        57145    57111   -34     
+ Misses       9196     9183   -13     
+ Partials     1132     1131    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants