| 1 | +# Efficient LAMMPS Trajectory Frame Reading |
| 2 | + |
| 3 | +This document describes the efficient trajectory frame reading functionality implemented for LAMMPS dump files in dpdata, addressing issue #367. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The traditional approach to reading MD trajectories loads all frames into memory and then filters them, which wastes memory and I/O when you need only a few frames from a large trajectory file. The new implementation lets you specify exactly which frames to read and skips the rest. |
| 8 | + |
| 9 | +## Key Features |
| 10 | + |
| 11 | +### 1. Selective Frame Reading |
| 12 | + |
| 13 | +Instead of loading entire trajectories, you can now specify exactly which frames to read: |
| 14 | + |
| 15 | +```python |
| 16 | +import dpdata |
| 17 | + |
| 18 | +# Load only frames 23, 56, and 78 from a trajectory |
| 19 | +system = dpdata.System( |
| 20 | +    'trajectory.dump', |
| 21 | +    fmt='lammps/dump', |
| 22 | +    type_map=['O', 'H'], |
| 23 | +    f_idx=[23, 56, 78] |
| 24 | +) |
| 25 | +``` |
| 26 | + |
| 27 | +### 2. Multi-Trajectory Pattern |
| 28 | + |
| 29 | +The implementation supports the `frames_dict` pattern requested in the issue, a dictionary that maps each trajectory file to the list of frame indices to load from it: |
| 30 | + |
| 31 | +```python |
| 32 | +import dpdata.lammps.dump as dump |
| 33 | + |
| 34 | +frames_dict = { |
| 35 | +    'trajectory1.dump': [23, 56, 78], |
| 36 | +    'trajectory2.dump': [22], |
| 37 | +    'trajectory3.dump': [10, 20, 30, 40] |
| 38 | +} |
| 39 | + |
| 40 | +# Load specified frames from multiple trajectories |
| 41 | +data = dump.load_frames_from_trajectories(frames_dict, type_map=['O', 'H']) |
| 42 | +``` |
| 43 | + |
| 44 | +### 3. Efficient Block Reading |
| 45 | + |
| 46 | +The implementation uses block-based reading with `itertools.zip_longest` to skip frames efficiently: |
| 47 | + |
| 48 | +- Determines frame structure (lines per frame) upfront |
| 49 | +- Reads only requested frame blocks |
| 50 | +- Skips unwanted frames without processing them |
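The block-reading idea above can be sketched as follows. This is a minimal illustration, not the dpdata implementation: `read_frame_blocks` is a hypothetical helper, the dump text is synthetic, and the sketch assumes every frame has exactly `nlines` lines. Note that unwanted lines are still pulled from the file; what is skipped is any parsing of them.

```python
import io
from itertools import zip_longest


def read_frame_blocks(fileobj, nlines, wanted):
    """Return {frame_index: lines} for the requested frames.

    Hypothetical helper (not part of dpdata). Assumes every frame
    occupies exactly `nlines` lines.
    """
    wanted = {i for i in wanted if i >= 0}
    frames = {}
    if not wanted:
        return frames
    last = max(wanted)
    lines = iter(fileobj)
    # nlines references to one iterator: each tuple is one frame block.
    # A truncated final block is padded with None.
    blocks = zip_longest(*[lines] * nlines, fillvalue=None)
    for idx, block in enumerate(blocks):
        if idx in wanted:
            frames[idx] = [ln.rstrip("\n") for ln in block if ln is not None]
        if idx >= last:
            break  # nothing requested beyond this frame
    return frames


# Tiny synthetic dump: 4 frames, 1 atom each, 10 lines per frame.
FRAME = (
    "ITEM: TIMESTEP\n{step}\n"
    "ITEM: NUMBER OF ATOMS\n1\n"
    "ITEM: BOX BOUNDS pp pp pp\n0 10\n0 10\n0 10\n"
    "ITEM: ATOMS id type x y z\n1 1 0.0 0.0 0.0\n"
)
dump_text = "".join(FRAME.format(step=s) for s in range(4))

selected = read_frame_blocks(io.StringIO(dump_text), nlines=10, wanted=[2, 0])
print(sorted(selected))   # frame indices that were kept
print(selected[2][:2])    # first two header lines of frame 2
```

Stopping at the largest requested index means the tail of the file is never read when only early frames are needed.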
| 51 | + |
| 52 | +## API Reference |
| 53 | + |
| 54 | +### Enhanced System Constructor |
| 55 | + |
| 56 | +```python |
| 57 | +dpdata.System( |
| 58 | +    file_name, |
| 59 | +    fmt='lammps/dump', |
| 60 | +    f_idx=None,  # NEW: List of frame indices to load |
| 61 | +    **kwargs |
| 62 | +) |
| 63 | +``` |
| 64 | + |
| 65 | +**Parameters:** |
| 66 | +- `f_idx` (list[int], optional): Specific frame indices to load (0-based). If provided, `begin` and `step` parameters are ignored. |
| 67 | + |
| 68 | +### New Functions |
| 69 | + |
| 70 | +#### `dpdata.lammps.dump.read_frames(fname, f_idx)` |
| 71 | + |
| 72 | +Efficiently read specific frames from a LAMMPS dump file. |
| 73 | + |
| 74 | +**Parameters:** |
| 75 | +- `fname`: The dump file path |
| 76 | +- `f_idx`: List of frame indices to read (0-based) |
| 77 | + |
| 78 | +**Returns:** |
| 79 | +- List of lines for the requested frames |
| 80 | + |
| 81 | +#### `dpdata.lammps.dump.load_frames_from_trajectories(frames_dict, **kwargs)` |
| 82 | + |
| 83 | +Load frames from multiple trajectory files using the frames_dict pattern. |
| 84 | + |
| 85 | +**Parameters:** |
| 86 | +- `frames_dict`: Dictionary mapping file paths to lists of frame indices |
| 87 | +- `**kwargs`: Additional arguments passed to `system_data` (e.g., `type_map`, `unwrap`) |
| 88 | + |
| 89 | +**Returns:** |
| 90 | +- Combined system data dictionary |
| 91 | + |
| 92 | +#### `dpdata.lammps.dump.get_frame_nlines(fname)` |
| 93 | + |
| 94 | +Determine the number of lines per frame in a LAMMPS dump file. |
| 95 | + |
| 96 | +**Parameters:** |
| 97 | +- `fname`: The dump file path |
| 98 | + |
| 99 | +**Returns:** |
| 100 | +- Number of lines per frame (int) |
| 101 | + |
| 102 | +## Performance Benefits |
| 103 | + |
| 104 | +The efficient frame reading provides several advantages: |
| 105 | + |
| 106 | +1. **Memory Efficiency**: Only loads requested frames into memory |
| 107 | +2. **I/O Efficiency**: Skips unwanted frames during file reading |
| 108 | +3. **Processing Efficiency**: No need to process and then discard unwanted frames |
| 109 | + |
| 110 | +For large trajectory files with many frames, this can provide significant speedups when you only need a small subset of frames. |
| 111 | + |
| 112 | +## Backward Compatibility |
| 113 | + |
| 114 | +The implementation maintains full backward compatibility: |
| 115 | + |
| 116 | +- Existing code using `begin` and `step` parameters continues to work unchanged |
| 117 | +- All existing tests pass without modification |
| 118 | +- The new `f_idx` parameter is optional and defaults to `None` |
| 119 | + |
| 120 | +## Examples |
| 121 | + |
| 122 | +### Basic Usage |
| 123 | + |
| 124 | +```python |
| 125 | +import dpdata |
| 126 | + |
| 127 | +# Traditional approach (loads all frames) |
| 128 | +system_all = dpdata.System('traj.dump', fmt='lammps/dump', type_map=['O', 'H']) |
| 129 | + |
| 130 | +# Efficient approach (loads only specific frames) |
| 131 | +system_subset = dpdata.System( |
| 132 | +    'traj.dump', |
| 133 | +    fmt='lammps/dump', |
| 134 | +    type_map=['O', 'H'], |
| 135 | +    f_idx=[10, 50, 100] |
| 136 | +) |
| 137 | +``` |
| 138 | + |
| 139 | +### Multi-Trajectory Loading |
| 140 | + |
| 141 | +```python |
| 142 | +import dpdata  # needed for dpdata.System below |
| 143 | +import dpdata.lammps.dump as dump |
| 144 | +# Define which frames to load from each trajectory |
| 145 | +frames_dict = { |
| 146 | +    'run1/traj.dump': [100, 200, 300], |
| 147 | +    'run2/traj.dump': [50, 150, 250], |
| 148 | +    'run3/traj.dump': [75, 175] |
| 149 | +} |
| 150 | + |
| 151 | +# Load all specified frames |
| 152 | +data = dump.load_frames_from_trajectories(frames_dict, type_map=['C', 'H', 'O']) |
| 153 | + |
| 154 | +# Convert to dpdata System if needed |
| 155 | +system = dpdata.System(data=data) |
| 156 | +``` |
| 157 | + |
| 158 | +### Performance Comparison |
| 159 | + |
| 160 | +```python |
| 161 | +import time |
| 162 | +import dpdata |
| 163 | + |
| 164 | +# Time traditional approach |
| 165 | +start = time.perf_counter()  # monotonic, high-resolution timer |
| 166 | +system = dpdata.System('large_traj.dump', fmt='lammps/dump', type_map=['O', 'H']) |
| 167 | +filtered = system.sub_system([100, 500, 1000]) |
| 168 | +traditional_time = time.perf_counter() - start |
| 169 | + |
| 170 | +# Time efficient approach |
| 171 | +start = time.perf_counter() |
| 172 | +system = dpdata.System( |
| 173 | +    'large_traj.dump', |
| 174 | +    fmt='lammps/dump', |
| 175 | +    type_map=['O', 'H'], |
| 176 | +    f_idx=[100, 500, 1000] |
| 177 | +) |
| 178 | +efficient_time = time.perf_counter() - start |
| 179 | + |
| 180 | +print(f"Speedup: {traditional_time / efficient_time:.1f}x") |
| 181 | +``` |
| 182 | + |
| 183 | +## Implementation Details |
| 184 | + |
| 185 | +### Frame Structure Detection |
| 186 | + |
| 187 | +The implementation first reads the file to determine the frame structure: |
| 188 | + |
| 189 | +1. Finds the first "ITEM: TIMESTEP" line |
| 190 | +2. Counts lines until the next "ITEM: TIMESTEP" |
| 191 | +3. Uses this count as the number of lines per frame, which assumes every frame has the same number of atoms |
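The three detection steps above might look like the sketch below. `count_frame_lines` is a hypothetical stand-in for `get_frame_nlines` that takes an open file object rather than a path; the real function's handling of edge cases (for example, a file with no `ITEM: TIMESTEP` header) is not described here, so returning 0 in that case is an assumption.

```python
import io


def count_frame_lines(fileobj):
    """Count lines from the first 'ITEM: TIMESTEP' up to the next one.

    Hypothetical stand-in for get_frame_nlines. Returns 0 if no
    TIMESTEP header is found (an assumption of this sketch).
    """
    nlines = 0
    in_frame = False
    for line in fileobj:
        if line.startswith("ITEM: TIMESTEP"):
            if in_frame:
                break  # second header reached: first frame is complete
            in_frame = True
        if in_frame:
            nlines += 1
    return nlines


# One synthetic frame with two atoms: 9 header lines + 2 atom lines.
SAMPLE_FRAME = (
    "ITEM: TIMESTEP\n0\n"
    "ITEM: NUMBER OF ATOMS\n2\n"
    "ITEM: BOX BOUNDS pp pp pp\n0 10\n0 10\n0 10\n"
    "ITEM: ATOMS id type x y z\n"
    "1 1 0.0 0.0 0.0\n"
    "2 2 1.0 1.0 1.0\n"
)

nlines = count_frame_lines(io.StringIO(SAMPLE_FRAME * 3))
print(nlines)
```

Only the first frame is scanned, so the cost of detection does not grow with the number of frames in the file.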
| 192 | + |
| 193 | +### Block-Based Reading |
| 194 | + |
| 195 | +For selective frame reading: |
| 196 | + |
| 197 | +1. Sorts requested frame indices for sequential access |
| 198 | +2. Uses file position to skip to frame boundaries |
| 199 | +3. Reads frame blocks only for requested indices |
| 200 | +4. Combines results while preserving order |
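One possible reading of these steps, using explicit byte offsets, is sketched below. It is an illustration under stated assumptions rather than the actual dpdata code: both function names are hypothetical, the file is opened in binary mode so offsets are exact, and "preserving order" is interpreted as returning frames in the order the caller requested them.

```python
import io


def index_frame_offsets(fileobj):
    """Scan a binary file once; return (frame start offsets, total size)."""
    offsets = []
    pos = 0
    fileobj.seek(0)
    for line in fileobj:
        if line.startswith(b"ITEM: TIMESTEP"):
            offsets.append(pos)
        pos += len(line)
    return offsets, pos


def read_frames_by_offset(fileobj, f_idx):
    """Return the requested frames as lists of lines, in request order."""
    offsets, size = index_frame_offsets(fileobj)
    bounds = offsets + [size]  # frame i spans bounds[i] .. bounds[i + 1]
    frames = {}
    # Visit indices in ascending order so seeks only move forward.
    for idx in sorted(set(f_idx)):
        if 0 <= idx < len(offsets):  # skip negative and out-of-range indices
            fileobj.seek(bounds[idx])
            raw = fileobj.read(bounds[idx + 1] - bounds[idx])
            frames[idx] = raw.decode().splitlines()
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return [frames[i] for i in dict.fromkeys(f_idx) if i in frames]


# Synthetic dump: 4 frames, 1 atom each.
FRAME = (
    "ITEM: TIMESTEP\n{step}\n"
    "ITEM: NUMBER OF ATOMS\n1\n"
    "ITEM: BOX BOUNDS pp pp pp\n0 10\n0 10\n0 10\n"
    "ITEM: ATOMS id type x y z\n1 1 0.0 0.0 0.0\n"
)
dump_bytes = "".join(FRAME.format(step=s) for s in range(4)).encode()

out = read_frames_by_offset(io.BytesIO(dump_bytes), [3, 1, 3, 9, -1])
print([frame[1] for frame in out])  # timestep value of each returned frame
```

Unlike a fixed lines-per-frame scheme, an offset index does not require frames of equal length, at the cost of one full pass over the file to build the index.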
| 201 | + |
| 202 | +### Error Handling |
| 203 | + |
| 204 | +The implementation handles various edge cases gracefully: |
| 205 | + |
| 206 | +- Empty frame index lists return empty results |
| 207 | +- Out-of-range indices are skipped silently |
| 208 | +- Duplicate indices are automatically deduplicated |
| 209 | +- Negative indices are ignored |
| 210 | + |
| 211 | +As a result, invalid indices never raise an exception; the returned data simply contains fewer frames than requested. |
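The edge-case rules listed above can be expressed as a single normalization step. The helper below is hypothetical (its name and the `n_frames` argument are assumptions for illustration) and only shows the filtering logic, not how dpdata obtains the frame count.

```python
def normalize_frame_indices(f_idx, n_frames):
    """Return sorted, de-duplicated indices that fall inside [0, n_frames).

    Hypothetical helper: negative and out-of-range indices are dropped
    silently, and an empty input yields an empty list.
    """
    return sorted({i for i in f_idx if 0 <= i < n_frames})


print(normalize_frame_indices([5, -1, 2, 2, 99], n_frames=10))  # → [2, 5]
print(normalize_frame_indices([], n_frames=10))                 # → []
```

Because nothing is raised, a caller who needs strict validation can compare the length of the result with the length of the original request.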