feat: implement dynamic batch sizing for ML validation #816
khresth wants to merge 7 commits into Samsung:main
Conversation
- Add MemoryMonitor class for real-time memory tracking
- Enhance MlValidator with adaptive batch size calculation
- Add memory pressure detection and automatic adjustment
- Implement garbage collection optimization
- Add comprehensive test coverage for all components
- Update requirements.txt with psutil dependency

This improves performance by maximizing throughput on systems with abundant memory while preventing OOM errors on constrained systems.
babenek
left a comment
Hi @khresth
thank you for the good idea - limit the batch size based on memory consumption.
Unfortunately it is useless in some cases, e.g. memory limitation by a supervisor (it kills the process on overrun - no evaluations or garbage collection happen).
IMHO, psutil is useless because memory allocation for a batch is predictable (with some deviation) - it can be estimated even with tests.
Tests: the correct test may be done in the TestApp class with a subprocess launch and a memory limitation. Negative and positive test cases.
Example:
$ ulimit -v 1000000
$ python -m credsweeper --path .
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: std::bad_alloc
$ python -m credsweeper --path . --ml_batch_size 4
Detected Credentials: 4362
Time Elapsed: 34.02959942817688s
The default batch size is ok for a 2G limit. Please provide more details on which issue you solved. Performance may also be unstable with odd/even batch sizes (CPU/GPU thread limitation). BTW, the GPU case may have other limitations...
So, my proposal is to add a precalculated minimal memory size per batch size and print it in --help (ml_engine may allocate an unpredictable amount). Otherwise, the tool should be used a bit differently when a memory limitation exists.
Your tests should pass in a fork action first (the main branch launches some of them on push).
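The "allocation is predictable" point above can be sketched with a simple linear fit: measure peak RSS at a few batch sizes and estimate a base cost plus a per-item cost. A minimal sketch follows; the measurements and the resulting figures are illustrative placeholders, not numbers taken from CredSweeper.

```python
# Hypothetical sketch: estimate batch memory as base + slope * batch_size
# from a few measured peak-RSS points. Figures below are illustrative only.

def fit_line(xs, ys):
    """Ordinary least-squares fit: y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Illustrative measurements: (batch_size, peak RSS in MB)
batch_sizes = [1, 4, 16]
rss_mb = [600, 750, 1350]

base_mb, per_item_mb = fit_line(batch_sizes, rss_mb)
print(f"estimated: ~{base_mb:.0f}MB base + ~{per_item_mb:.0f}MB per batch item")
```

With such a fit, the minimal memory for any batch size can be precalculated once and printed in --help instead of being probed at runtime with psutil.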
Updated beautifulsoup4 to version 4.14.3 and added striprtf version 0.0.29. Removed psutil version 6.1.0.
Added memory usage estimates and methods for batch size handling.
Reverted to original batch size usage
Enhanced help text with memory info
Implement unit tests for memory limits in ML validation.
babenek
left a comment
- --help improvements only (the Python gc is useless) - define limits
- tests
- CI in your fork first
if memory_mb < 1000:
    info_lines.append(f"  Batch size {batch_size:3d}: ~{memory_mb}MB")
else:
    info_lines.append(f"  Batch size {batch_size:3d}: ~{memory_mb//1000}GB")
except subprocess.TimeoutExpired:
    self.skipTest("ML validation timed out")
except Exception as e:
    self.skipTest(f"Could not run ML validation: {e}")
there should be at least 2 types of tests:
def test_xxx_n(self, ...): - negative. Discover a CredSweeper failure from memory consumption under a limit with a huge ml_batch_size.
def test_xxx_p(self, ...): - positive. Discover that the failure is gone when the batch is reduced and/or the memory limit is increased.
The limitations should be related to the help info.
def test_low_memory_batch_size(self):
    """Test that small batch size works under memory constraints"""
    test_file = self.test_dir / "memory_test_data.txt"
temporary files must be created in a temporary directory
def force_garbage_collection(self) -> float:
    before_mb = self.get_memory_info().process_mb
    gc.collect()
I suppose the Python gc is not helpful when onnxruntime allocates memory in native code. The MemoryMonitor class is not necessary. Only the memory consumption recommendation in the help and the tests are valuable.
parser.add_argument("--ml_batch_size",
                    "-b",
-                   help="batch size for model inference (default: 16)",
+                   help=f"batch size for model inference (default: 16)\n\n{MlValidator.get_memory_info_text()}",
Let's use a constant for the minimal required memory for the default batch size. The constant must be used in the tests for memory limitation. Try subprocess + resource (the tests may be skipped on Windows).
Description
Implemented dynamic batch sizing for ML validation to optimize memory usage and performance across different system configurations. The feature automatically adjusts batch sizes based on available memory, preventing OOM errors while maximizing throughput.
Changes Made:
This improves performance by maximizing throughput on systems with abundant memory while preventing OOM errors on constrained systems.
How has this been tested?