Description
Motivation
Here's what a CPU load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, n_workers=64), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs).
One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops below 40% after another 10 hours (by that time, 12.6k trials had finished in total).
Previous Discussion
I thought that another cause of this performance degradation might be Hyperband, and that using ASHA (#1169) instead would help rule out that hypothesis. However, after @eddiebergman's #1169 (comment), I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.
Potential solution
- train the random forest in a separate thread / process
- swap in the newly trained RF only once its training is done, similar to double buffering (see the sketch after this list)
- workers should always get configs from the currently available RF, even if a new RF is training in the background
- optionally: use an occupancy threshold, e.g., 90%, and let worker threads wait for training to finish only while the percentage of workers idling for the new RF version is below 10%
- optionally: add GPU support to accelerate training of the random forest
- optionally: add an option to decrease the number of workers running the target function by 1 once the RF trainer occupies a CPU core more than 50% of the time
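Here is a minimal Python sketch of the double-buffering idea. The class and method names (`DoubleBufferedSurrogate`, `maybe_retrain`, `current`) are hypothetical illustrations, not SMAC's actual API; it only shows the mechanism of serving the last finished model while a new one trains in the background:

```python
import threading

class DoubleBufferedSurrogate:
    """Sketch: workers always read the most recently completed model
    ("front buffer") while a new one trains in the background."""

    def __init__(self, train_fn):
        self._train_fn = train_fn   # e.g., fits a random forest on (X, y)
        self._model = None          # currently served model
        self._lock = threading.Lock()
        self._training = False

    def maybe_retrain(self, X, y):
        """Kick off background training unless a fit is already running."""
        with self._lock:
            if self._training:
                return              # a newer model is already on its way
            self._training = True

        def _worker():
            new_model = self._train_fn(X, y)  # slow part, off the hot path
            with self._lock:
                self._model = new_model       # atomic swap ("buffer flip")
                self._training = False

        threading.Thread(target=_worker, daemon=True).start()

    def current(self):
        """Return the latest finished model; may be slightly stale."""
        with self._lock:
            return self._model
```

A worker would call `current()` to sample its next configuration, falling back to random sampling while no model has finished training yet, and `maybe_retrain` would be invoked whenever new results arrive. Note that under CPython's GIL a separate process (e.g., via `concurrent.futures.ProcessPoolExecutor`) may be preferable for a CPU-bound fit, although SMAC's random forest is implemented in C++ and may release the GIL during training.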