Avoid workers waiting for training of surrogate model to finish #1170

@bbudescu

Description

Motivation

Here's what the CPU load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, n_workers=64), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs).

[Screenshot 2024-11-23: CPU load graph of the EC2 instance (eu-west-1)]

One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by which time 12.6k trials had finished in total).

Previous Discussion

I thought another cause of this degradation in performance might be Hyperband, and that using ASHA (#1169) instead would help rule out that hypothesis. However, after @eddiebergman's comment on #1169, I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.

Potential solution

  • train the random forest in a different thread / process
  • replace the RF with a newly trained one only when training is done (similar to double buffering); see the sketch after this list
  • workers should always get configs from the currently available RF, even if a new RF is training in the background
  • optionally: use an occupancy threshold, e.g., 90%, and allow worker threads to wait for training to finish only if the percentage of workers idle waiting for the new RF version is below 10%
  • optionally: add GPU support to accelerate training of the random forest
  • optionally: perhaps add the option to decrease the number of workers running the target function by 1 once the RF trainer occupies a CPU core for more than 50% of the time
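A minimal sketch of the first three points, assuming a scikit-learn RandomForestRegressor as a stand-in for SMAC's surrogate (the class and method names below are hypothetical, not SMAC's actual API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

from sklearn.ensemble import RandomForestRegressor  # stand-in for SMAC's RF


class BackgroundSurrogate:
    """Keeps a 'published' RF for suggestions and retrains a new one off the hot path."""

    def __init__(self):
        self._model = None                            # RF currently used for suggestions
        self._lock = threading.Lock()                 # guards the pointer swap
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._training = False

    def current(self):
        # Workers always get whatever model is available right now (possibly stale).
        with self._lock:
            return self._model

    def maybe_retrain(self, X, y):
        # Start a background fit unless one is already running.
        if self._training:
            return
        self._training = True
        self._executor.submit(self._fit_and_swap, X, y)

    def _fit_and_swap(self, X, y):
        new_model = RandomForestRegressor(n_estimators=100)
        new_model.fit(X, y)                           # expensive part happens in the background
        with self._lock:                              # swap only once training is done
            self._model = new_model
        self._training = False
```

A worker would then call current() whenever it needs a new suggestion: if the result is None it falls back to random sampling, otherwise it scores candidate configs with the returned RF. The occupancy-threshold idea would add a check on the fraction of idle workers before deciding whether to block on the in-flight fit or keep using the old model.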
