Description
/kind bug
What steps did you take and what happened:
Submitting a large experiment (in this case ~14,500 trials, from 4 hyperparameters with 10 or 11 values each) causes the Suggestion custom resource to hit the size limit Kubernetes imposes on custom resources, since every suggestion is stored in this single resource. The Katib controller then reports the following error when trying to update the Suggestion custom resource: `Request entity too large`, and the experiment cannot progress. This issue seems to describe the exact problem.
Argo Workflows seems to have encountered the same problem, described here, and solved it by supporting 1) compression of the data stored in the status field of the custom resource and 2) offloading the status field to a relational database, as described here.
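Argo's first mitigation, compressing the serialized status before storing it in the custom resource, can be sketched roughly as follows. This is illustrative Python, not Katib or Argo code; the field names are hypothetical. The key observation is that a Suggestion status with thousands of trials is highly repetitive (the hyperparameter names recur in every trial), so it compresses well:

```python
import base64
import gzip
import json

def compress_status(status: dict) -> str:
    """Serialize a status dict, gzip it, and base64-encode it for storage
    in a string field of a custom resource."""
    raw = json.dumps(status, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decompress_status(blob: str) -> dict:
    """Reverse of compress_status: base64-decode, gunzip, parse JSON."""
    raw = gzip.decompress(base64.b64decode(blob.encode("ascii")))
    return json.loads(raw.decode("utf-8"))

# Hypothetical Suggestion-like status with many repetitive assignments.
status = {
    "suggestions": [
        {"name": f"trial-{i}", "assignments": [{"name": "lr", "value": "0.01"}]}
        for i in range(1000)
    ]
}
blob = compress_status(status)
assert decompress_status(blob) == status
# The compressed blob is much smaller than the plain JSON serialization.
assert len(blob) < len(json.dumps(status))
```

Compression only buys headroom, though; for arbitrarily large search spaces the second Argo mechanism (offloading status to external storage) would still be needed.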
What did you expect to happen:
I expected Katib to be able to handle search spaces of arbitrary size.
Anything else you would like to add:
A workaround is to manually split the experiment into smaller subexperiments to stay under the size limits of custom resources. Ideally, this would be solved by following an approach similar to the one Argo uses for their Workflow custom resources.
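The splitting workaround amounts to enumerating the full grid and partitioning it into chunks small enough that each subexperiment's Suggestion stays under the limit. A minimal sketch (the search space and chunk size here are made up for illustration, not taken from the failing experiment):

```python
import itertools

# Hypothetical search space; the real one has 4 hyperparameters
# with 10 or 11 values each (~14,500 combinations).
search_space = {
    "lr": [0.001, 0.01, 0.1],
    "momentum": [0.0, 0.5, 0.9],
    "batch_size": [32, 64],
}

def split_grid(space, chunk_size):
    """Yield the full Cartesian product of the search space in chunks,
    each chunk small enough to back one subexperiment."""
    names = list(space)
    chunk = []
    for combo in itertools.product(*(space[n] for n in names)):
        chunk.append(dict(zip(names, combo)))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Each chunk would be submitted as its own Experiment, so no single
# Suggestion resource ever holds the whole grid.
for i, chunk in enumerate(split_grid(search_space, chunk_size=5)):
    print(f"subexperiment-{i}: {len(chunk)} trials")
```

This keeps every individual Suggestion small, at the cost of losing a single global view of the experiment and having to aggregate results across subexperiments manually.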
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍