
Experiment stuck due to hitting Suggestion custom resource size limits #1847

Open
@nielsmeima

/kind bug

What steps did you take and what happened:
Submitting a large experiment (one that produces a large number of trials; in this case ~14500, from 4 hyperparameters with 10 or 11 values each) causes the Suggestion custom resource to hit the size limit that Kubernetes imposes on custom resources, because all suggestions are stored in that single resource. When the Katib controller then tries to update the Suggestion custom resource, it reports the error "Request entity too large" and the experiment cannot progress. This issue seems to describe the exact problem.
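To give a sense of scale, here is a rough Python sketch that counts the trials in a grid and estimates how many bytes the accumulated suggestions would take as JSON. The search space (4 hyperparameters with 11 values each) and the entry layout are assumptions for illustration only; the real Suggestion resource will serialize differently, so treat the byte count as an order-of-magnitude estimate.

```python
import itertools
import json
import math


def grid_size(space):
    """Number of trials in a full grid over the given search space."""
    return math.prod(len(values) for values in space.values())


def estimated_suggestions_bytes(space):
    """Rough JSON size of one entry per grid point (illustrative layout only)."""
    names = list(space)
    total = 0
    for combo in itertools.product(*space.values()):
        entry = {
            "parameterAssignments": [
                {"name": name, "value": str(value)}
                for name, value in zip(names, combo)
            ]
        }
        total += len(json.dumps(entry))
    return total


# Hypothetical search space: 4 hyperparameters with 11 values each.
space = {f"param{i}": list(range(11)) for i in range(4)}

print(grid_size(space))  # 14641
print(estimated_suggestions_bytes(space))
```

Because every suggestion is appended to the same resource, the estimated size grows linearly with the number of trials.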

Argo Workflows seems to have encountered the same problem (described here) and solved it by 1) allowing the data stored in the status field of the custom resource to be compressed and 2) allowing the information under the status field to be stored in a relational database, as described here.
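The compression idea can be sketched in Python as follows. This is not Argo's or Katib's code: the function names and the shape of the status dictionary are hypothetical, and gzip plus base64 is just one plausible way to pack a large status into a string field.

```python
import base64
import gzip
import json


def compress_status(status):
    """Serialize a status dict to compact JSON, gzip it, and base64-encode it."""
    raw = json.dumps(status, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decompress_status(blob):
    """Inverse of compress_status."""
    return json.loads(gzip.decompress(base64.b64decode(blob)))


# Hypothetical status holding many similar-looking suggestions.
status = {
    "suggestions": [
        {
            "name": f"trial-{i}",
            "parameterAssignments": [{"name": "lr", "value": str(i / 1000)}],
        }
        for i in range(1000)
    ]
}

blob = compress_status(status)
raw_size = len(json.dumps(status, separators=(",", ":")))
print(raw_size, len(blob))  # the compressed string is much shorter here
```

Suggestion entries are highly repetitive, which is why they compress well. Compression only postpones the limit, though, while moving the data to an external database removes it from the resource altogether.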

What did you expect to happen:
I expected Katib to be able to handle search spaces of arbitrary size.

Anything else you would like to add:
A workaround would be to manually split the experiment into smaller sub-experiments to circumvent the size limits of custom resources. Ideally, this is solved by following an approach similar to the one Argo uses for its Workflow custom resources.
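A minimal sketch of the manual workaround, assuming a grid search where one hyperparameter's values can be partitioned across sub-experiments. The helper names and the max_trials threshold are hypothetical; a safe number of trials per experiment would have to be found empirically.

```python
import math


def trial_count(space):
    """Number of trials in a full grid over the given search space."""
    return math.prod(len(values) for values in space.values())


def split_search_space(space, split_on, max_trials):
    """Partition the values of one hyperparameter so each sub-space stays under max_trials."""
    # Trials contributed by every hyperparameter except the one being split.
    other = math.prod(
        len(values) for name, values in space.items() if name != split_on
    )
    if other > max_trials:
        raise ValueError("splitting on a single hyperparameter is not enough")
    chunk = max_trials // other  # values of split_on allowed per sub-space
    values = space[split_on]
    return [
        {**space, split_on: values[i:i + chunk]}
        for i in range(0, len(values), chunk)
    ]


# Hypothetical search space: 4 hyperparameters with 11 values each.
space = {f"param{i}": list(range(11)) for i in range(4)}
sub_spaces = split_search_space(space, "param0", max_trials=3000)

print(len(sub_spaces))  # 6
print([trial_count(s) for s in sub_spaces])  # [2662, 2662, 2662, 2662, 2662, 1331]
```

Each sub-space would then be submitted as its own experiment, and the results would have to be merged by hand afterwards.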


Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
