Description
/kind feature
Describe the solution you'd like
The current metrics collector design is pull-based. Each trial gets its own metrics collector cron job, which reads the trial's pod logs, parses them, and persists the result in MySQL.
This design has some problems (kubeflow/trainer#722 (comment)). @johnugeorge proposed a push-based model to avoid them, and I have some ideas about it as well.
In my design, a push-based implementation pushes the metrics to Prometheus. We can then use custom-metrics-server to expose trial-level or job-level metrics, so Katib can get all periodic metrics from the Kubernetes master API. The early stopping services can use that API to decide whether a trial should be killed, and the UI can use the same API to show the metrics.
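To make the push side concrete, here is a minimal sketch of how a trial could push one metric sample. It assumes the push target is a Prometheus Pushgateway, which is one possible way to get pushed metrics into Prometheus and is not decided in this issue. The function names, the metric name, the `experiment` grouping label, and the gateway address are all hypothetical.

```python
import urllib.request
from urllib.parse import quote


def format_metric(name, value, labels=None):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return f"# TYPE {name} gauge\n{name}{label_str} {value}\n"


def pushgateway_url(base_url, job, grouping=None):
    """Build the Pushgateway path /metrics/job/<job>/<label>/<value>...

    Assumption: label values contain no "/" (the Pushgateway needs a
    base64 form for those, which this sketch does not handle).
    """
    url = f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    for key, val in sorted((grouping or {}).items()):
        url += f"/{quote(key, safe='')}/{quote(val, safe='')}"
    return url


def push_metric(base_url, job, name, value, grouping=None, timeout=5.0):
    """POST one sample to the gateway and return the HTTP status code."""
    body = format_metric(name, value).encode("utf-8")
    req = urllib.request.Request(
        pushgateway_url(base_url, job, grouping), data=body, method="POST"
    )
    req.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical trial and experiment names; no network call is made here.
    print(pushgateway_url("http://pushgateway:9091", "trial-1", {"experiment": "exp-a"}))
    print(format_metric("accuracy", 0.93), end="")
```

A training container would call `push_metric` at the end of each reporting interval, so nothing has to read and parse pod logs afterwards.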
TFJob and PyTorchJob can also benefit from this metrics collector, because we can use it to collect periodic metrics for them, too. The metrics would then be exposed in a Kubernetes-native way: the K8s metrics API.
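On the consumer side, a sketch of how the early stopping service or the UI might read such metrics, assuming they are served under the `custom.metrics.k8s.io/v1beta1` API group and attached to pods. The namespace, metric name, and sample payload below are hypothetical, and the quantity parser only handles plain numbers and the milli (`m`) suffix.

```python
import json


def custom_metric_path(namespace, resource, name, metric, version="v1beta1"):
    """Build the custom metrics API path for a namespaced object."""
    return (
        f"/apis/custom.metrics.k8s.io/{version}"
        f"/namespaces/{namespace}/{resource}/{name}/{metric}"
    )


def parse_quantity(quantity):
    """Convert a Kubernetes quantity string to a float.

    Only plain numbers and the milli suffix ("930m" -> 0.93) are handled;
    other suffixes are out of scope for this sketch.
    """
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def metric_values(payload):
    """Map each described object name to its metric value."""
    doc = json.loads(payload)
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in doc.get("items", [])
    }


if __name__ == "__main__":
    # Hypothetical response body shaped like a MetricValueList.
    sample = json.dumps({
        "kind": "MetricValueList",
        "items": [
            {"describedObject": {"kind": "Pod", "name": "trial-1-worker-0"},
             "metricName": "accuracy", "value": "930m"},
        ],
    })
    print(custom_metric_path("kubeflow", "pods", "*", "accuracy"))
    print(metric_values(sample))
```

The path can be checked by hand with `kubectl get --raw <path>` once a custom metrics adapter is installed in the cluster.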
Anything else you would like to add: