Description & Motivation
I am using a private MLFlow server which has very high latency. As a result, my training process is stuck waiting on logging 99% of the time.
In my mind, this should be easy to fix: simply run the logging in a background thread, in parallel with the trainer, so that training can continue while the logging finishes.
Pitch
Add a trainer flag "log_in_parallel" that detaches logging from the training process and lets training continue before the logging has finished.
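To illustrate the idea, here is a rough sketch of the kind of wrapper I mean. The class name `AsyncLoggerWrapper` is hypothetical, not an existing Lightning API; a real implementation would presumably subclass `lightning.pytorch.loggers.Logger` and handle errors and backpressure properly.

```python
import queue
import threading


class AsyncLoggerWrapper:
    """Forwards log_metrics calls to a background thread so the training loop never blocks."""

    def __init__(self, logger):
        self._logger = logger          # e.g. a slow MLFlowLogger instance
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Background thread: pushes queued metrics to the (slow) tracking server.
        while True:
            item = self._queue.get()
            if item is None:           # sentinel inserted by finalize()
                break
            metrics, step = item
            self._logger.log_metrics(metrics, step)

    def log_metrics(self, metrics, step=None):
        # Called from the training loop; returns immediately.
        self._queue.put((metrics, step))

    def finalize(self, status):
        # Flush everything that is still queued, then close the wrapped logger.
        self._queue.put(None)
        self._worker.join()
        self._logger.finalize(status)

    def __getattr__(self, name):
        # Delegate everything else (name, version, log_hyperparams, ...) to the wrapped logger.
        return getattr(self._logger, name)
```

Usage would then be something like `Trainer(logger=AsyncLoggerWrapper(MLFlowLogger(...)))`, assuming the Trainer accepts such a duck-typed wrapper.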
Alternatives
An alternative would be to buffer all logging data offline and only synchronize it with the server at predetermined intervals, such as every 10 epochs.
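A rough sketch of that variant, again purely for illustration (`BufferedLoggerWrapper` is a hypothetical name; for MLFlow specifically, the flush could likely be collapsed into far fewer requests via MlflowClient.log_batch):

```python
class BufferedLoggerWrapper:
    """Buffers metrics locally and only talks to the server every `flush_every` calls."""

    def __init__(self, logger, flush_every=100):
        self._logger = logger
        self._flush_every = flush_every
        self._buffer = []

    def log_metrics(self, metrics, step=None):
        self._buffer.append((metrics, step))
        if len(self._buffer) >= self._flush_every:
            self.flush()

    def flush(self):
        # All the slow round-trips happen here, in one burst, instead of on every step.
        for metrics, step in self._buffer:
            self._logger.log_metrics(metrics, step)
        self._buffer.clear()

    def finalize(self, status):
        self.flush()
        self._logger.finalize(status)

    def __getattr__(self, name):
        return getattr(self._logger, name)
```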
Additional context
99% is only a slight hyperbole; this is what my GPU usage looks like:
[screenshot of GPU utilization]
cc @Borda
In my opinion, this should be discussed and addressed by MLFlow directly; other logging frameworks already do this. It is not something that the training framework (in this case Lightning) should implement. The goal of Lightning is only to provide simple wrappers around the loggers to standardize the interface. If a feature is present in one logger and not in another, that is something the user has to consider when choosing the integration.