-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Request] Collect metrics in a fixed interval for the lifespan of a training job #47
Comments
@classicboyir Hi, thanks for the feedback.
I think this would be a good use case and I would like to add this into import time
import threading
from nvitop import ResourceMetricCollector
def collect_in_background(
on_collect,
collector=None,
interval=None,
*,
on_start=None,
on_stop=None,
tag='metrics-daemon',
start=True,
):
if collector is None:
collector = ResourceMetricCollector()
if interval is None:
interval = collector.interval
interval = min(interval, collector.interval)
def target():
if on_start is not None:
on_start(collector)
try:
with collector(tag):
try:
while on_collect(collector.collect()):
time.sleep(interval)
except KeyboardInterrupt:
pass
finally:
if on_stop is not None:
on_stop(collector)
daemon = threading.Thread(target=target, daemon=True)
if start:
daemon.start()
return daemon def main():
logger = ...
def on_collect(metrics):
if logger.is_closed(): # closed manually by user
return False
logger.log(metrics)
return True
def on_stop(collector):
if not logger.is_closed():
logger.close() # cleanup
background_collector = ResourceMetricCollector()
collect_in_background(on_collect, background_collector, interval=5.0, on_stop=on_stop)
# Use a separate collector for foreground
# otherwise it will mess with the 'metrics-daemon' tag
foreground_collector = ResourceMetricCollector()
for epoch in range(100):
with foreground_collector('epoch'):
# Do something
for batch in range(100):
with foreground_collector('batch'):
# Do something
pass You can define a lst = []
def on_collect(metrics):
lst.append(metrics)
return True |
Love it, thanks for the quick response and look forward to seeing it being natively supported. |
@classicboyir Hi, I create a PR #48 to resolve this. Could you try: pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector-daemon and share some user experiences. Then we can get it to merge and release. Thanks! |
thanks for the update, @XuehaiPan. I think it'd be better to define collect_in_background as a member of ResourceMetricCollector class and you call it like this: (and use something like collector = ResourceMetricCollector(interval=5.0) Instead of passing a ResourceMetricCollector object, it uses self as the collector and might just need these parameters in the begin_collecting_in_background function: def begin_collecting_in_background(
on_collect,
on_start=None,
on_stop=None,
tag='') -> threading.Thread: And you don't need the |
@classicboyir Thanks for the advice, I add a new shortcut method from nvitop import ResourceMetricCollector
collector = ResourceMetricCollector(...)
collector.daemonize(on_collect_fn, interval=inteval, on_start=on_start, on_stop=on_stop) it is equivalent to: from nvitop import ResourceMetricCollector, collect_in_background
collector = ResourceMetricCollector(...)
collect_in_background(on_collect_fn, collector, interval=inteval, on_start=on_start, on_stop=on_stop) but has fewer imports.
As for the parameter For the |
This feature is included in |
Thanks @XuehaiPan for adding this feature promptly. |
@classicboyir You can let the |
Hi @XuehaiPan,
In your examples to collect metrics using
ResourceMetricCollector
inside a training loop, the collector.collect(), collects a snapshot at each epoch/batch loop which misses the the entire period between the previous and current loop.If a loop takes 5 minutes, we have the metrics at 5 minutes interval.
I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?
Therefore if the entire job took 1hr, with the 5 sec interval, we collect 720 snapshots.
Thanks
The text was updated successfully, but these errors were encountered: