[Feature Request] Collect metrics in a fixed interval for the lifespan of a training job #47

Closed
hosseinsarshar opened this issue Nov 12, 2022 · 8 comments · Fixed by #48
Labels: api (Something related to the core APIs), enhancement (New feature or request)

Comments

@hosseinsarshar

hosseinsarshar commented Nov 12, 2022

Hi @XuehaiPan,

In your examples that collect metrics using ResourceMetricCollector inside a training loop, collector.collect() takes a snapshot at each epoch/batch iteration, which misses the entire period between the previous and current iteration.
If an iteration takes 5 minutes, we only get metrics at 5-minute intervals.

I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?

That way, if the entire job took 1 hour, a 5-second interval would give us 720 snapshots.

Thanks

@XuehaiPan
Owner

@classicboyir Hi, thanks for the feedback.

I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?

I think this is a good use case and I would like to add it to nvitop. It can be achieved by running the collector in a separate thread with a callback function, like:

import time
import threading

from nvitop import ResourceMetricCollector


def collect_in_background(
    on_collect,
    collector=None,
    interval=None,
    *,
    on_start=None,
    on_stop=None,
    tag='metrics-daemon',
    start=True,
):
    if collector is None:
        collector = ResourceMetricCollector()
    if interval is None:
        interval = collector.interval
    interval = min(interval, collector.interval)

    def target():
        if on_start is not None:
            on_start(collector)
        try:
            with collector(tag):
                try:
                    while on_collect(collector.collect()):
                        time.sleep(interval)
                except KeyboardInterrupt:
                    pass
        finally:
            if on_stop is not None:
                on_stop(collector)

    daemon = threading.Thread(target=target, daemon=True)
    if start:
        daemon.start()
    return daemon


def main():
    logger = ...

    def on_collect(metrics):
        if logger.is_closed():  # closed manually by user
            return False
        logger.log(metrics)
        return True

    def on_stop(collector):
        if not logger.is_closed():
            logger.close()  # cleanup

    background_collector = ResourceMetricCollector()
    collect_in_background(on_collect, background_collector, interval=5.0, on_stop=on_stop)

    # Use a separate collector for foreground
    # otherwise it will mess with the 'metrics-daemon' tag
    foreground_collector = ResourceMetricCollector()

    for epoch in range(100):
        with foreground_collector('epoch'):
            # Do something
            for batch in range(100):
                with foreground_collector('batch'):
                    # Do something
                    pass

You can define your own on_collect callback, for example to log the result to a logger, or just append it to a list:

lst = [] 

def on_collect(metrics):
    lst.append(metrics)
    return True
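
Once the snapshots are in a list, they can be summarized after the job ends. The following is a minimal sketch, assuming each snapshot is a flat dict mapping metric names to numbers. The `summarize` helper and the metric names are hypothetical and used only for illustration; they are not part of nvitop.

```python
from statistics import mean


def summarize(snapshots):
    """Return per-metric min/mean/max over a list of snapshot dicts."""
    summary = {}
    # Gather every metric name that appears in at least one snapshot.
    names = sorted({name for snap in snapshots for name in snap})
    for name in names:
        # Ignore snapshots that do not contain this metric.
        values = [snap[name] for snap in snapshots if name in snap]
        summary[name] = {'min': min(values), 'mean': mean(values), 'max': max(values)}
    return summary


# Hypothetical snapshots standing in for what on_collect appended to lst.
lst = [
    {'gpu_utilization': 40.0, 'memory_used_mib': 1000.0},
    {'gpu_utilization': 60.0, 'memory_used_mib': 1500.0},
    {'gpu_utilization': 80.0, 'memory_used_mib': 2000.0},
]
stats = summarize(lst)
```

Keeping the raw list and summarizing afterwards keeps the callback cheap, which matters because it runs on every collection tick.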

@XuehaiPan XuehaiPan added enhancement New feature or request api Something related to the core APIs labels Nov 12, 2022
@hosseinsarshar
Author

Love it, thanks for the quick response. I look forward to seeing it natively supported.

@hosseinsarshar hosseinsarshar changed the title from "[Question/Feature Request] Collect metrics in a fixed internal for the lifespan of a training job" to "[Question/Feature Request] Collect metrics in a fixed interval for the lifespan of a training job" Nov 13, 2022
@XuehaiPan XuehaiPan changed the title from "[Question/Feature Request] Collect metrics in a fixed interval for the lifespan of a training job" to "[Feature Request] Collect metrics in a fixed interval for the lifespan of a training job" Nov 17, 2022
@XuehaiPan
Owner

@classicboyir Hi, I created PR #48 to resolve this. Could you try:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector-daemon

and share your experience with it? Then we can get it merged and released. Thanks!

@hosseinsarshar
Author

hosseinsarshar commented Nov 18, 2022

Thanks for the update, @XuehaiPan.
I gave this a try; I love it and it works as expected. I do have a suggestion on the design of the method.

I think it'd be better to define collect_in_background as a member of the ResourceMetricCollector class, with a name like begin_collecting_in_background, and call it like this:

collector = ResourceMetricCollector(interval=5.0)
daemon = collector.begin_collecting_in_background(on_collect, on_stop=on_stop)

Instead of taking a ResourceMetricCollector object as an argument, it would use self as the collector, so begin_collecting_in_background might only need these parameters:

def begin_collecting_in_background(
    self,  # the collector instance replaces the collector argument
    on_collect,
    on_start=None,
    on_stop=None,
    tag='',
) -> threading.Thread:
    ...

And you don't need the start parameter as when you call the begin_collecting_in_background function the intention is to start the background thread. Similarly, interval could be eliminated as it grabs the interval parameter of the ResourceMetricCollector class. Finally, it would return the daemon thread object so that the client can manage (and stop) the thread.

@XuehaiPan
Owner

@classicboyir Thanks for the advice. I added a new shortcut method daemonize to the ResourceMetricCollector class:

from nvitop import ResourceMetricCollector

collector = ResourceMetricCollector(...)
collector.daemonize(on_collect_fn, interval=interval, on_start=on_start, on_stop=on_stop)

It is equivalent to:

from nvitop import ResourceMetricCollector, collect_in_background

collector = ResourceMetricCollector(...)
collect_in_background(on_collect_fn, collector, interval=interval, on_start=on_start, on_stop=on_stop)

but has fewer imports.
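
The relationship between the two forms can be sketched as a method that simply forwards self to the function. This is an illustration only, not nvitop's actual implementation: `SketchCollector` and `collect_in_background_sketch` are hypothetical stand-ins whose collect() just counts calls, so the example runs without a GPU.

```python
import threading
import time


def collect_in_background_sketch(on_collect, collector, interval=None):
    """Stand-in for the function form: poll until on_collect returns False."""
    if interval is None:
        interval = collector.interval  # fall back to the collector's own interval

    def target():
        while on_collect(collector.collect()):
            time.sleep(interval)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread


class SketchCollector:
    """Hypothetical collector; collect() only counts how often it was called."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.calls = 0

    def collect(self):
        self.calls += 1
        return {'calls': self.calls}

    def daemonize(self, on_collect, interval=None):
        # The method form only forwards self to the function form.
        return collect_in_background_sketch(on_collect, self, interval=interval)


seen = []


def on_collect(metrics):
    seen.append(metrics)
    return len(seen) < 3  # stop after three snapshots


thread = SketchCollector().daemonize(on_collect)
thread.join(timeout=5.0)
```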


And you don't need the start parameter as when you call the begin_collecting_in_background function the intention is to start the background thread. Similarly, interval could be eliminated as it grabs the interval parameter of the ResourceMetricCollector class.

As for the on_start parameter, I think the user may want to look up collector.devices or some other attributes at start-up. This method not only initializes the collector but also does some necessary work on start.

For the interval argument, if you omit it or pass interval=None, it will use collector.interval.

@XuehaiPan
Owner

This feature is included in nvitop 0.10.2.

@hosseinsarshar
Author

Thanks @XuehaiPan for adding this feature promptly.
Would you also expose a function to stop the background thread when needed?

@XuehaiPan
Owner

Would you also expose a function to stop the background thread when needed?

@classicboyir You can let the on_collect function return False to stop the thread. Also, since the thread is a daemon thread, you can kill it anyway without breaking the main thread.
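
One way to apply the "return False" approach is to let on_collect consult a threading.Event that the main thread sets when it wants collection to end. This is a sketch under assumptions: `fake_collect` and `poll` are hypothetical stand-ins for collector.collect() and the library's polling loop, so the example runs without nvitop. With the real API, only the on_collect callback and the event would be needed.

```python
import threading
import time

stop_requested = threading.Event()
snapshots = []


def on_collect(metrics):
    snapshots.append(metrics)
    # Returning False ends the polling loop; returning True keeps it running.
    return not stop_requested.is_set()


def fake_collect():
    """Hypothetical stand-in for collector.collect()."""
    return {'timestamp': time.monotonic()}


def poll(interval):
    # Same loop shape as the collect_in_background example in this thread.
    while on_collect(fake_collect()):
        time.sleep(interval)


thread = threading.Thread(target=poll, args=(0.01,), daemon=True)
thread.start()

time.sleep(0.05)          # the "training job" runs for a while
stop_requested.set()      # ask the background thread to stop
thread.join(timeout=5.0)  # it exits after its next on_collect call
```

Because the flag is only checked when on_collect runs, the thread may take up to one interval to notice the request.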
