More changes to docs #73

Merged
merged 35 commits into from
Oct 30, 2024
3bd7f1b
Update launcher.py
pmcurtin Oct 19, 2024
2ce3576
Merge branch 'main' into docs-2
apoorvkh Oct 19, 2024
2731d31
Merge branch 'main' into docs-2
apoorvkh Oct 20, 2024
0b9c1df
moved log_handlers into .run()
apoorvkh Oct 20, 2024
af8c829
update contributing
apoorvkh Oct 20, 2024
4ac384e
add tyro, remove setuptools from extras
apoorvkh Oct 20, 2024
cbf40b9
enabled linting for docs; clarified public/private functions
apoorvkh Oct 20, 2024
76aa20f
docs for utils.py
apoorvkh Oct 20, 2024
de93aaf
docs for logging_utils
apoorvkh Oct 20, 2024
e4977fd
Merge branch 'docs-2' of github.com:apoorvkh/torchrunx into worker-ex…
apoorvkh Oct 20, 2024
e697257
advanced docs
apoorvkh Oct 20, 2024
748c2b7
adding napoleon for google docs
apoorvkh Oct 21, 2024
24f4a98
linkcode
apoorvkh Oct 21, 2024
cb6620c
update linkcode
apoorvkh Oct 21, 2024
3eb297c
try again
apoorvkh Oct 21, 2024
e609f54
fix?
apoorvkh Oct 21, 2024
e88e320
now linkcode works
apoorvkh Oct 21, 2024
bef8b28
updates
apoorvkh Oct 21, 2024
86bb67b
automethod run for launcher
apoorvkh Oct 21, 2024
d80d822
maximum_signature_line_length
apoorvkh Oct 21, 2024
9950e96
switch to members?
apoorvkh Oct 21, 2024
8276abc
Merge branch 'main' of github.com:apoorvkh/torchrunx into docs-2
apoorvkh Oct 29, 2024
f335140
created utils/
apoorvkh Oct 29, 2024
0b5e316
moved functions to worker.py
apoorvkh Oct 29, 2024
084061f
renamed to worker_entrypoint
apoorvkh Oct 29, 2024
6cc9311
completed docs for utils
apoorvkh Oct 29, 2024
490f2a8
more launcher docs
apoorvkh Oct 29, 2024
e54a533
more updates to docs
apoorvkh Oct 29, 2024
455c3f3
switched LaunchResult to get
apoorvkh Oct 29, 2024
f967218
bump hash in pixi lock
apoorvkh Oct 29, 2024
3a68eb6
removed overloading from LaunchResult
apoorvkh Oct 29, 2024
9e2d5f4
update all docs
apoorvkh Oct 30, 2024
a29212e
fix
apoorvkh Oct 30, 2024
7bf9222
small edits
apoorvkh Oct 30, 2024
122febc
how it works
apoorvkh Oct 30, 2024
automethod run for launcher
apoorvkh committed Oct 21, 2024
commit 86bb67b0e1674bf40331000ef0ecce049cb565c6
31 changes: 17 additions & 14 deletions docs/source/advanced.rst
@@ -28,22 +28,19 @@ We could also launch multiple functions (e.g. train on many GPUs, test on one GP

:mod:`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.


SLURM integration
-----------------

By default, the ``hostnames`` or ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (``localhost``) with N workers (num. GPUs or CPUs).
Raises a ``RuntimeError`` if ``hostnames`` or ``workers_per_host`` are intentionally set to ``"slurm"`` but no allocation is detected.

CLI support
-----------
Launcher class
--------------

We provide the :mod:`torchrunx.Launcher` class as an alias to :mod:`torchrunx.launch`.

.. autoclass:: torchrunx.Launcher
   :members: run
   :single-line-parameter-list:
   .. automethod:: run

We can use this class to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):
CLI integration
^^^^^^^^^^^^^^^

We can use :mod:`torchrunx.Launcher` to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):

.. code:: python

@@ -80,22 +77,28 @@ We can use this class to populate arguments from the CLI (e.g. with `tyro <https
│ (default: None) │
╰───────────────────────────────────────────────────────╯

Propagating Exceptions
SLURM integration
-----------------

By default, the ``hostnames`` or ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (localhost) with N workers (num. GPUs or CPUs).
A ``RuntimeError`` is raised if ``hostnames="slurm"`` or ``workers_per_host="slurm"`` but no allocation is detected.
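
The fallback behaviour described above could be sketched roughly as follows. This is not the library's implementation: the helper names, the ``"auto"`` default value, and the use of ``SLURM_JOB_ID`` to detect an allocation are assumptions made for illustration, and the node list parsing only handles a plain comma-separated list.

```python
import os


def in_slurm_allocation(environ):
    # Assumption: an allocation is detected via SLURM_JOB_ID,
    # which SLURM sets inside a job.
    return "SLURM_JOB_ID" in environ


def resolve_hostnames(hostnames, environ=None):
    """Hypothetical helper: pick hostnames from SLURM, else fall back to localhost."""
    environ = os.environ if environ is None else environ
    allocated = in_slurm_allocation(environ)

    # Explicitly requesting SLURM without an allocation is an error.
    if hostnames == "slurm" and not allocated:
        raise RuntimeError("hostnames='slurm' but no SLURM allocation was detected")

    if hostnames in ("auto", "slurm"):  # "auto" is an assumed default value
        if allocated:
            # Simplification: real node lists may be compressed (e.g. node[01-04])
            # and would need proper expansion.
            return environ["SLURM_JOB_NODELIST"].split(",")
        return ["localhost"]

    # Otherwise the caller supplied an explicit list of hostnames.
    return list(hostnames)
```

Passing the environment as an argument keeps the sketch easy to exercise without a real SLURM job.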

Propagating exceptions
----------------------

Exceptions that are raised in Workers will be raised by the launcher process.

A :mod:`torchrunx.AgentKilledError` will be raised if any agent dies unexpectedly (e.g. if force-killed by the OS, due to segmentation faults or OOM).

Environment Variables
Environment variables
---------------------

Environment variables in the launcher process that match the ``default_env_vars`` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variables are pattern-matched with this list using ``fnmatch``.

``default_env_vars`` can be overridden if desired. This list can be augmented using ``extra_env_vars``. Additional environment variables (and more custom bash logic) can be included via the ``env_file`` argument. Our agents ``source`` this file.
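
As a rough illustration of the pattern matching described above, the following sketch filters an environment mapping against a list of ``fnmatch`` patterns. The function name, the example patterns, and the example variables are hypothetical and are not the library's actual defaults. ``fnmatchcase`` is used here so that matching is case-sensitive on every platform.

```python
import fnmatch


def select_env_vars(environ, patterns):
    """Return the variables whose names match at least one fnmatch pattern."""
    return {
        name: value
        for name, value in environ.items()
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    }


# Hypothetical patterns and environment, for illustration only.
patterns = ["PATH", "TORCH*", "NCCL*"]
environ = {
    "PATH": "/usr/bin",
    "TORCH_HOME": "/tmp/torch",
    "NCCL_DEBUG": "INFO",
    "EDITOR": "vim",
}

# Only PATH, TORCH_HOME and NCCL_DEBUG match; EDITOR is left out.
selected = select_env_vars(environ, patterns)
```

Augmenting the defaults with extra patterns then amounts to concatenating two lists before filtering.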


Custom Logging
Custom logging
--------------

We forward all logs (i.e. from ``logging`` and ``stdio``) from workers and agents to the Launcher. By default, the logs from the first agent and its first worker are printed into the Launcher's ``stdout`` stream. Logs from all agents and workers are written to files in ``$TORCHRUNX_LOG_DIR`` (default: ``./torchrunx_logs``) and are named by timestamp, hostname, and local_rank.
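
To make the file layout concrete, here is a minimal sketch of how a per-worker log path might be derived. Only the ``TORCHRUNX_LOG_DIR`` variable and its default come from the text above; the helper name and the exact file name format are assumptions, and the library's real naming may differ.

```python
import os
from pathlib import Path


def log_file_path(timestamp, hostname, local_rank, environ=None):
    """Hypothetical helper: build a log path from timestamp, hostname and local rank."""
    environ = os.environ if environ is None else environ
    # Fall back to the documented default directory when the variable is unset.
    log_dir = Path(environ.get("TORCHRUNX_LOG_DIR", "./torchrunx_logs"))
    # Assumed file name format, for illustration only.
    return log_dir / f"{timestamp}-{hostname}-{local_rank}.log"
```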