
Tags: instructlab/training


v0.14.1


Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix _no_split_modules subscript error for transformers v5 (#683)

v0.14.0

Verified
Add MLflow support and expose logging configuration in TrainingArgs (#680)

* add support for mlflow

* fix formatting changes

* Add tensorboard_log_dir to TrainingArgs for configurable TensorBoard logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.
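The fallback described above can be sketched as follows. This is only an illustration, not the project's code: `LoggingArgs` and `resolve_tensorboard_dir` are hypothetical names standing in for `TrainingArgs` and whatever `setup_metric_logger` does internally, and treating an empty string as "not specified" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggingArgs:
    """Hypothetical stand-in for the two relevant TrainingArgs fields."""

    output_dir: str
    tensorboard_log_dir: Optional[str] = None


def resolve_tensorboard_dir(args: LoggingArgs) -> str:
    # Use the custom directory when one is given; otherwise fall back to
    # output_dir. An empty string is treated as "not specified" (assumption).
    return args.tensorboard_log_dir or args.output_dir
```

With this sketch, `resolve_tensorboard_dir(LoggingArgs("out"))` yields `"out"`, while passing `tensorboard_log_dir="tb"` yields `"tb"`.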

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Address PR review feedback

- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler
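One way to picture the last bullet (skipping non-numeric metrics with a debug message) is the sketch below. It does not reproduce `MLflowHandler`; the function name `split_numeric_metrics` is hypothetical, and excluding booleans from the numeric set is an assumption made for the example.

```python
import logging
from numbers import Real

logger = logging.getLogger(__name__)


def split_numeric_metrics(metrics: dict) -> dict:
    """Return only the numeric entries, logging the skipped ones at DEBUG."""
    numeric = {}
    for key, value in metrics.items():
        # bool is a subclass of int; it is treated as non-numeric here
        # (an assumption for this sketch).
        if isinstance(value, Real) and not isinstance(value, bool):
            numeric[key] = float(value)
        else:
            logger.debug(
                "Skipping non-numeric metric %r (type %s)",
                key,
                type(value).__name__,
            )
    return numeric
```

A metrics backend that accepts only floats could then be fed the returned dictionary, while the debug log records what was dropped.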

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* removes generic `run_name` and `logger_type` kwargs

* review comments

* something something mlflow active runs

* review comments

* coderabbit

* adds install targets for logging backends

* add targets for loggers

* messaging

* comments

* interim changes

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

v0.13.0

Verified
Exposes API for processing pretraining data (#672)

This commit enables the data processing code to create pre-training style datasets. The training loop is also updated to ingest pretraining-style datasets, where documents are chunked by some `block_size` and the chunks are then treated as independent and fully-unmasked samples.
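A minimal sketch of the chunking idea, assuming documents are already tokenized into lists of token IDs. The function name `chunk_documents`, the `input_ids`/`labels` keys, and the handling of a short final chunk are assumptions for illustration, not the library's actual API. "Fully unmasked" is modeled here by making the labels identical to the inputs, so every token would contribute to the loss.

```python
from typing import Iterable


def chunk_documents(docs: Iterable[list[int]], block_size: int) -> list[dict]:
    """Split each tokenized document into independent block_size chunks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    samples = []
    for token_ids in docs:
        for start in range(0, len(token_ids), block_size):
            chunk = token_ids[start : start + block_size]
            # Labels equal the inputs: no position is masked out.
            samples.append({"input_ids": chunk, "labels": list(chunk)})
    return samples
```

In this sketch a five-token document with `block_size=2` produces three samples, the last holding a single token. Whether the real implementation keeps, pads, or drops such a tail is not stated in the commit message.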

v0.12.1

Verified
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

The command generation logic is updated to dynamically
build the torchrun command, excluding arguments that
are empty or None. This prevents them from overriding
environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is
made for integer arguments where 0 is a valid value.

Additionally, the nproc_per_node argument type has been
changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88
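The command-building behavior described in this commit message could look roughly like the sketch below. `build_torchrun_command` is a hypothetical helper, not the function in `main_ds.py`. It drops `None` and empty-string values, keeps integer `0`, and converts underscores to hyphens in flag names, as a later bullet in this same commit describes.

```python
def build_torchrun_command(args: dict, script: str) -> list[str]:
    """Build a torchrun argv list, omitting arguments that are unset."""
    command = ["torchrun"]
    for name, value in args.items():
        # Skip None and empty strings so torchrun can fall back to its
        # environment variables. Integer 0 is not skipped, since it is a
        # valid value (for example, a node rank).
        if value is None or value == "":
            continue
        flag = name.replace("_", "-")
        command.append(f"--{flag}={value}")
    command.append(script)
    return command
```

For instance, passing `{"nproc_per_node": "gpu", "node_rank": 0, "rdzv_endpoint": "", "master_addr": None}` with `"train.py"` gives `["torchrun", "--nproc-per-node=gpu", "--node-rank=0", "train.py"]`.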

Signed-off-by: Saad Zaher <szaher@redhat.com>

* only dynamically add torchrun args & change rdzv_id type to str

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix smoke tests

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Enable both dtypes str, int for nproc_per_node, rdzv_id

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Use python3.11 style for pydantic model

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add all torchrun args and validate them

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Remove non-required dependencies

Signed-off-by: Saad Zaher <szaher@redhat.com>

* update datatypes only

Signed-off-by: Saad Zaher <szaher@redhat.com>

* replace _ with - when passing torchrun args

Signed-off-by: Saad Zaher <szaher@redhat.com>

* make nproc_per_node accept only gpu or int

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add master_{addr, port} validate args

Signed-off-by: Saad Zaher <szaher@redhat.com>

* check for not set or empty rdzv endpoint

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix formatting error

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update src/instructlab/training/config.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update tests/smoke/test_train.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update src/instructlab/training/main_ds.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fixes indentation

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

* formatting

* add standalone as the fallback when neither master_addr nor rdzv_endpoint is provided

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

* clarify rdzv-backend arg

---------

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

v0.12.0

Verified
Add kernels>0.9.0 to CUDA requirements (#658)

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

v0.11.1

Fix isort errors

v0.10.4

Verified
Merge pull request #634 from instructlab/mergify/bp/release-v0.10/pr-628

uncap accelerate in `requirements-cuda.txt` (backport #628)

v0.10.3

Verified
Merge pull request #546 from instructlab/mergify/bp/release-v0.10/pr-455

moves deepspeed requirements into their own file; add deepspeed extras (backport #455)

v0.11

Verified
Merge pull request #528 from fynnsu/pylint-unused-argument

Enable pylint 'unused-argument' check

v0.10.2

Verified
Merge pull request #518 from instructlab/mergify/bp/release-v0.10/pr-517

deps: Remove caps on ROCm dependencies (backport #517)