Tags: instructlab/training
Add MLflow support and expose logging configuration in TrainingArgs (#680)

* add support for mlflow
* fix formatting changes
* Add tensorboard_log_dir to TrainingArgs for configurable TensorBoard logging
  - Add tensorboard_log_dir field to TrainingArgs in config.py
  - Update setup_metric_logger to use tensorboard_log_dir when provided
  - Add CLI argument for tensorboard_log_dir
  - Wire tensorboard_log_dir through run_training() to subprocess command
  This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.
* Address PR review feedback
  - Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
  - Remove unused log_dir parameter from MLflowHandler
  - Add debug logging for non-numeric metrics skipped by MLflowHandler
* removes generic `run_name` and `logger_type` kwargs
* review comments
* something something mlflow active runs
* review comments
* coderabbit
* adds install targets for logging backends
* add targets for loggers
* messaging
* comments
* interim changes

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
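The logging behaviour described in this commit message can be sketched roughly as follows. This is an illustration only, not the repository's code: `LoggingArgs`, `resolve_tensorboard_dir`, and `numeric_metrics` are hypothetical names, and treating booleans as non-numeric is an assumption made here.

```python
import logging
from dataclasses import dataclass
from numbers import Real
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class LoggingArgs:
    """Hypothetical stand-in for the logging-related fields of TrainingArgs."""

    output_dir: str
    tensorboard_log_dir: Optional[str] = None


def resolve_tensorboard_dir(args: LoggingArgs) -> str:
    """Use tensorboard_log_dir when provided, otherwise fall back to output_dir."""
    return args.tensorboard_log_dir or args.output_dir


def numeric_metrics(metrics: dict[str, Any]) -> dict[str, float]:
    """Keep numeric metrics and log the skipped ones at debug level."""
    kept: dict[str, float] = {}
    for name, value in metrics.items():
        # bool is a subclass of int; excluding it is an assumption of this sketch.
        if isinstance(value, Real) and not isinstance(value, bool):
            kept[name] = float(value)
        else:
            logger.debug("skipping non-numeric metric %s=%r", name, value)
    return kept
```

With this sketch, `resolve_tensorboard_dir(LoggingArgs(output_dir="out"))` falls back to `"out"`, and a metric such as `"phase": "train"` is dropped before anything is handed to a metrics backend.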
Exposes API for processing pretraining data (#672)

This commit enables the data processing code to create pretraining-style datasets. The training loop is also updated to ingest these datasets: documents are chunked by `block_size`, and each chunk is then treated as an independent, fully unmasked sample.
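As a rough illustration of the chunking described above, the sketch below splits one tokenized document into blocks and makes every block fully unmasked by mirroring the inputs in the labels. The function name `chunk_document`, the `input_ids`/`labels` field names, and the choice to keep a shorter final chunk are assumptions of this sketch, not details confirmed by the commit.

```python
def chunk_document(token_ids: list[int], block_size: int) -> list[dict[str, list[int]]]:
    """Split one tokenized document into independent, fully unmasked samples."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    samples = []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start : start + block_size]
        # Fully unmasked: labels mirror the inputs, so every token
        # in the chunk contributes to the loss.
        samples.append({"input_ids": chunk, "labels": list(chunk)})
    return samples
```

For a five-token document and `block_size=2`, this yields three samples, the last holding a single token. A real pipeline might instead pad or drop that remainder.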
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

  The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, ensuring that torchrun can correctly inherit its configuration. An exception is made for integer arguments where 0 is a valid value.

  Additionally, the nproc_per_node argument type has been changed from int to str to support special values accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

  Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88
* only dynamically add torchrun args & change rdzv_id type to str
* fix smoke tests
* Enable both dtypes str, int for nproc_per_node, rdzv_id
* Use python3.11 style for pydantic model
* add all torchrun args and validate them
* Remove non-required dependencies
* update datatypes only
* replace _ with - when passing torchrun args
* make nproc_per_node only accept gpu or int
* add master_{addr, port} validate args
* check for not set or empty rdzv endpoint
* fix formatting error
* Update src/instructlab/training/config.py
* Update tests/smoke/test_train.py
* Update src/instructlab/training/main_ds.py
* fixes indentation
* formatting
* add standalone as the fallback when neither master_addr nor rdzv_endpoint is provided
* clarify rdzv-backend arg

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
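The argument-omission rule in this commit message might look something like the sketch below. It is a minimal illustration under assumptions: `build_torchrun_command` is a hypothetical helper, not the function in main_ds.py, and the standalone fallback and argument validation mentioned above are not modeled.

```python
from typing import Any


def build_torchrun_command(args: dict[str, Any], script: str) -> list[str]:
    """Build a torchrun command line, leaving out unset or empty arguments."""
    command = ["torchrun"]
    for name, value in args.items():
        # Skip None and empty strings so torchrun can fall back to its
        # environment variables. An integer 0 is neither, so it is kept.
        if value is None or value == "":
            continue
        # Pass the option with hyphens in place of underscores.
        command.append(f"--{name.replace('_', '-')}={value}")
    command.append(script)
    return command
```

For example, passing `{"nproc_per_node": "gpu", "node_rank": 0, "rdzv_endpoint": "", "rdzv_id": None}` keeps `--nproc-per-node=gpu` and `--node-rank=0` and drops the two unset rendezvous options.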