What's Changed
- Fixed a bug in DS3 for A100/H100 nodes
- Added support for S3 datapath
- Changed the LR scheduler to cosine (credit: @SmerkyG)
- Changed the step calculation (this may affect your training scripts/templates)
- Several bug fixes (credit: @SmerkyG)
Full Changelog: v2.2.1...v2.3.0
Example of an S3 datapath config:
data:
  # Skip the datapath setup
  #
  # Ignored if using preload_datapath.py; useful for speeding up trainer startup,
  # provided you have all your datasets properly preinitialized
  # ---
  skip_datapath_setup: True

  # data_path for the prebuilt dataset, loaded using HF `load_from_disk()`
  #
  # Use this if you have built your own dataset and saved it with `save_to_disk()`,
  # with source left as null. Otherwise, configure this as the directory in which the
  # dataset will be built and tokenized by the HuggingFace datasets process.
  #
  # If a relative path is used, it is resolved relative to the trainer script path
  data_path: s3://bucket-name/subpath/

  # Data path storage options, used to support cloud storage
  # via the HuggingFace datasets API. See:
  # https://huggingface.co/docs/datasets/v2.16.1/en/filesystems#amazon-s3
  #
  # Note: As of Jan 2023, these options have only been tested with AWS S3 and Backblaze. YMMV.
  # For S3 bucket support you will also need to install s3fs: `python3 -m pip install s3fs`
  #
  # If you want to reduce the risk of accidentally committing keys/secrets, you can use the
  # `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables instead
  #
  # The data_path should use the `s3://bucket-name/subpath` format
  # ---
  data_path_storage_options:
    key: <example S3 key>
    secret: <example S3 secret>
    endpoint_url: <example S3 endpoint>
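For reference, here is a minimal Python sketch of what the storage options map to on the HuggingFace datasets side. This is not the trainer's own code; it assumes the `datasets` and `s3fs` packages are installed, and the bucket name, key, secret, and endpoint values are placeholders matching the config above.

```python
from datasets import Dataset, load_from_disk

# Mirrors the `data_path_storage_options` block above; the dict is passed to s3fs.
# If key/secret are omitted, s3fs falls back on the AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY environment variables instead.
storage_options = {
    "key": "<example S3 key>",
    "secret": "<example S3 secret>",
    "endpoint_url": "<example S3 endpoint>",
}

# Stand-in for a dataset you have already built and tokenized locally.
dataset = Dataset.from_dict({"text": ["hello world", "second document"]})

# Save it to the bucket; this is the artifact that data_path points at.
dataset.save_to_disk("s3://bucket-name/subpath/", storage_options=storage_options)

# The trainer (or preload_datapath.py) then reads it back with load_from_disk().
dataset = load_from_disk("s3://bucket-name/subpath/", storage_options=storage_options)
print(dataset)
```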