This repo contains the code used to produce the results in "Scaling Federated Learning for Fine-tuning of Large Language Models".
Requires the yq and docker binaries.
Logging uses Sacred with a MongoDB server. Before running, copy the .env.template file to .env and fill it out as you please. To run locally, use the run-local.sh script. This will build and run a container with the provided Sacred MongoDB credentials set as environment variables, allowing you to use the same container in e.g. GCP AI Platform.
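A hypothetical sketch of a filled-in .env (the actual variable names are defined in .env.template; the names below are illustrative placeholders only):

```sh
# Hypothetical variable names; copy the real ones from .env.template.
MONGO_DB_USER=sacred
MONGO_DB_PASSWORD=change-me
MONGO_DB_HOST=localhost:27017
```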
To run the agnews job with 4 sites locally, 2 on GPU 0 and 2 on GPU 1, run:

```sh
> JOB_NAME=$(date +%Y%m%d_%H%M%S)
> ./run-local.sh scripts/train_federated.py \
    with \
    site_ids=[0,0,1,1] \
    num_rounds=100 \
    num_local_epochs=2 \
    iid_split=True \
    task_name="agnews" \
    train_path="/path/to/agnews/train.csv" \
    test_path="/path/to/agnews/test.csv" \
    model="distilbert" \
    job_name="${JOB_NAME}" \
    checkpoints_dir="/path/to/checkpoints/${JOB_NAME}" \
    author="${USER}"
```
If you need to store to or read from a local disk, set $DATA_MOUNT_DIR in your .env. This will mount it to /workspace/data, and you'll be able to specify paths through there. Otherwise, you can also set up GCS / S3 credentials and store to or read from such paths (e.g. gs:// or s3://).
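For example, a minimal sketch (the host path is a placeholder):

```sh
# In .env: host directory to mount into the container at /workspace/data
DATA_MOUNT_DIR=/mnt/datasets

# Job parameters can then point inside the mount, e.g.:
# train_path="/workspace/data/agnews/train.csv"
```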
If the JOB_NAME matches a previous one and the checkpoints dir points to a directory with checkpoints, the latest checkpoint will be loaded and the job will continue.
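A minimal resume sketch (the job name below is a hypothetical earlier run; the remaining parameters follow the agnews example above):

```sh
> JOB_NAME=20210101_120000  # hypothetical name of the run to resume
> ./run-local.sh scripts/train_federated.py \
    with \
    job_name="${JOB_NAME}" \
    checkpoints_dir="/path/to/checkpoints/${JOB_NAME}" \
    ...
```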
We do not provide complete scripts for running across nodes, but doing so is relatively straightforward. You will need to make sure that the containers run so that they are visible to each other, and then set the environment variables the nodes need to find each other. Tools like GCP AI Platform or Kubeflow can set these up for you automatically.
Internally we refer to clients as sites. Sites are set up by specifying a list of (local) GPU IDs where the sites should run. If a GPU should host multiple sites, it should occur in that list multiple times.
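For example (using the site_ids parameter from the agnews run above):

```sh
site_ids=[0,0,1,1]  # four sites: two on GPU 0, two on GPU 1
site_ids=[0,1]      # two sites: one on GPU 0, one on GPU 1
```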
We always start from a pre-trained model and fine-tune it on the chosen task using FedAvg. The models are specified by providing model ids of pre-trained models in Huggingface Transformers. For convenience we have defined nicknames for the following models, which we used in our paper. You can specify these instead of providing the full names:
| Nickname | Full model name |
|---|---|
| albert | albert-base-v1 |
| bert | bert-base-cased |
| distilbert | distilbert-base-cased |
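For example, these two parameter settings are equivalent:

```sh
model="distilbert"             # nickname
model="distilbert-base-cased"  # full Huggingface model id
```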
The following tasks are available and specified in the paper:

* agnews
* imdb
* yelp
* spooky_author
Apart from specifying the task name, agnews, yelp and spooky_author require you to download the training and test CSV files and to specify paths to these. imdb, however, is downloaded directly through torchtext and doesn't take these parameters.
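A minimal sketch of an imdb run (remaining parameters omitted; they follow the agnews example above):

```sh
> ./run-local.sh scripts/train_federated.py \
    with \
    task_name="imdb" \
    ...
```

Note that no train_path or test_path is passed here; the dataset is fetched through torchtext.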
To use spooky_author you should download the dataset and split it into training and test subsets using the script scripts/spooky_author/split_data.py.
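A hypothetical invocation sketch (the script's actual arguments are defined in its source; the --help call below assumes a standard argparse-style interface):

```sh
# Assumption: argparse-style CLI; check the script source for the real arguments.
python scripts/spooky_author/split_data.py --help
```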
Use the additional docker-compose.debug.yaml file to run the debug configuration, which allows you to attach a debugger to the process. When running locally this can be done by setting the COMPOSE_FILE environment variable. This will open up port 5678 and wait for a Python debugger to attach before continuing.
```sh
> COMPOSE_FILE="docker-compose.yaml:docker-compose.debug.yaml" ./run-local.sh ...
```
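Alternatively, export the variable once for the shell session (a sketch using standard shell behavior):

```sh
> export COMPOSE_FILE="docker-compose.yaml:docker-compose.debug.yaml"
> ./run-local.sh ...
```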
If you found this repo useful in your research, please use the following citation for our paper "Scaling Federated Learning for Fine-tuning of Large Language Models":
```bibtex
@misc{hilmkil2021scaling,
      title={Scaling Federated Learning for Fine-tuning of Large Language Models},
      author={Agrin Hilmkil and Sebastian Callh and Matteo Barbieri and Leon René Sütfeld and Edvin Listo Zec and Olof Mogren},
      year={2021},
      eprint={2102.00875},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
This work was part of the Vinnova project Svensk Medicinskt Språkdatalabb (grant 2019-05156). Compute resources were provided by AI Sweden, CGIT and Peltarion.