[Docs] Fix some issues with Managed Jobs example. (#4361)
* [Docs] Fix some issues with Managed Jobs example.

* fix

* Fix env
concretevitamin authored Nov 15, 2024
1 parent a404e3f commit 974ff4f
Showing 1 changed file with 44 additions and 49 deletions.
93 changes: 44 additions & 49 deletions docs/source/examples/managed-jobs.rst
@@ -78,49 +78,47 @@ We can launch it with the following:
 
 .. code-block:: console
 
-  $ git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
   $ sky jobs launch -n bert-qa bert_qa.yaml
 
 .. code-block:: yaml
 
   # bert_qa.yaml
   name: bert-qa
 
   resources:
     accelerators: V100:1
-    # Use spot instances to save cost.
-    use_spot: true
+    use_spot: true  # Use spot instances to save cost.
 
-  # Assume your working directory is under `~/transformers`.
-  workdir: ~/transformers
-
-  setup: |
+  envs:
     # Fill in your wandb key: copy from https://wandb.ai/authorize
     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
     # to pass the key in the command line, during `sky jobs launch`.
-    echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+    WANDB_API_KEY:
+
+  # Assume your working directory is under `~/transformers`.
+  # To make this example work, please run the following command:
+  # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+  workdir: ~/transformers
 
+  setup: |
     pip install -e .
     cd examples/pytorch/question-answering/
     pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
     pip install wandb
 
   run: |
-    cd ./examples/pytorch/question-answering/
+    cd examples/pytorch/question-answering/
     python run_qa.py \
-    --model_name_or_path bert-base-uncased \
-    --dataset_name squad \
-    --do_train \
-    --do_eval \
-    --per_device_train_batch_size 12 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 50 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --report_to wandb
+      --model_name_or_path bert-base-uncased \
+      --dataset_name squad \
+      --do_train \
+      --do_eval \
+      --per_device_train_batch_size 12 \
+      --learning_rate 3e-5 \
+      --num_train_epochs 50 \
+      --max_seq_length 384 \
+      --doc_stride 128 \
+      --report_to wandb \
+      --output_dir /tmp/bert_qa/
 
 .. note::
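With ``WANDB_API_KEY`` now declared under ``envs:`` rather than echoed into ``~/.bashrc`` during setup, the key can be supplied at launch time. A minimal launch sketch using the ``--env`` form that the comments above mention (assuming the key is exported in your local shell):

.. code-block:: console

   $ sky jobs launch -n bert-qa --env WANDB_API_KEY=$WANDB_API_KEY bert_qa.yaml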
@@ -162,55 +160,52 @@ An End-to-End Example
 
 Below we show an `example <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/bert_qa.yaml>`_ for fine-tuning a BERT model on a question-answering task with HuggingFace.
 
 .. code-block:: yaml
-  :emphasize-lines: 13-16,42-45
+  :emphasize-lines: 8-11,41-44
 
   # bert_qa.yaml
   name: bert-qa
 
   resources:
     accelerators: V100:1
-    use_spot: true
-
-  # Assume your working directory is under `~/transformers`.
-  workdir: ~/transformers
+    use_spot: true  # Use spot instances to save cost.
 
   file_mounts:
     /checkpoint:
       name: # NOTE: Fill in your bucket name
       mode: MOUNT
 
-  setup: |
+  envs:
     # Fill in your wandb key: copy from https://wandb.ai/authorize
     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
     # to pass the key in the command line, during `sky jobs launch`.
-    echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+    WANDB_API_KEY:
+
+  # Assume your working directory is under `~/transformers`.
+  # To make this example work, please run the following command:
+  # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+  workdir: ~/transformers
 
+  setup: |
     pip install -e .
     cd examples/pytorch/question-answering/
-    pip install -r requirements.txt
+    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
     pip install wandb
 
   run: |
-    cd ./examples/pytorch/question-answering/
+    cd examples/pytorch/question-answering/
     python run_qa.py \
-    --model_name_or_path bert-base-uncased \
-    --dataset_name squad \
-    --do_train \
-    --do_eval \
-    --per_device_train_batch_size 12 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 50 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --report_to wandb \
-    --run_name $SKYPILOT_TASK_ID \
-    --output_dir /checkpoint/bert_qa/ \
-    --save_total_limit 10 \
-    --save_steps 1000
+      --model_name_or_path bert-base-uncased \
+      --dataset_name squad \
+      --do_train \
+      --do_eval \
+      --per_device_train_batch_size 12 \
+      --learning_rate 3e-5 \
+      --num_train_epochs 50 \
+      --max_seq_length 384 \
+      --doc_stride 128 \
+      --report_to wandb \
+      --output_dir /checkpoint/bert_qa/ \
+      --run_name $SKYPILOT_TASK_ID \
+      --save_total_limit 10 \
+      --save_steps 1000
 
 As HuggingFace has built-in support for periodically checkpointing, we only need to pass the highlighted arguments for setting up
 the output directory and frequency of checkpointing (see more
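Before launching this version, the ``name:`` field under ``file_mounts`` must be filled in; SkyPilot then creates or reuses a cloud bucket with that name and mounts it at ``/checkpoint``. A sketch with a hypothetical bucket name (bucket names must be globally unique):

.. code-block:: yaml

  file_mounts:
    /checkpoint:
      name: my-bert-qa-checkpoints  # hypothetical; replace with your own, globally unique name
      mode: MOUNT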
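Once launched, spot preemptions and recoveries can be watched with the standard managed-jobs commands; a sketch (the job ID comes from the queue output):

.. code-block:: console

  $ sky jobs queue      # job status, e.g. RUNNING, or RECOVERING after a preemption
  $ sky jobs logs 1     # stream the logs of managed job 1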
