Commit: Merge remote-tracking branch 'upstream/master'
Showing 30 changed files with 970 additions and 444 deletions.
New file (11 lines added):

```yaml
service:
  readiness_probe: /
  replicas: 2

resources:
  cloud: oci
  region: us-sanjose-1
  ports: 8080
  cpus: 2+

run: python -m http.server 8080
```
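The `run` command above simply serves files over HTTP on port 8080, and the readiness probe hits `/` on each replica. As a minimal local sketch of what that probe checks (port 8765 is an arbitrary choice here to avoid clashing with a real deployment), the same server can be started and polled like this:

```python
import subprocess
import sys
import time
import urllib.request

# Launch the same server the `run` command starts, on a throwaway port.
proc = subprocess.Popen(
    [sys.executable, "-m", "http.server", "8765"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
try:
    # Poll the readiness path "/" the way a load-balancer probe would,
    # retrying briefly while the server starts up.
    status = None
    for _ in range(50):
        try:
            with urllib.request.urlopen("http://127.0.0.1:8765/") as resp:
                status = resp.status
            break
        except OSError:
            time.sleep(0.1)
    print(status)  # 200 once the replica is ready to serve traffic
finally:
    proc.terminate()
    proc.wait()
```

A replica is only marked ready (and sent traffic) once the probe path returns a success status, which is why the probe path is kept as cheap as possible.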
New file (25 lines added):

```yaml
# service.yaml
service:
  readiness_probe: /v1/models
  replicas: 2

# Fields below describe each replica.
resources:
  cloud: oci
  region: us-sanjose-1
  ports: 8080
  accelerators: {A10:1}

setup: |
  conda create -n vllm python=3.12 -y
  conda activate vllm
  pip install vllm
  pip install vllm-flash-attn

run: |
  conda activate vllm
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8080 \
    --model Qwen/Qwen2-7B-Instruct \
    --served-model-name Qwen2-7B-Instruct \
    --device=cuda --dtype auto --max-model-len=2048
```
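The vLLM entrypoint above exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can talk to it once a replica is up. A hedged sketch of the request a client would send (the base URL is an assumption; substitute the actual endpoint of the deployed service, and note that `model` must match `--served-model-name` from the config):

```python
import json
import urllib.request

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
# "Qwen2-7B-Instruct" matches --served-model-name in the config above.
payload = {
    "model": "Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
body = json.dumps(payload).encode()

# The host/port here mirror --host/--port from the run command; replace
# with the real service endpoint for an actual deployment.
req = urllib.request.Request(
    "http://0.0.0.0:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Sending the request needs a running replica, so it is left commented:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.get_method(), req.full_url)  # POST http://0.0.0.0:8080/v1/chat/completions
```

The readiness probe path `/v1/models` is itself part of this API: it lists the served model names, so it only succeeds once the model weights are loaded and the server can actually answer requests.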