
Commit 2a4d641

Add a batched auto tune script (#25076)
Signed-off-by: Karan Goel <karangoel@google.com>
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent: e67a79d

2 files changed: 195 additions, 0 deletions

benchmarks/auto_tune/README.md

Lines changed: 67 additions & 0 deletions
@@ -149,3 +149,70 @@ The script follows a systematic process to find the optimal parameters:
4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.

5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.

## Batched `auto_tune`

The `batch_auto_tune.sh` script allows you to run multiple `auto_tune.sh` experiments sequentially from a single configuration file. It iterates through a list of parameter sets, executes `auto_tune.sh` for each, and records the results back into the input file.

### Prerequisites

- **jq**: This script requires `jq` to parse the JSON configuration file.
- **gcloud**: If you plan to upload results to Google Cloud Storage, the `gcloud` CLI must be installed and authenticated.
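
Both dependencies can be verified up front; `batch_auto_tune.sh` performs the same `command -v` checks itself and exits early if `jq` is missing, or if a GCS path is given without `gcloud`. A minimal preflight check, mirroring the script's own checks:

```bash
# Optional preflight check; batch_auto_tune.sh repeats these checks before running
command -v jq >/dev/null || echo "jq is not installed"
command -v gcloud >/dev/null || echo "gcloud is not installed (only needed when uploading to GCS)"
```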
161+
162+
### How to Run
163+
164+
1. **Create a JSON configuration file**: Create a file (e.g., `runs_config.json`) containing an array of JSON objects. Each object defines the parameters for a single `auto_tune.sh` run.
165+
166+
2. **Execute the script**:
167+
168+
```bash
169+
bash batch_auto_tune.sh <path_to_json_file> [gcs_upload_path]
170+
```
171+
172+
- `<path_to_json_file>`: **Required.** Path to your JSON configuration file.
173+
- `[gcs_upload_path]`: **Optional.** A GCS path (e.g., `gs://my-bucket/benchmark-results`) where the detailed results and profiles for each run will be uploaded. If this is empty, the results will be available on the local filesystem (see the log for `RESULT_FILE=/path/to/results/file.txt`).
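
For example, using the config file name from step 1 and the example bucket path above (both illustrative), a full invocation looks like:

```bash
# Run all configurations and upload each run's artifacts to GCS
bash batch_auto_tune.sh runs_config.json gs://my-bucket/benchmark-results

# Or keep all results on the local filesystem
bash batch_auto_tune.sh runs_config.json
```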
### Configuration File

The JSON configuration file should contain an array of objects. Each object's keys correspond to the configuration variables for `auto_tune.sh` (see the [Configuration table above](#configuration)). These keys will be converted to uppercase environment variables for each run.

Here is an example `runs_config.json` with two benchmark configurations; the `system` field should be either `"TPU"` or `"GPU"`:

```json
[
  {
    "base": "/home/user",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "system": "TPU",
    "tp": 8,
    "input_len": 128,
    "output_len": 2048,
    "max_model_len": 2300,
    "num_seqs_list": "128 256",
    "num_batched_tokens_list": "8192 16384"
  },
  {
    "base": "/home/user",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "system": "TPU",
    "tp": 8,
    "input_len": 4000,
    "output_len": 16,
    "max_model_len": 4096,
    "num_seqs_list": "64 128",
    "num_batched_tokens_list": "4096 8192",
    "max_latency_allowed_ms": 500
  }
]
```
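
To make the conversion to environment variables concrete: for the first object above, the batch script builds uppercase `VAR=value` pairs and launches the run via `env ... bash auto_tune.sh`, roughly equivalent to the following (an illustrative sketch, not a command emitted by the script):

```bash
# Approximate environment constructed for the first configuration above
env BASE=/home/user \
    MODEL=meta-llama/Llama-3.1-8B-Instruct \
    SYSTEM=TPU \
    TP=8 \
    INPUT_LEN=128 \
    OUTPUT_LEN=2048 \
    MAX_MODEL_LEN=2300 \
    NUM_SEQS_LIST="128 256" \
    NUM_BATCHED_TOKENS_LIST="8192 16384" \
    bash auto_tune.sh
```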
### Output

The script modifies the input JSON file in place, adding the results of each run to the corresponding object. The following fields are added:

- `run_id`: A unique identifier for the run, derived from the timestamp.
- `status`: The outcome of the run (`SUCCESS`, `FAILURE`, or `WARNING_NO_RESULT_FILE`).
- `results`: The content of the `result.txt` file from the `auto_tune.sh` run.
- `gcs_results`: The GCS URL where the run's artifacts are stored (if a GCS path was provided).

A summary of successful and failed runs is also printed to the console upon completion.
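
Since the statuses are written back into the input file, a batch can be summarized afterwards with a `jq` query such as the following (a hypothetical one-liner, not part of the script):

```bash
# Print the run_id and status recorded for each configuration
jq -r '.[] | "\(.run_id // "-")\t\(.status // "-")"' runs_config.json
```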
benchmarks/auto_tune/batch_auto_tune.sh

Lines changed: 128 additions & 0 deletions

@@ -0,0 +1,128 @@
#!/bin/bash

INPUT_JSON="$1"
GCS_PATH="$2" # Optional GCS path for uploading results for each run

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
AUTOTUNE_SCRIPT="$SCRIPT_DIR/auto_tune.sh"

if [[ -z "$INPUT_JSON" ]]; then
    echo "Error: Input JSON file not provided."
    echo "Usage: $0 <path_to_json_file> [gcs_upload_path]"
    exit 1
fi

if [[ ! -f "$INPUT_JSON" ]]; then
    echo "Error: File not found at '$INPUT_JSON'"
    exit 1
fi

if ! command -v jq &> /dev/null; then
    echo "Error: 'jq' command not found. Please install jq to process the JSON input."
    exit 1
fi

if [[ -n "$GCS_PATH" ]] && ! command -v gcloud &> /dev/null; then
    echo "Error: 'gcloud' command not found, but a GCS_PATH was provided."
    exit 1
fi

SUCCESS_COUNT=0
FAILURE_COUNT=0
FAILED_RUNS=()
SCRIPT_START_TIME=$(date +%s)

json_content=$(cat "$INPUT_JSON")
if ! num_runs=$(echo "$json_content" | jq 'length'); then
    echo "Error: Invalid JSON in $INPUT_JSON. 'jq' failed to get array length." >&2
    exit 1
fi

echo "Found $num_runs benchmark configurations in $INPUT_JSON."
echo "Starting benchmark runs..."
echo "--------------------------------------------------"

for i in $(seq 0 $(($num_runs - 1))); do
    run_object=$(echo "$json_content" | jq ".[$i]")

    RUN_START_TIME=$(date +%s)
    ENV_VARS_ARRAY=()
    # Dynamically create env vars from the JSON object's keys
    for key in $(echo "$run_object" | jq -r 'keys_unsorted[]'); do
        value=$(echo "$run_object" | jq -r ".$key")
        var_name=$(echo "$key" | tr '[:lower:]' '[:upper:]' | tr -cd 'A-Z0-9_')
        ENV_VARS_ARRAY+=("${var_name}=${value}")
    done

    echo "Executing run #$((i+1))/$num_runs with parameters: ${ENV_VARS_ARRAY[*]}"

    # Execute auto_tune.sh and capture output
    RUN_OUTPUT_FILE=$(mktemp)
    if env "${ENV_VARS_ARRAY[@]}" bash "$AUTOTUNE_SCRIPT" > >(tee -a "$RUN_OUTPUT_FILE") 2>&1; then
        STATUS="SUCCESS"
        ((SUCCESS_COUNT++))
    else
        STATUS="FAILURE"
        ((FAILURE_COUNT++))
        FAILED_RUNS+=("Run #$((i+1)): $(echo $run_object | jq -c .)")
    fi

    RUN_OUTPUT=$(<"$RUN_OUTPUT_FILE")
    rm "$RUN_OUTPUT_FILE"

    # Parse results and optionally upload them to GCS
    RUN_ID=""
    RESULTS=""
    GCS_RESULTS_URL=""
    if [[ "$STATUS" == "SUCCESS" ]]; then
        RESULT_FILE_PATH=$(echo "$RUN_OUTPUT" | grep 'RESULT_FILE=' | tail -n 1 | cut -d'=' -f2 | tr -s '/' || true)

        if [[ -n "$RESULT_FILE_PATH" && -f "$RESULT_FILE_PATH" ]]; then
            RUN_ID=$(basename "$(dirname "$RESULT_FILE_PATH")")
            RESULT_DIR=$(dirname "$RESULT_FILE_PATH")
            RESULTS=$(cat "$RESULT_FILE_PATH")

            if [[ -n "$GCS_PATH" ]]; then
                GCS_RESULTS_URL="${GCS_PATH}/${RUN_ID}"
                echo "Uploading results to GCS..."
                if gcloud storage rsync --recursive "$RESULT_DIR/" "$GCS_RESULTS_URL"; then
                    echo "GCS upload successful."
                else
                    echo "Warning: GCS upload failed for RUN_ID $RUN_ID."
                fi
            fi
        else
            echo "Warning: Could not find result file for a successful run."
            STATUS="WARNING_NO_RESULT_FILE"
        fi
    fi

    # Add the results back into the JSON object for this run
    json_content=$(echo "$json_content" | jq --argjson i "$i" --arg run_id "$RUN_ID" --arg status "$STATUS" --arg results "$RESULTS" --arg gcs_results "$GCS_RESULTS_URL" \
        '.[$i] += {run_id: $run_id, status: $status, results: $results, gcs_results: $gcs_results}')

    RUN_END_TIME=$(date +%s)
    echo "Run finished in $((RUN_END_TIME - RUN_START_TIME)) seconds. Status: $STATUS"
    echo "--------------------------------------------------"

    # Save intermediate progress back to the file
    echo "$json_content" > "$INPUT_JSON.tmp" && mv "$INPUT_JSON.tmp" "$INPUT_JSON"

done

SCRIPT_END_TIME=$(date +%s)
echo "All benchmark runs completed in $((SCRIPT_END_TIME - SCRIPT_START_TIME)) seconds."
echo
echo "====================== SUMMARY ======================"
echo "Successful runs: $SUCCESS_COUNT"
echo "Failed runs: $FAILURE_COUNT"
echo "==================================================="

if [[ $FAILURE_COUNT -gt 0 ]]; then
    echo "Details of failed runs (see JSON file for full parameters):"
    for failed in "${FAILED_RUNS[@]}"; do
        echo " - $failed"
    done
fi

echo "Updated results have been saved to '$INPUT_JSON'."
