Add rocm perf yml file #418

Open

wants to merge 1 commit into base: rocm-jaxlib-v0.5.0
Conversation

@Ruturaj4 Ruturaj4 commented May 12, 2025

This PR adds a new GitHub Actions workflow that:

  • Builds JAX with ROCm support inside a Docker container.
  • Runs training for the following MaxText models:
    • llama2_7b
    • gemma_2b
    • gpt3_6b
    • mixtral_8x1b
  • Captures stdout logs for each model and extracts per-step timings.
  • Ignores step 0 (warmup) when computing metrics.
  • Computes median_step_time per model and saves it to summary.json.
  • Uploads logs and metrics as workflow artifacts.

A Python analysis script (analyze_maxtext_logs.py) is added under jax/build/rocm/ to parse the logs and generate the summary.

@Ruturaj4 Ruturaj4 closed this May 12, 2025
@Ruturaj4 Ruturaj4 reopened this May 12, 2025
Comment on lines 103 to 108
            times.append(float(m.group(1)))
    if times:
        summary[model] = {
            "median_step_time": round(float(np.median(times)), 3),
            "steps_counted": len(times)
        }
Collaborator
grab the parsed steps too with the summary

Suggested change

            times.append(float(m.group(1)))
    if times:
        summary[model] = {
            "median_step_time": round(float(np.median(times)), 3),
            "steps_counted": len(times)
        }

            times.append(float(m.group(1)))
    if times:
        step_info = [{"step": n, "time": t} for n, t in enumerate(times)]
        summary[model] = {
            "steps": step_info,
            "median_step_time": round(float(np.median(times)), 3),
            "steps_counted": len(times)
        }
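To see what the suggested change produces, here is a minimal standalone sketch; the `times` list is illustrative (step 0 / warmup assumed already dropped):

```python
import json

import numpy as np

# Illustrative per-step timings; in the workflow these come from parsed logs.
times = [1.2, 1.1, 1.3, 1.2]

# The reviewer's suggestion: record each parsed step alongside the aggregates.
step_info = [{"step": n, "time": t} for n, t in enumerate(times)]
entry = {
    "steps": step_info,
    "median_step_time": round(float(np.median(times)), 3),
    "steps_counted": len(times),
}
print(json.dumps(entry, indent=2))
```

Each model's entry in summary.json then carries the raw per-step data as well, which makes later debugging of outlier steps possible without re-downloading the log artifacts.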

Author

sounds good


      - name: Run MaxText training and save logs
        run: |
          docker exec maxtext_container bash -c "pip install -r requirements.txt"
Collaborator

I could see this pip install being an issue across jax versions

Author

Ahh no, it will be different for the different branches used in rocm/maxtext; e.g., currently I am using rv_jax, but I will be renaming it to jax_0.5.0 or something

@Ruturaj4 Ruturaj4 force-pushed the bring_rocm_dlm_perf branch from d2912f5 to 952cc4f Compare May 13, 2025 12:13
@charleshofer
Collaborator

This is a port from the other performance CI PR, right? Could you add a description and link to the original PR?

@i-chaochen

i-chaochen commented May 21, 2025

are we considering grok and alphafold models? @Ruturaj4 @JehandadKhan

Yes, we are definitely planning to add alphafold; however, grok testing takes too much time to download the weights. If grok training can be done, or if there are ways to run grok faster, we are happy to add those as well!

@Arech8

Arech8 commented May 22, 2025

Why did you choose to report median step time?

I don't know the rationale for that, but in general, I'm not sure the median is the right metric here. It rejects outliers, and on its own it says nothing about the distribution of values, yet that is exactly what is important to know:

  • if there are significant outliers (in either direction):
    • it's important to investigate them. For example, if there's some Java-like garbage-collection step that makes an app totally unresponsive, that is a no-go in many contexts. This holds for model training/inference too.
    • outliers significantly influence total runtime and users' perception of "fast" or "slow".
  • robust statistics, of which the median is one, are the best at describing the shape of an arbitrary distribution, but no single robust metric can do that in isolation: several must be used together. At minimum min + max, though the quartiles (25% + 75%) are generally super useful as well.
    • if you want a single value, the mean is much better, since it carries equally weighted information from all samples, while the median describes only 1 or 2 samples at best, saying nothing about the rest.
    • the mean has the additional nice property that it lets you forecast total runtime in different circumstances. For example, if over 100 epochs you measured an average of 1 s per epoch, then you have some reason to expect 1000 epochs to last 1000 seconds. Do the same with the median and you can say nothing even about the next 100-epoch run.

TLDR: the mean seems a much better metric here. For the best results, I'd report 6 values: the [0, 25, 50, 75, 100]% quantiles plus the mean (b/c of the last bullet point) (and God forbid, stddev).
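The six-value proposal above is straightforward to compute with NumPy; a minimal sketch (the function name `step_time_stats` is hypothetical, and `times` would be the parsed per-step timings):

```python
import numpy as np


def step_time_stats(times):
    """Report the 0/25/50/75/100% quantiles plus the mean for a list of
    per-step times, as suggested in the review comment."""
    q = np.quantile(times, [0.0, 0.25, 0.5, 0.75, 1.0])
    return {
        "min": float(q[0]),
        "p25": float(q[1]),
        "median": float(q[2]),
        "p75": float(q[3]),
        "max": float(q[4]),
        "mean": float(np.mean(times)),
    }
```

This keeps the existing median while adding the spread and mean, so a single summary.json entry both flags outliers (min/max far from the quartiles) and supports total-runtime forecasting (mean × steps).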

        run: |
          docker exec maxtext_container bash -c "pip install -r requirements.txt"
          for config in \
            MaxText/configs/models/gpu/llama2_7b_rocm.yml \

I guess the batch sizes and other config values in the yamls are tailored for the MI250 (as per the CI machine spec above)?
If so, and if the branch is also going to be used anywhere else (like merged to main), then perhaps it's wise to rename the files, changing the _rocm suffix to _MI250 to make the tailoring apparent.
MI250 already implies ROCm, but ROCm only implies "some GPU officially supported by ROCm", which might not hold for the smaller Instincts, which may simply not cope with the batch sizes and other values used.
