🐛 Describe the bug
As shown on the dashboard, avg_inference_latency (ms) is skipped for LLM models, which report only generate_time (ms) instead.
Checking the iOS run as an example, an LLM job runs three on-device tests, each reporting different metrics:
- `test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4`
- `test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4`
- `test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4`
A non-LLM job, by contrast, runs only the first two tests (`test_load_*` and `test_forward_*`).
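
For context, the split above corresponds to separate XCTest cases in the iOS benchmark target. Below is a minimal, hedged sketch of that layout, assuming a standard XCTest `measure(metrics:)` setup; the class name and the `loadModel` / `forwardPass` / `generateTokens` helpers are placeholders, not the actual ExecuTorch benchmark code.

```swift
import XCTest

final class LLMBenchmarkTests: XCTestCase {

  // Reported for both LLM and non-LLM models: model load time.
  func test_load() {
    measure(metrics: [XCTClockMetric(), XCTMemoryMetric()]) {
      loadModel()
    }
  }

  // Reported for both LLM and non-LLM models: a single forward pass,
  // the source of avg_inference_latency (ms).
  func test_forward() {
    measure(metrics: [XCTClockMetric()]) {
      forwardPass()
    }
  }

  // LLM-only: end-to-end token generation, the source of generate_time (ms).
  func test_generate() {
    measure(metrics: [XCTClockMetric()]) {
      generateTokens()
    }
  }

  // Placeholder hooks standing in for the real runner calls.
  private func loadModel() {}
  private func forwardPass() {}
  private func generateTokens() {}
}
```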
See detailed jobs here:
- LLM: https://github.com/pytorch/executorch/actions/runs/13403521306/job/37441009799
- non-LLM: https://github.com/pytorch/executorch/actions/runs/13403521306/job/37441008720
Three things to clarify in this task:
1. Since `test_forward_*` is reported for both LLM and non-LLM jobs, why isn't it reported to the dashboard?
2. Annotate each metric in the DB so users know exactly what each one measures.
3. Confirm whether Android is measuring and reporting the exact same metrics: Report avg_inference_latency from Android LLM benchmark app #8578
Versions
trunk