[Not To Land] Script for benchmark stability assessment #10982

Draft
wants to merge 2 commits into
base: main

guangy10
Contributor

Summary

A custom script for assessing ET benchmark stability. Install its dependencies first:

pip install openpyxl tabulate matplotlib

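Then run the analysis, passing the private-device dataset as the primary file and the public-device dataset as the reference: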

python extension/benchmark/analyze_latency_stability.py Benchmark\ Dataset\ with\ Private\ AWS\ Devices.xlsx --reference_file Benchmark\ Dataset\ with\ Public\ AWS\ Devices.xlsx

The generated analysis:

Analyzing latency stability from primary file: /Users/guangyang/Desktop/Benchmark Dataset with Private AWS Devices.xlsx
Using reference file for comparison: /Users/guangyang/Desktop/Benchmark Dataset with Public AWS Devices.xlsx


====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================

Loading dataset: llama3_qlora+s22_android13
Loading dataset: llama3_spinq+s22_android13
Loading dataset: mv3_qnn+s22_android13
Loading dataset: mv3_xnnq8+s22_android13
Loading dataset: llama3_qlora+s22ultra_android14
Loading dataset: llama3_spinq+s22ultra_android14
Loading dataset: mv3_qnn+s22ultra_android14
Loading dataset: mv3_xnnq8+s22ultra_android14
Loading dataset: mv3_xnnq8+pixel3_rooted_android
Loading dataset: llama3_qlora+iphone15max_ios17
Loading dataset: llama3_spinq+iphone15max_ios17
Loading dataset: mv3_xnnq8+iphone15max_ios17


====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================

Loading reference dataset: mv3_xnnq8+s22_android13
Loading reference dataset: mv3_qnn+s22_android13
Loading reference dataset: mv3_xnnq8+s22ultra_android14
Loading reference dataset: mv3_qnn+s22ultra_android14
Loading reference dataset: llama3_qlora+iphone15max_ios17
Loading reference dataset: llama3_spinq+iphone15max_ios17
Loading reference dataset: mv3_spinq+iphone15max_ios17
Warning: Minimum latency value is zero, max/min ratio set to infinity


====================================================================================================
===== LATENCY STABILITY ANALYSIS - PRIMARY DATASETS ===============================================
====================================================================================================


Latency Stability Analysis: llama3_qlora+s22_android13 (Primary)
================================================================================
Model: llama3_qlora
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 22502.10 ms
  - Median latency (P50): 22447.56 ms
  - Mean trimmed latency: 22388.87 ms
  - Median trimmed latency: 22343.47 ms

Dispersion Metrics:
  - Standard deviation: 595.01 ms
  - Coefficient of variation (CV): 2.64%
  - Interquartile range (IQR): 858.26 ms
  - Trimmed standard deviation: 596.25 ms
  - Trimmed coefficient of variation: 2.66%

Percentile Metrics:
  - P50 (median): 22447.56 ms
  - P90: 23231.99 ms
  - P95: 23518.35 ms
  - P99: 23910.11 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1423
  - P99/P50 ratio: 1.0652
  - Mean rolling std (window=5): 539.36 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.81%

Throughput Metrics:
  - Mean TPS: 33.07
  - TPS coefficient of variation: 6.92%

Stability Assessment:
  - Overall stability score: 83.4/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 83.4/100) with low
  variation between runs (CV: 2.64%).
  Performance is consistent and predictable for most use cases.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22_android13_primary_time_series.png
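For readers who want to sanity-check the dispersion and inter-jitter numbers in a report like the one above, here is a minimal sketch of how such metrics could be derived from a list of per-run latencies. It is not the script's actual code: the function names, the use of the sample standard deviation, and the linear-interpolation percentile are all assumptions. The zero guard mirrors the "Minimum latency value is zero, max/min ratio set to infinity" warning printed while loading the reference datasets.

```python
import math
import statistics


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100] (assumed method)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def safe_ratio(numerator, denominator):
    """Return infinity instead of raising when the denominator is zero."""
    return numerator / denominator if denominator != 0 else math.inf


def stability_metrics(latencies_ms, window=5):
    """Hypothetical helper: summarize run-to-run latency variation."""
    mean = statistics.fmean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample std; the script may differ
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    # Standard deviation of each consecutive window of runs.
    rolling = [
        statistics.stdev(latencies_ms[i:i + window])
        for i in range(len(latencies_ms) - window + 1)
    ]
    return {
        "mean_ms": mean,
        "std_ms": std,
        "cv_pct": safe_ratio(std, mean) * 100.0,
        "iqr_ms": percentile(latencies_ms, 75) - percentile(latencies_ms, 25),
        "p99_p50_ratio": safe_ratio(p99, p50),
        "max_min_ratio": safe_ratio(max(latencies_ms), min(latencies_ms)),
        "mean_rolling_std_ms": statistics.fmean(rolling) if rolling else math.nan,
    }


if __name__ == "__main__":
    # Made-up latencies, not taken from the benchmark datasets.
    print(stability_metrics([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The ratios are dimensionless, so they can be compared across models whose absolute latencies differ by orders of magnitude (for example, llama3 at around 22 s versus mv3 at around 1 ms).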

Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
================================================================================
Model: llama3_spinq
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 21771.59 ms
  - Median latency (P50): 21668.24 ms
  - Mean trimmed latency: 21662.53 ms
  - Median trimmed latency: 21559.89 ms

Dispersion Metrics:
  - Standard deviation: 514.89 ms
  - Coefficient of variation (CV): 2.36%
  - Interquartile range (IQR): 602.75 ms
  - Trimmed standard deviation: 515.03 ms
  - Trimmed coefficient of variation: 2.38%

Percentile Metrics:
  - P50 (median): 21668.24 ms
  - P90: 22438.74 ms
  - P95: 22542.42 ms
  - P99: 23104.76 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1452
  - P99/P50 ratio: 1.0663
  - Mean rolling std (window=5): 449.10 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.50%
  - Max trimming effect ratio: 0.89%

Throughput Metrics:
  - Mean TPS: 33.76
  - TPS coefficient of variation: 4.70%

Stability Assessment:
  - Overall stability score: 84.7/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 84.7/100) with low
  variation between runs (CV: 2.36%).
  Performance is consistent and predictable for most use cases.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
================================================================================
Model: mv3_qnn
Device: s22_android13

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.00 ms
  - Mean trimmed latency: 1.00 ms
  - Median trimmed latency: 1.00 ms

Dispersion Metrics:
  - Standard deviation: 0.02 ms
  - Coefficient of variation (CV): 2.34%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.02 ms
  - Trimmed coefficient of variation: 2.27%

Percentile Metrics:
  - P50 (median): 1.00 ms
  - P90: 1.01 ms
  - P95: 1.01 ms
  - P99: 1.14 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.1919
  - P99/P50 ratio: 1.1404
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.00%

Stability Assessment:
  - Overall stability score: 82.4/100
  - Overall stability rating: Good

Interpretation:
  The benchmark shows good stability (score: 82.4/100) with low
  variation between runs (CV: 2.34%).
  Performance is consistent and predictable for most use cases.

  The P99/P50 ratio of 1.14 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22_android13_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.73 ms
  - Median latency (P50): 2.65 ms
  - Mean trimmed latency: 2.22 ms
  - Median trimmed latency: 2.10 ms

Dispersion Metrics:
  - Standard deviation: 0.63 ms
  - Coefficient of variation (CV): 23.03%
  - Interquartile range (IQR): 0.95 ms
  - Trimmed standard deviation: 0.36 ms
  - Trimmed coefficient of variation: 15.98%

Percentile Metrics:
  - P50 (median): 2.65 ms
  - P90: 3.59 ms
  - P95: 3.74 ms
  - P99: 4.46 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.4427
  - P99/P50 ratio: 1.6812
  - Mean rolling std (window=5): 0.60 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 16.52%
  - Max trimming effect ratio: 36.96%

Stability Assessment:
  - Overall stability score: 14.9/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 14.9/100) with significant
  variation between runs (CV: 23.03%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (16.5%) with occasional outliers within benchmark runs.

  The max/min ratio of 2.44 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.68 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22_android13_primary_time_series.png
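The intra-jitter figures compare each run's raw latency with its trimmed latency. The sketch below assumes the trimming effect ratio is `(raw - trimmed) / raw` per run, reported as a mean and a maximum across runs. That definition is a guess that happens to be consistent with the printed numbers (for example, (22502.10 - 22388.87) / 22502.10 is about 0.50% for llama3_qlora+s22_android13); the function name is hypothetical.

```python
import statistics


def trimming_effect(raw_ms, trimmed_ms):
    """Hypothetical helper: per-run trimming effect, summarized across runs.

    Assumes ratio = (raw - trimmed) / raw, expressed in percent.
    Runs with a non-positive raw latency are skipped.
    """
    ratios = [
        (raw - trimmed) / raw * 100.0
        for raw, trimmed in zip(raw_ms, trimmed_ms)
        if raw > 0
    ]
    if not ratios:
        return None
    return {"mean_pct": statistics.fmean(ratios), "max_pct": max(ratios)}


if __name__ == "__main__":
    # Made-up values for two runs, not taken from the benchmark datasets.
    print(trimming_effect([2.0, 4.0], [1.0, 3.0]))
```

Under this definition the mean of the per-run ratios generally differs from the ratio of the overall means, which would explain why 16.52% does not equal (2.73 - 2.22) / 2.73.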

Latency Stability Analysis: llama3_qlora+s22ultra_android14 (Primary)
================================================================================
Model: llama3_qlora
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 25022.84 ms
  - Median latency (P50): 25427.33 ms
  - Mean trimmed latency: 24748.06 ms
  - Median trimmed latency: 25062.01 ms

Dispersion Metrics:
  - Standard deviation: 1545.62 ms
  - Coefficient of variation (CV): 6.18%
  - Interquartile range (IQR): 2844.11 ms
  - Trimmed standard deviation: 1467.60 ms
  - Trimmed coefficient of variation: 5.93%

Percentile Metrics:
  - P50 (median): 25427.33 ms
  - P90: 26581.31 ms
  - P95: 27184.07 ms
  - P99: 28668.97 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.2710
  - P99/P50 ratio: 1.1275
  - Mean rolling std (window=5): 1560.71 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.08%
  - Max trimming effect ratio: 4.80%

Throughput Metrics:
  - Mean TPS: 28.35
  - TPS coefficient of variation: 7.88%

Stability Assessment:
  - Overall stability score: 62.5/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 62.5/100) with noticeable
  variation between runs (CV: 6.18%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.27 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.13 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: llama3_spinq+s22ultra_android14 (Primary)
================================================================================
Model: llama3_spinq
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 24761.78 ms
  - Median latency (P50): 25043.89 ms
  - Mean trimmed latency: 24466.21 ms
  - Median trimmed latency: 24731.04 ms

Dispersion Metrics:
  - Standard deviation: 1552.25 ms
  - Coefficient of variation (CV): 6.27%
  - Interquartile range (IQR): 1931.42 ms
  - Trimmed standard deviation: 1466.19 ms
  - Trimmed coefficient of variation: 5.99%

Percentile Metrics:
  - P50 (median): 25043.89 ms
  - P90: 26163.60 ms
  - P95: 26948.68 ms
  - P99: 28868.51 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.3648
  - P99/P50 ratio: 1.1527
  - Mean rolling std (window=5): 1451.05 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 1.17%
  - Max trimming effect ratio: 4.90%

Throughput Metrics:
  - Mean TPS: 29.85
  - TPS coefficient of variation: 8.24%

Stability Assessment:
  - Overall stability score: 60.3/100
  - Overall stability rating: Moderate

Interpretation:
  The benchmark shows moderate stability (score: 60.3/100) with noticeable
  variation between runs (CV: 6.27%).
  While average performance is acceptable, occasional latency spikes may occur.

  The max/min ratio of 1.36 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.15 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_qnn+s22ultra_android14 (Primary)
================================================================================
Model: mv3_qnn
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 100
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-15 21:14:41+00:00

Central Tendency Metrics:
  - Mean latency: 1.01 ms
  - Median latency (P50): 1.01 ms
  - Mean trimmed latency: 1.01 ms
  - Median trimmed latency: 1.01 ms

Dispersion Metrics:
  - Standard deviation: 0.01 ms
  - Coefficient of variation (CV): 0.91%
  - Interquartile range (IQR): 0.01 ms
  - Trimmed standard deviation: 0.01 ms
  - Trimmed coefficient of variation: 0.70%

Percentile Metrics:
  - P50 (median): 1.01 ms
  - P90: 1.02 ms
  - P95: 1.02 ms
  - P99: 1.03 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.0900
  - P99/P50 ratio: 1.0204
  - Mean rolling std (window=5): 0.01 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 0.19%
  - Max trimming effect ratio: 1.94%

Stability Assessment:
  - Overall stability score: 93.8/100
  - Overall stability rating: Excellent

Interpretation:
  The benchmark shows excellent stability (score: 93.8/100) with very low
  variation between runs (CV: 0.91%).
  This indicates highly consistent performance suitable for latency-sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_qnn+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+s22ultra_android14 (Primary)
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+s22ultra_android14_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+pixel3_rooted_android (Primary)
================================================================================
Model: mv3_xnnq8
Device: pixel3_rooted_android

Dataset Overview:
  - Number of samples: 148
  - Date range: 2025-04-16 02:47:21+00:00 to 2025-04-29 01:17:49+00:00

Central Tendency Metrics:
  - Mean latency: 5.93 ms
  - Median latency (P50): 5.87 ms
  - Mean trimmed latency: 5.51 ms
  - Median trimmed latency: 5.45 ms

Dispersion Metrics:
  - Standard deviation: 0.46 ms
  - Coefficient of variation (CV): 7.68%
  - Interquartile range (IQR): 0.56 ms
  - Trimmed standard deviation: 0.27 ms
  - Trimmed coefficient of variation: 4.84%

Percentile Metrics:
  - P50 (median): 5.87 ms
  - P90: 6.44 ms
  - P95: 6.57 ms
  - P99: 7.26 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 1.6964
  - P99/P50 ratio: 1.2386
  - Mean rolling std (window=5): 0.41 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 6.66%
  - Max trimming effect ratio: 26.67%

Stability Assessment:
  - Overall stability score: 46.9/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 46.9/100) with significant
  variation between runs (CV: 7.68%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (6.7%) with occasional outliers within benchmark runs.

  The max/min ratio of 1.70 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.24 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+pixel3_rooted_android_primary_time_series.png

Latency Stability Analysis: llama3_qlora+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 74
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 15239.42 ms
  - Median latency (P50): 13167.50 ms

Dispersion Metrics:
  - Standard deviation: 4566.55 ms
  - Coefficient of variation (CV): 29.97%
  - Interquartile range (IQR): 2261.25 ms

Percentile Metrics:
  - P50 (median): 13167.50 ms
  - P90: 21784.50 ms
  - P95: 25082.10 ms
  - P99: 31016.48 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 2.5011
  - P99/P50 ratio: 2.3555
  - Mean rolling std (window=5): 3114.57 ms

Throughput Metrics:
  - Mean TPS: 8.34
  - TPS coefficient of variation: 39.55%

Stability Assessment:
  - Overall stability score: 6.2/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 6.2/100) with significant
  variation between runs (CV: 29.97%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 2.50 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.36 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_qlora+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 72
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 14440.01 ms
  - Median latency (P50): 12149.50 ms

Dispersion Metrics:
  - Standard deviation: 5312.72 ms
  - Coefficient of variation (CV): 36.79%
  - Interquartile range (IQR): 2231.00 ms

Percentile Metrics:
  - P50 (median): 12149.50 ms
  - P90: 18765.00 ms
  - P95: 25178.50 ms
  - P99: 35673.00 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.6163
  - P99/P50 ratio: 2.9362
  - Mean rolling std (window=5): 3238.06 ms

Throughput Metrics:
  - Mean TPS: 11.66
  - TPS coefficient of variation: 38.53%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 36.79%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 3.62 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.94 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/llama3_spinq+iphone15max_ios17_primary_time_series.png

Latency Stability Analysis: mv3_xnnq8+iphone15max_ios17 (Primary)
================================================================================
Model: mv3_xnnq8
Device: iphone15max_ios17

Dataset Overview:
  - Number of samples: 54
  - Date range: 2025-04-29 21:26:38+00:00 to 2025-05-10 09:24:40+00:00

Central Tendency Metrics:
  - Mean latency: 13.98 ms
  - Median latency (P50): 14.00 ms

Dispersion Metrics:
  - Standard deviation: 3.44 ms
  - Coefficient of variation (CV): 24.60%
  - Interquartile range (IQR): 4.00 ms

Percentile Metrics:
  - P50 (median): 14.00 ms
  - P90: 18.00 ms
  - P95: 20.00 ms
  - P99: 21.94 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 3.2857
  - P99/P50 ratio: 1.5671
  - Mean rolling std (window=5): 3.40 ms

Stability Assessment:
  - Overall stability score: 10.8/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 10.8/100) with significant
  variation between runs (CV: 24.60%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The max/min ratio of 3.29 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 1.57 suggests
  occasional latency spikes that could affect tail latency sensitive applications.

================================================================================
Generated time series plot: stability_analysis_results/mv3_xnnq8+iphone15max_ios17_primary_time_series.png


====================================================================================================
===== PRIVATE VS PUBLIC STABILITY COMPARISON ======================================================
====================================================================================================

Warning: No matching reference dataset for llama3_qlora+s22_android13
Warning: No matching reference dataset for llama3_spinq+s22_android13
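The warnings above suggest that primary and reference datasets are paired by their `model+device` name. A minimal sketch of that pairing, assuming exact name matching (the helper names are hypothetical and not taken from the script):

```python
def split_dataset_name(name):
    """Split 'model+device' into its two parts at the first '+'."""
    model, device = name.split("+", 1)
    return model, device


def pair_datasets(primary_names, reference_names):
    """Return (matched, unmatched) primary names, preserving input order."""
    reference = set(reference_names)
    matched = [name for name in primary_names if name in reference]
    unmatched = [name for name in primary_names if name not in reference]
    return matched, unmatched


if __name__ == "__main__":
    primary = ["llama3_qlora+s22_android13", "mv3_qnn+s22_android13"]
    reference = ["mv3_xnnq8+s22_android13", "mv3_qnn+s22_android13"]
    matched, unmatched = pair_datasets(primary, reference)
    for name in unmatched:
        print(f"Warning: No matching reference dataset for {name}")
    for name in matched:
        print(split_dataset_name(name))
```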

Private vs Public Stability Comparison: mv3_qnn+s22_android13
================================================================================
Model: mv3_qnn
Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.44 ms              | -0.44 ms     | -30.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.00 ms             | 1.00 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.02 ms             | 0.83 ms              | -0.80 ms     | -97.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 2.34%               | 57.29%               | -54.95%      | -95.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.06 ms              | -0.05 ms     | -83.3%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.14 ms             | 3.95 ms              | -2.81 ms     | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.1919              | 4.5354               | -3.3434      | -73.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.1404              | 3.9482               | -2.8078      | -71.1%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 82.4/100            | 0.0/100              | 82.4         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Good                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability.
  (Private: 82.4/100 vs Public: 0.0/100)
  Private environment has 95.9% lower coefficient of variation, indicating more consistent performance.
  Private environment has 30.3% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
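The "Difference" and "% Change" columns in the table above appear to be computed relative to the public (reference) value, with "Infinity" shown when the reference is zero, as in the Stability Score row. A minimal sketch under that assumption (function names are hypothetical):

```python
import math


def compare_metric(private_value, public_value):
    """Return (difference, percent change) relative to the public value."""
    diff = private_value - public_value
    if public_value == 0:
        # No meaningful percentage when the reference is zero.
        if diff == 0:
            return diff, 0.0
        return diff, math.copysign(math.inf, diff)
    return diff, diff / public_value * 100.0


def format_pct(pct):
    """Render a percent change the way the table does."""
    if math.isinf(pct):
        return "Infinity" if pct > 0 else "-Infinity"
    return f"{pct:.1f}%"


if __name__ == "__main__":
    # Stability Score row for mv3_qnn+s22_android13: 82.4 vs 0.0.
    print(format_pct(compare_metric(82.4, 0.0)[1]))
    # CV row for mv3_qnn+s22ultra_android14: 0.91% vs 1.35%.
    print(format_pct(compare_metric(0.91, 1.35)[1]))
```

Because the table prints rounded inputs, recomputing a percentage from the displayed values can differ slightly from the printed one (for example, 1.01 versus 1.44 gives about -29.9%, while the table shows -30.3%).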

Private vs Public Stability Comparison: mv3_xnnq8+s22_android13
================================================================================
Model: mv3_xnnq8
Device: s22_android13

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.73 ms             | 1.92 ms              | 0.81 ms      | 42.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.65 ms             | 1.06 ms              | 1.59 ms      | 150.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.63 ms             | 1.06 ms              | -0.43 ms     | -40.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 23.03%              | 55.09%               | -32.06%      | -58.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.95 ms             | 1.63 ms              | -0.68 ms     | -41.9%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 4.46 ms             | 4.63 ms              | -0.18 ms     | -3.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.4427              | 6.1313               | -3.6886      | -60.2%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.6812              | 4.3683               | -2.6871      | -61.5%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 14.9/100            | 0.0/100              | 14.9         | Infinity   |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability.
  (Private: 14.9/100 vs Public: 0.0/100)
  Private environment has 58.2% lower coefficient of variation, indicating more consistent performance.
  Public environment has 42.1% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
Warning: No matching reference dataset for llama3_qlora+s22ultra_android14
Warning: No matching reference dataset for llama3_spinq+s22ultra_android14

Private vs Public Stability Comparison: mv3_qnn+s22ultra_android14
================================================================================
Model: mv3_qnn
Device: s22ultra_android14

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 1.01 ms             | 1.02 ms              | -0.00 ms     | -0.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 1.01 ms             | 1.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 0.01 ms             | 0.01 ms              | -0.00 ms     | -32.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 0.91%               | 1.35%                | -0.44%       | -32.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.01 ms             | 0.01 ms              | 0.00 ms      | 0.0%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 1.03 ms             | 1.08 ms              | -0.04 ms     | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 1.0900              | 1.0990               | -0.0090      | -0.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 1.0204              | 1.0646               | -0.0442      | -4.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 93.8/100            | 90.4/100             | 3.4          | 3.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Excellent           | Excellent            | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Private environment shows better stability with a 3.8% higher stability score.
  (Private: 93.8/100 vs Public: 90.4/100)
  Private environment has 32.6% lower coefficient of variation, indicating more consistent performance.
  Private environment has 0.1% lower mean latency, indicating better performance.

Recommendation:
  The private environment provides better stability for this model+device combination.
  It is recommended for applications where consistent performance is critical.

================================================================================
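
The per-dataset metrics in the table above can be derived from raw per-iteration latencies roughly as follows. This is a minimal sketch under stated assumptions, not the script's actual code: `percentile` and `latency_metrics` are hypothetical helpers, the use of the sample standard deviation and of linearly interpolated percentiles is assumed, and the Stability Score is left out because its formula does not appear in this output.

```python
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def latency_metrics(samples_ms):
    """Summarize positive latency samples (ms) with the metrics named in the table."""
    vals = sorted(samples_ms)
    mean = statistics.mean(vals)
    # Assumption: sample (n-1) standard deviation; the script may use population std.
    std = statistics.stdev(vals) if len(vals) > 1 else 0.0
    p50 = percentile(vals, 50)
    p99 = percentile(vals, 99)
    return {
        "mean_ms": mean,
        "median_ms": p50,
        "std_ms": std,
        "cv_pct": std / mean * 100.0,  # coefficient of variation
        "iqr_ms": percentile(vals, 75) - percentile(vals, 25),
        "p99_ms": p99,
        "max_min_ratio": vals[-1] / vals[0],
        "p99_p50_ratio": p99 / p50,
    }


# Illustrative samples only; these are not taken from the benchmark dataset.
metrics = latency_metrics([1.00, 1.01, 1.01, 1.02, 1.09])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

CV normalizes the spread by the mean, which is what lets a ~1 ms model and a ~20 s model be compared on the same scale in the summary tables.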

Private vs Public Stability Comparison: mv3_xnnq8+s22ultra_android14
================================================================================
Model: mv3_xnnq8
Device: s22ultra_android14

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 2.91 ms             | 3.63 ms              | -0.72 ms     | -20.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 2.54 ms             | 3.62 ms              | -1.08 ms     | -30.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 1.14 ms             | 0.81 ms              | 0.32 ms      | 39.9%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 39.08%              | 22.35%               | 16.73%       | 74.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 0.82 ms             | 0.94 ms              | -0.12 ms     | -12.6%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 5.91 ms             | 5.50 ms              | 0.41 ms      | 7.5%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 5.6103              | 2.7228               | 2.8875       | 106.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.3319              | 1.5193               | 0.8126       | 53.5%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 0.0/100             | 15.5/100             | -15.5        | -100.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability.
  (Private: 0.0/100 vs Public: 15.5/100)
  Public environment has 74.8% lower coefficient of variation, indicating more consistent performance.
  Private environment has 20.0% lower mean latency, indicating better performance.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================
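
Judging from the numbers in the tables, the Difference column is private minus public, and the % Change column is that difference divided by the public (reference) value. The sketch below reproduces that arithmetic; `compare` is a hypothetical helper, and the choice of the public value as the baseline is inferred from the printed rows rather than confirmed by the script.

```python
import math


def compare(private, public):
    """Return (difference, percent change) of private relative to public."""
    diff = private - public
    # Guard against a zero baseline (e.g. a public Stability Score of 0.0).
    pct = diff / public * 100.0 if public != 0 else math.nan
    return diff, pct


# Mean latency row for llama3_qlora+iphone15max_ios17 (values as printed, already rounded).
diff, pct = compare(15239.42, 14133.01)
print(f"difference={diff:.2f} ms, change={pct:.1f}%")

# The same CV gap expressed against each baseline.
_, private_vs_public = compare(39.08, 22.35)   # private relative to public
_, public_vs_private = compare(22.35, 39.08)   # public relative to private
print(f"{private_vs_public:.1f}% vs {public_vs_private:.1f}%")
```

The baseline matters when reading the Interpretation lines. If the inference above is right, a CV of 39.08% against 22.35% means the private value is about 75% higher than the public one, while the public value is about 43% lower than the private one, so a sentence such as "Public environment has 74.8% lower coefficient of variation" overstates the gap. Because the inputs here are the rounded table values, recomputed percentages can differ from the printed ones in the last digit.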
Warning: No matching reference dataset for mv3_xnnq8+pixel3_rooted_android

Private vs Public Stability Comparison: llama3_qlora+iphone15max_ios17
================================================================================
Model: llama3_qlora
Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 15239.42 ms         | 14133.01 ms          | 1106.41 ms   | 7.8%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 13167.50 ms         | 13132.50 ms          | 35.00 ms     | 0.3%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 4566.55 ms          | 3019.85 ms           | 1546.71 ms   | 51.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 29.97%              | 21.37%               | 8.60%        | 40.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 2261.25 ms          | 527.50 ms            | 1733.75 ms   | 328.7%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 31016.48 ms         | 25167.92 ms          | 5848.56 ms   | 23.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 2.5011              | 2.3216               | 0.1795       | 7.7%       |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.3555              | 1.9165               | 0.4391       | 22.9%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 6.2/100             | 10.6/100             | -4.3         | -41.0%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability with a 41.0% higher stability score.
  (Private: 6.2/100 vs Public: 10.6/100)
  Public environment has 40.2% lower coefficient of variation, indicating more consistent performance.
  Public environment has 7.8% lower mean latency, indicating better performance.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================

Private vs Public Stability Comparison: llama3_spinq+iphone15max_ios17
================================================================================
Model: llama3_spinq
Device: iphone15max_ios17

Metric Comparison:
+-------------------------+---------------------+----------------------+--------------+------------+
| Metric                  | Private (Primary)   | Public (Reference)   | Difference   | % Change   |
+=========================+=====================+======================+==============+============+
| Mean Latency (ms)       | 14440.01 ms         | 13118.40 ms          | 1321.61 ms   | 10.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Median Latency (ms)     | 12149.50 ms         | 12382.50 ms          | -233.00 ms   | -1.9%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Standard Deviation (ms) | 5312.72 ms          | 2853.94 ms           | 2458.78 ms   | 86.2%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| CV (%)                  | 36.79%              | 21.76%               | 15.04%       | 69.1%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| IQR (ms)                | 2231.00 ms          | 680.50 ms            | 1550.50 ms   | 227.8%     |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99 (ms)                | 35673.00 ms         | 26265.08 ms          | 9407.92 ms   | 35.8%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Max/Min Ratio           | 3.6163              | 2.7878               | 0.8286       | 29.7%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| P99/P50 Ratio           | 2.9362              | 2.1211               | 0.8150       | 38.4%      |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Score         | 0.0/100             | 2.7/100              | -2.7         | -100.0%    |
+-------------------------+---------------------+----------------------+--------------+------------+
| Stability Rating        | Poor                | Poor                 | N/A          | N/A        |
+-------------------------+---------------------+----------------------+--------------+------------+

Interpretation:
  Public environment shows better stability.
  (Private: 0.0/100 vs Public: 2.7/100)
  Public environment has 69.1% lower coefficient of variation, indicating more consistent performance.
  Public environment has 10.1% lower mean latency, indicating better performance.

Recommendation:
  The public environment provides better stability for this model+device combination.
  Consider investigating factors affecting stability in the private environment.

================================================================================
Warning: No matching reference dataset for mv3_xnnq8+iphone15max_ios17


====================================================================================================
===== INTRA-PRIMARY STABILITY COMPARISON ==========================================================
====================================================================================================


Intra-Primary Stability Comparison
================================================================================

Overall Summary:
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| Sheet                           | Model        | Device                |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |   Max/Min Ratio |   P99/P50 Ratio |
+=================================+==============+=======================+=====================+==========+===================+====================+=================+=================+
| mv3_qnn+s22ultra_android14      | mv3_qnn      | s22ultra_android14    |                1.01 |     0.91 |             93.81 | Excellent          |            1.09 |            1.02 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22_android13         |            21771.59 |     2.36 |             84.70 | Good               |            1.15 |            1.07 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+s22_android13      | llama3_qlora | s22_android13         |            22502.10 |     2.64 |             83.37 | Good               |            1.14 |            1.07 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_qnn+s22_android13           | mv3_qnn      | s22_android13         |                1.01 |     2.34 |             82.41 | Good               |            1.19 |            1.14 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra_android14    |            25022.84 |     6.18 |             62.54 | Moderate           |            1.27 |            1.13 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra_android14    |            24761.78 |     6.27 |             60.28 | Moderate           |            1.36 |            1.15 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8    | pixel3_rooted_android |                5.93 |     7.68 |             46.93 | Poor               |            1.70 |            1.24 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22_android13         |                2.73 |    23.03 |             14.94 | Poor               |            2.44 |            1.68 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max_ios17     |               13.98 |    24.60 |             10.82 | Poor               |            3.29 |            1.57 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max_ios17     |            15239.42 |    29.97 |              6.24 | Poor               |            2.50 |            2.36 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| mv3_xnnq8+s22ultra_android14    | mv3_xnnq8    | s22ultra_android14    |                2.91 |    39.08 |              0.00 | Poor               |            5.61 |            2.33 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max_ios17     |            14440.01 |    36.79 |              0.00 | Poor               |            3.62 |            2.94 |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+-----------------+-----------------+

Best and Worst Performers:
  Best stability: mv3_qnn+s22ultra_android14 (Score: 93.8/100)
  Worst stability: mv3_xnnq8+s22ultra_android14 (Score: 0.0/100)

Model-based Comparison:
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| Model        |   ('Stability Score', 'mean') |   ('Stability Score', 'min') |   ('Stability Score', 'max') |   ('CV (%)', 'mean') |   ('CV (%)', 'min') |   ('CV (%)', 'max') |
+==============+===============================+==============================+==============================+======================+=====================+=====================+
| mv3_qnn      |                         88.11 |                        82.41 |                        93.81 |                 1.62 |                0.91 |                2.34 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| llama3_qlora |                         50.72 |                         6.24 |                        83.37 |                12.93 |                2.64 |               29.97 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| llama3_spinq |                         48.33 |                         0.00 |                        84.70 |                15.14 |                2.36 |               36.79 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| mv3_xnnq8    |                         18.17 |                         0.00 |                        46.93 |                23.60 |                7.68 |               39.08 |
+--------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
  Most stable model: mv3_qnn (Avg. Score: 88.1/100)

Device-based Comparison:
+-----------------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| Device                |   ('Stability Score', 'mean') |   ('Stability Score', 'min') |   ('Stability Score', 'max') |   ('CV (%)', 'mean') |   ('CV (%)', 'min') |   ('CV (%)', 'max') |
+=======================+===============================+==============================+==============================+======================+=====================+=====================+
| s22_android13         |                         66.36 |                        14.94 |                        84.70 |                 7.59 |                2.34 |               23.03 |
+-----------------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| s22ultra_android14    |                         54.16 |                         0.00 |                        93.81 |                13.11 |                0.91 |               39.08 |
+-----------------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| pixel3_rooted_android |                         46.93 |                        46.93 |                        46.93 |                 7.68 |                7.68 |                7.68 |
+-----------------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
| iphone15max_ios17     |                          5.69 |                         0.00 |                        10.82 |                30.45 |               24.60 |               36.79 |
+-----------------------+-------------------------------+------------------------------+------------------------------+----------------------+---------------------+---------------------+
  Most stable device: s22_android13 (Avg. Score: 66.4/100)

Insights and Recommendations:
  - mv3_qnn shows the most consistent performance across devices.
  - mv3_xnnq8 shows more variability and may need further optimization.
  - s22_android13 provides the most stable environment for model execution.
  - iphone15max_ios17 shows higher variability and may not be ideal for latency-sensitive applications.
  - For critical applications requiring consistent performance, prefer:
    * Model: mv3_qnn
    * Device: s22_android13

================================================================================
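
The model-based and device-based comparisons above amount to grouping the per-dataset Stability Scores and taking mean, min, and max per group. A minimal sketch using only the standard library is shown here; the rows are copied from the Overall Summary table, while `summarize` is a hypothetical helper and not necessarily how the script aggregates.

```python
from collections import defaultdict
from statistics import mean

# (model, device, stability score) copied from the Overall Summary table.
rows = [
    ("mv3_qnn", "s22ultra_android14", 93.81),
    ("llama3_spinq", "s22_android13", 84.70),
    ("llama3_qlora", "s22_android13", 83.37),
    ("mv3_qnn", "s22_android13", 82.41),
    ("llama3_qlora", "s22ultra_android14", 62.54),
    ("llama3_spinq", "s22ultra_android14", 60.28),
    ("mv3_xnnq8", "pixel3_rooted_android", 46.93),
    ("mv3_xnnq8", "s22_android13", 14.94),
    ("mv3_xnnq8", "iphone15max_ios17", 10.82),
    ("llama3_qlora", "iphone15max_ios17", 6.24),
    ("mv3_xnnq8", "s22ultra_android14", 0.00),
    ("llama3_spinq", "iphone15max_ios17", 0.00),
]


def summarize(rows, key_index):
    """Group scores by the column at key_index and return mean/min/max per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {
        key: {"mean": mean(scores), "min": min(scores), "max": max(scores)}
        for key, scores in groups.items()
    }


by_model = summarize(rows, 0)
by_device = summarize(rows, 1)
best_model = max(by_model, key=lambda m: by_model[m]["mean"])
best_device = max(by_device, key=lambda d: by_device[d]["mean"])
print(f"Most stable model: {best_model} ({by_model[best_model]['mean']:.1f}/100)")
print(f"Most stable device: {best_device} ({by_device[best_device]['mean']:.1f}/100)")
```

Note that the groups have unequal sizes (pixel3_rooted_android has a single dataset, and mv3_qnn was not run on the iPhone), so the group means are not directly comparable. The tuple-style headers such as ('Stability Score', 'mean') also look like unflattened multi-level column labels; if the script builds these tables with pandas (an assumption), joining each tuple into a single string before printing would make them easier to read.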

********************************************************************************
********************************************************************************
********************************************************************************

Comprehensive Latency Stability Analysis Summary
================================================================================

Primary (Private) Datasets Summary:
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| Dataset                         | Model        | Device                |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |
+=================================+==============+=======================+=====================+==========+===================+====================+
| mv3_qnn+s22ultra_android14      | mv3_qnn      | s22ultra_android14    |                1.01 |     0.91 |             93.81 | Excellent          |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22_android13      | llama3_spinq | s22_android13         |            21771.59 |     2.36 |             84.70 | Good               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22_android13      | llama3_qlora | s22_android13         |            22502.10 |     2.64 |             83.37 | Good               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22_android13           | mv3_qnn      | s22_android13         |                1.01 |     2.34 |             82.41 | Good               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+s22ultra_android14 | llama3_qlora | s22ultra_android14    |            25022.84 |     6.18 |             62.54 | Moderate           |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+s22ultra_android14 | llama3_spinq | s22ultra_android14    |            24761.78 |     6.27 |             60.28 | Moderate           |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+pixel3_rooted_android | mv3_xnnq8    | pixel3_rooted_android |                5.93 |     7.68 |             46.93 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22_android13         | mv3_xnnq8    | s22_android13         |                2.73 |    23.03 |             14.94 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+iphone15max_ios17     | mv3_xnnq8    | iphone15max_ios17     |               13.98 |    24.60 |             10.82 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15max_ios17  | llama3_qlora | iphone15max_ios17     |            15239.42 |    29.97 |              6.24 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22ultra_android14    | mv3_xnnq8    | s22ultra_android14    |                2.91 |    39.08 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15max_ios17  | llama3_spinq | iphone15max_ios17     |            14440.01 |    36.79 |              0.00 | Poor               |
+---------------------------------+--------------+-----------------------+---------------------+----------+-------------------+--------------------+

Reference (Public) Datasets Summary:
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| Dataset                        | Model        | Device             |   Mean Latency (ms) |   CV (%) |   Stability Score | Stability Rating   |
+================================+==============+====================+=====================+==========+===================+====================+
| mv3_qnn+s22ultra_android14     | mv3_qnn      | s22ultra_android14 |                1.02 |     1.35 |             90.39 | Excellent          |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22ultra_android14   | mv3_xnnq8    | s22ultra_android14 |                3.63 |    22.35 |             15.48 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| llama3_qlora+iphone15max_ios17 | llama3_qlora | iphone15max_ios17  |            14133.01 |    21.37 |             10.57 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| llama3_spinq+iphone15max_ios17 | llama3_spinq | iphone15max_ios17  |            13118.40 |    21.76 |              2.65 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| mv3_xnnq8+s22_android13        | mv3_xnnq8    | s22_android13      |                1.92 |    55.09 |              0.00 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| mv3_qnn+s22_android13          | mv3_qnn      | s22_android13      |                1.44 |    57.29 |              0.00 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+
| mv3_spinq+iphone15max_ios17    | mv3_spinq    | iphone15max_ios17  |                5.05 |   136.34 |              0.00 | Poor               |
+--------------------------------+--------------+--------------------+---------------------+----------+-------------------+--------------------+

Private vs Public Comparison:
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Dataset                        | Model        | Device             |   Private Score |   Public Score |   Score Diff |   Private CV (%) |   Public CV (%) |   CV Diff (%) |
+================================+==============+====================+=================+================+==============+==================+=================+===============+
| mv3_qnn+s22_android13          | mv3_qnn      | s22_android13      |           82.41 |           0.00 |        82.41 |             2.34 |           57.29 |        -54.95 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8+s22_android13        | mv3_xnnq8    | s22_android13      |           14.94 |           0.00 |        14.94 |            23.03 |           55.09 |        -32.06 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_qnn+s22ultra_android14     | mv3_qnn      | s22ultra_android14 |           93.81 |          90.39 |         3.42 |             0.91 |            1.35 |         -0.44 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_spinq+iphone15max_ios17 | llama3_spinq | iphone15max_ios17  |            0.00 |           2.65 |        -2.65 |            36.79 |           21.76 |         15.04 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| llama3_qlora+iphone15max_ios17 | llama3_qlora | iphone15max_ios17  |            6.24 |          10.57 |        -4.33 |            29.97 |           21.37 |          8.60 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| mv3_xnnq8+s22ultra_android14   | mv3_xnnq8    | s22ultra_android14 |            0.00 |          15.48 |       -15.48 |            39.08 |           22.35 |         16.73 |
+--------------------------------+--------------+--------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+

Private environment is more stable in 3 of 6 cases.
Public environment is more stable in 3 of 6 cases.
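
The "3 of 6" tallies follow from the sign of the Score Diff column. A short sketch of that count, with the score pairs copied from the comparison table above (the tie handling is an assumption, since no ties occur in this data):

```python
# (dataset, private score, public score) copied from the Private vs Public table.
pairs = [
    ("mv3_qnn+s22_android13", 82.41, 0.00),
    ("mv3_xnnq8+s22_android13", 14.94, 0.00),
    ("mv3_qnn+s22ultra_android14", 93.81, 90.39),
    ("llama3_spinq+iphone15max_ios17", 0.00, 2.65),
    ("llama3_qlora+iphone15max_ios17", 6.24, 10.57),
    ("mv3_xnnq8+s22ultra_android14", 0.00, 15.48),
]

private_wins = sum(1 for _, private, public in pairs if private > public)
public_wins = sum(1 for _, private, public in pairs if public > private)
ties = len(pairs) - private_wins - public_wins  # counted separately, not as a win

print(f"Private environment is more stable in {private_wins} of {len(pairs)} cases.")
print(f"Public environment is more stable in {public_wins} of {len(pairs)} cases.")
```

A plain win count treats the 82.41-point gap on mv3_qnn+s22_android13 the same as the 2.65-point gap on llama3_spinq+iphone15max_ios17, so it is worth reading alongside the Score Diff values themselves.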

Overall Insights and Recommendations:
Stability Distribution in Private Datasets:
  - Poor: 6 dataset(s)
  - Good: 3 dataset(s)
  - Moderate: 2 dataset(s)
  - Excellent: 1 dataset(s)

Best Configurations:
  - Most stable configuration: mv3_qnn+s22ultra_android14 (Score: 93.8/100)
    Model: mv3_qnn, Device: s22ultra_android14

General Recommendations:
  1. For datasets with 'Poor' or 'Moderate' stability, investigate potential causes
     such as thermal throttling, background processes, or power management settings.
  2. Consider increasing warm-up iterations for datasets with high CV values.
  3. For critical applications, prefer models and devices with 'Good' or 'Excellent' stability.

================================================================================
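
Recommendation 2 above (more warm-up iterations for high-CV datasets) can be checked offline by discarding the first few iterations and recomputing CV. The sketch below is only an illustration: `cv_percent` and `drop_warmup` are hypothetical helpers, the default of 3 warm-up iterations is arbitrary, and the sample values are made up to show the effect rather than taken from the benchmark data.

```python
import statistics


def cv_percent(samples_ms):
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return statistics.stdev(samples_ms) / statistics.mean(samples_ms) * 100.0


def drop_warmup(samples_ms, warmup=3):
    """Discard the first `warmup` iterations, keeping at least two samples."""
    if len(samples_ms) - warmup < 2:
        raise ValueError("not enough samples left after warm-up")
    return samples_ms[warmup:]


# Made-up samples: the first two iterations are slow, the rest are steady.
samples = [9.0, 6.0, 3.0, 3.1, 2.9, 3.0, 3.1, 2.9]

print(f"CV with warm-up runs included: {cv_percent(samples):.1f}%")
print(f"CV after dropping 2 runs:      {cv_percent(drop_warmup(samples, 2)):.1f}%")
```

If CV stays high after the early iterations are dropped, warm-up is probably not the main cause, and the other factors listed in recommendation 1 (thermal throttling, background processes, power management) become more likely candidates.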

Analysis complete. Results saved to stability_analysis_results/

pytorch-bot bot commented May 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10982

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit cae229e with merge base 6daeb64:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label May 19, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:.

If not, please add the release notes: none label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@guangy10 guangy10 requested review from huydhn and yangw-dev May 19, 2025 22:13
Contributor

@yangw-dev yangw-dev left a comment

I recommend adding the pip dependencies to a requirements.txt next to analyze_latency_stability.py.

It might also be good to give the script its own folder.

@guangy10 guangy10 changed the title Script for benchmark satbility assessment [Not To Land] Script for benchmark satbility assessment May 19, 2025
@guangy10 guangy10 force-pushed the benchmark_assessment branch from 11c450b to db1819b Compare May 21, 2025 22:35