[UR] Bump UMF to v0.11.0-dev3 #17188

Closed

lukaszstolarczuk
Contributor

From now on, disjoint_pool is part of libumf instead of being a separate library.

@lukaszstolarczuk
Contributor Author

lukaszstolarczuk commented Mar 3, 2025

@kbenzie, could you please re-run the failing job(s)? At least the BMG one, which failed with a "lost communication with the server" error.

@lukaszstolarczuk
Contributor Author

Changing to draft, as Regression/static-buffer-dtor.cpp keeps failing in the CUDA build; to be debugged.

Contributor

Compute Benchmarks level_zero run (with params: ):
https://github.com/intel/llvm/actions/runs/13787698847

Contributor

Benchmarks level_zero run ():
https://github.com/intel/llvm/actions/runs/13787698847
Job status: failure. Test status: skipped.

Contributor

Compute Benchmarks level_zero run (with params: ):
https://github.com/intel/llvm/actions/runs/13809259824

Contributor

Benchmarks level_zero run ():
https://github.com/intel/llvm/actions/runs/13809259824
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)
No diffs to calculate performance change

Performance change in benchmark groups

Velocity Bench
Relative perf in group Other (8)
Benchmark This PR
Velocity-Bench Hashtable 331.478078 M keys/sec
Velocity-Bench Bitcracker 35.343600 s
Velocity-Bench CudaSift 202.858000 ms
Velocity-Bench QuickSilver 118.170000 MMS/CTT
Velocity-Bench Sobel Filter 707.292000 ms
Velocity-Bench dl-cifar 24.512200 s
Velocity-Bench dl-mnist 2.660000 s
Velocity-Bench svm 0.139200 s
SYCL-Bench
Relative perf in group Other (53)
Benchmark This PR
Runtime_IndependentDAGTaskThroughput_SingleTask 232.858000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 249.315000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 248.844000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 243.072000 ms
Runtime_DAGTaskThroughput_SingleTask 1688.583000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1743.255000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1730.711000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1707.245000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 6.231000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 6.246000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 6.070000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 6.136000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 652.904000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 652.951000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 6.017000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 6.233000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 6.308000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 6.039000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 652.737000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 652.780000 ms
MicroBench_LocalMem_int32_4096 29.604000 ms
MicroBench_LocalMem_fp32_4096 29.696000 ms
Pattern_Reduction_NDRange_int32 13.759000 ms
Pattern_Reduction_Hierarchical_int32 12.421000 ms
ScalarProduct_NDRange_int32 3.864000 ms
ScalarProduct_NDRange_int64 5.577000 ms
ScalarProduct_NDRange_fp32 3.856000 ms
ScalarProduct_Hierarchical_int32 11.343000 ms
ScalarProduct_Hierarchical_int64 12.277000 ms
ScalarProduct_Hierarchical_fp32 10.980000 ms
Pattern_SegmentedReduction_NDRange_int16 2.405000 ms
Pattern_SegmentedReduction_NDRange_int32 2.313000 ms
Pattern_SegmentedReduction_NDRange_int64 2.505000 ms
Pattern_SegmentedReduction_NDRange_fp32 2.322000 ms
Pattern_SegmentedReduction_Hierarchical_int16 12.414000 ms
Pattern_SegmentedReduction_Hierarchical_int32 12.302000 ms
Pattern_SegmentedReduction_Hierarchical_int64 12.491000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 12.291000 ms
USM_Allocation_latency_fp32_device 0.054000 ms
USM_Allocation_latency_fp32_host 18.909000 ms
USM_Allocation_latency_fp32_shared 0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.673000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.085000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.825000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.231000 ms
VectorAddition_int32 1.441000 ms
VectorAddition_int64 3.079000 ms
VectorAddition_fp32 1.453000 ms
Polybench_2mm 1.273000 ms
Polybench_3mm 1.797000 ms
Polybench_Atax 7.001000 ms
Kmeans_fp32 16.161000 ms
MolecularDynamics 0.030000 ms
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR
llama.cpp Prompt Processing Batched 128 882.656841 token/s
llama.cpp Text Generation Batched 128 61.600374 token/s
llama.cpp Prompt Processing Batched 256 938.157778 token/s
llama.cpp Text Generation Batched 256 61.711753 token/s
llama.cpp Prompt Processing Batched 512 540.728519 token/s
llama.cpp Text Generation Batched 512 61.711133 token/s

Details

Benchmark details - environment, command...
Velocity-Bench Hashtable

Command:

/home/test-user/llvm_bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Command:

/home/test-user/llvm_bench_workdir/bitcracker/bitcracker -f /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Command:

/home/test-user/llvm_bench_workdir/cudaSift/cudaSift

Velocity-Bench QuickSilver

Command:

/home/test-user/llvm_bench_workdir/QuickSilver/qs -i /home/test-user/llvm_bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Environment Variables:

QS_DEVICE=GPU

Velocity-Bench Sobel Filter

Command:

/home/test-user/llvm_bench_workdir/sobel_filter/sobel_filter -i /home/test-user/llvm_bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Velocity-Bench dl-cifar

Command:

/home/test-user/llvm_bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Command:

/home/test-user/llvm_bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Velocity-Bench svm

Command:

/home/test-user/llvm_bench_workdir/svm/svm_sycl /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/2mm.csv --size=512

Polybench_3mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/3mm.csv --size=512

Polybench_Atax

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Kmeans.csv --size=700000000

MolecularDynamics

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

Contributor

Compute Benchmarks level_zero_v2 run (with params: ):
https://github.com/intel/llvm/actions/runs/13816923286

Contributor

Benchmarks level_zero_v2 run ():
https://github.com/intel/llvm/actions/runs/13816923286
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)
No diffs to calculate performance change

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (4)
Benchmark This PR
api_overhead_benchmark_l0 SubmitKernel out of order 11.887000 μs
api_overhead_benchmark_l0 SubmitKernel in order 11.873000 μs
api_overhead_benchmark_sycl SubmitKernel out of order 24.920000 μs
api_overhead_benchmark_sycl SubmitKernel in order 25.791000 μs
Relative perf in group Other (6)
Benchmark This PR
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 264.480000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 142.499000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 6.021000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.148000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.763000 μs
miscellaneous_benchmark_sycl VectorSum 859.195000 bw GB/s
Relative perf in group SinKernelGraph 5 (3)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5 28.871000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5 25.549000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5 28.676000 μs
Relative perf in group SinKernelGraph 100 (3)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 320.949000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100 249.025000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100 251.866000 μs
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR
Velocity-Bench Hashtable 309.159275 M keys/sec
Velocity-Bench Bitcracker 35.365600 s
Velocity-Bench CudaSift 206.628000 ms
Velocity-Bench QuickSilver 117.800000 MMS/CTT
Velocity-Bench Sobel Filter 715.868000 ms
Velocity-Bench dl-cifar 24.974700 s
Velocity-Bench dl-mnist 2.670000 s
Velocity-Bench svm 0.160400 s
SYCL-Bench
Relative perf in group Other (54)
Benchmark This PR
Runtime_IndependentDAGTaskThroughput_SingleTask 247.887000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 256.690000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 259.495000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 255.101000 ms
Runtime_DAGTaskThroughput_SingleTask 1728.113000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1806.508000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1772.458000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1743.728000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 6.112000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 6.161000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 6.058000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 6.065000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 652.959000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 652.927000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 5.997000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 6.330000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 6.257000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 6.071000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 652.775000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 652.839000 ms
MicroBench_LocalMem_int32_4096 29.650000 ms
MicroBench_LocalMem_fp32_4096 29.727000 ms
Pattern_Reduction_NDRange_int32 13.883000 ms
Pattern_Reduction_Hierarchical_int32 12.207000 ms
ScalarProduct_NDRange_int32 3.864000 ms
ScalarProduct_NDRange_int64 5.547000 ms
ScalarProduct_NDRange_fp32 3.876000 ms
ScalarProduct_Hierarchical_int32 11.332000 ms
ScalarProduct_Hierarchical_int64 12.294000 ms
ScalarProduct_Hierarchical_fp32 11.007000 ms
Pattern_SegmentedReduction_NDRange_int16 2.409000 ms
Pattern_SegmentedReduction_NDRange_int32 2.314000 ms
Pattern_SegmentedReduction_NDRange_int64 2.506000 ms
Pattern_SegmentedReduction_NDRange_fp32 2.319000 ms
Pattern_SegmentedReduction_Hierarchical_int16 12.405000 ms
Pattern_SegmentedReduction_Hierarchical_int32 12.307000 ms
Pattern_SegmentedReduction_Hierarchical_int64 12.493000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 12.300000 ms
USM_Allocation_latency_fp32_device 0.058000 ms
USM_Allocation_latency_fp32_host 19.023000 ms
USM_Allocation_latency_fp32_shared 0.060000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.690000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.072000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.844000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.232000 ms
VectorAddition_int32 1.517000 ms
VectorAddition_int64 3.099000 ms
VectorAddition_fp32 1.502000 ms
Polybench_2mm 1.272000 ms
Polybench_3mm 1.796000 ms
Polybench_Atax 7.053000 ms
Kmeans_fp32 16.105000 ms
LinearRegressionCoeff_fp32 1124.424000 ms
MolecularDynamics 0.031000 ms
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR
llama.cpp Prompt Processing Batched 128 877.032946 token/s
llama.cpp Text Generation Batched 128 61.433078 token/s
llama.cpp Prompt Processing Batched 256 943.320936 token/s
llama.cpp Text Generation Batched 256 61.459471 token/s
llama.cpp Prompt Processing Batched 512 541.074918 token/s
llama.cpp Text Generation Batched 512 61.421224 token/s
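
Both summaries report "No diffs to calculate performance change", so the bot computed no comparison. The level_zero and level_zero_v2 numbers can still be compared by hand. The snippet below is only a rough sketch: the values are copied from the two summaries in this thread, the helper names (`relative_change`, `report`) are hypothetical and not part of the benchmark scripts, and a single run per adapter says nothing about noise.

```python
# Values copied from the level_zero and level_zero_v2 summaries in this thread.
# Each entry: (level_zero result, level_zero_v2 result, higher_is_better).
RESULTS = {
    "Velocity-Bench Hashtable (M keys/sec)": (331.478078, 309.159275, True),
    "Velocity-Bench svm (s)": (0.139200, 0.160400, False),
    "llama.cpp Prompt Processing Batched 256 (token/s)": (938.157778, 943.320936, True),
}


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline


def report(results=RESULTS) -> list[str]:
    """Format one line per benchmark, treating level_zero as the baseline."""
    lines = []
    for name, (l0, l0_v2, higher_is_better) in results.items():
        change = relative_change(l0, l0_v2)
        # A positive change is an improvement only for throughput-style
        # metrics; for time-style metrics a negative change is better.
        improved = (change > 0) == higher_is_better
        verdict = "better" if improved else "worse"
        lines.append(f"{name}: {change:+.2%} ({verdict} on level_zero_v2)")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The `higher_is_better` flag matters because the tables mix throughput units (M keys/sec, token/s) with time units (s, ms, μs), so the sign of the change alone does not say which adapter did better.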

Details

Benchmark details - environment, command...
api_overhead_benchmark_l0 SubmitKernel out of order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_l0 SubmitKernel in order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel out of order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel in order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

miscellaneous_benchmark_sycl VectorSum

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

Velocity-Bench Hashtable

Command:

/home/test-user/llvm_bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Command:

/home/test-user/llvm_bench_workdir/bitcracker/bitcracker -f /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Command:

/home/test-user/llvm_bench_workdir/cudaSift/cudaSift

Velocity-Bench QuickSilver

Command:

/home/test-user/llvm_bench_workdir/QuickSilver/qs -i /home/test-user/llvm_bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Environment Variables:

QS_DEVICE=GPU

Velocity-Bench Sobel Filter

Command:

/home/test-user/llvm_bench_workdir/sobel_filter/sobel_filter -i /home/test-user/llvm_bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Velocity-Bench dl-cifar

Command:

/home/test-user/llvm_bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Command:

/home/test-user/llvm_bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Velocity-Bench svm

Command:

/home/test-user/llvm_bench_workdir/svm/svm_sycl /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/2mm.csv --size=512

Polybench_3mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/3mm.csv --size=512

Polybench_Atax

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Kmeans.csv --size=700000000

LinearRegressionCoeff_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LinearRegressionCoeff.csv --size=1638400000

MolecularDynamics

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

From now on disjoint_pool is part of libumf, instead of being
a separate library.

Compute Benchmarks level_zero run (with params: --compare baseline):
https://github.com/intel/llvm/actions/runs/13833722370

Benchmarks level_zero run (--compare baseline):
https://github.com/intel/llvm/actions/runs/13833722370
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)
No diffs to calculate performance change

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (4)
Benchmark This PR
api_overhead_benchmark_l0 SubmitKernel out of order 11.813000 μs
api_overhead_benchmark_l0 SubmitKernel in order 11.539000 μs
api_overhead_benchmark_sycl SubmitKernel out of order 24.480000 μs
api_overhead_benchmark_sycl SubmitKernel in order 25.742000 μs
Relative perf in group Other (6)
Benchmark This PR
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 264.778000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 141.978000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.783000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.195000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.760000 μs
miscellaneous_benchmark_sycl VectorSum 859.195000 bw GB/s
Relative perf in group SinKernelGraph 5 (3)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5 29.307000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5 25.950000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5 28.703000 μs
Relative perf in group SinKernelGraph 100 (3)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 317.816000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100 249.738000 μs
graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100 250.464000 μs
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR
Velocity-Bench Hashtable 316.774863 M keys/sec
Velocity-Bench Bitcracker 35.393000 s
Velocity-Bench CudaSift 206.722000 ms
Velocity-Bench QuickSilver 118.130000 MMS/CTT
Velocity-Bench Sobel Filter 741.322000 ms
Velocity-Bench dl-cifar 25.023900 s
Velocity-Bench dl-mnist 2.690000 s
Velocity-Bench svm 0.159800 s
SYCL-Bench
Relative perf in group Other (54)
Benchmark This PR
Runtime_IndependentDAGTaskThroughput_SingleTask 240.625000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 247.900000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 251.220000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 246.031000 ms
Runtime_DAGTaskThroughput_SingleTask 1742.063000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1820.915000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1779.395000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1778.322000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 6.268000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 6.272000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 6.074000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 6.143000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 652.929000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 652.902000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 6.056000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 6.382000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 6.278000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 6.190000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 652.897000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 652.889000 ms
MicroBench_LocalMem_int32_4096 29.617000 ms
MicroBench_LocalMem_fp32_4096 29.693000 ms
Pattern_Reduction_NDRange_int32 13.676000 ms
Pattern_Reduction_Hierarchical_int32 12.161000 ms
ScalarProduct_NDRange_int32 3.869000 ms
ScalarProduct_NDRange_int64 5.572000 ms
ScalarProduct_NDRange_fp32 3.865000 ms
ScalarProduct_Hierarchical_int32 11.351000 ms
ScalarProduct_Hierarchical_int64 12.333000 ms
ScalarProduct_Hierarchical_fp32 11.033000 ms
Pattern_SegmentedReduction_NDRange_int16 2.409000 ms
Pattern_SegmentedReduction_NDRange_int32 2.315000 ms
Pattern_SegmentedReduction_NDRange_int64 2.505000 ms
Pattern_SegmentedReduction_NDRange_fp32 2.324000 ms
Pattern_SegmentedReduction_Hierarchical_int16 12.410000 ms
Pattern_SegmentedReduction_Hierarchical_int32 12.295000 ms
Pattern_SegmentedReduction_Hierarchical_int64 12.498000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 12.297000 ms
USM_Allocation_latency_fp32_device 0.058000 ms
USM_Allocation_latency_fp32_host 18.966000 ms
USM_Allocation_latency_fp32_shared 0.064000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.665000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.059000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.817000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.215000 ms
VectorAddition_int32 1.484000 ms
VectorAddition_int64 3.096000 ms
VectorAddition_fp32 1.510000 ms
Polybench_2mm 1.255000 ms
Polybench_3mm 1.798000 ms
Polybench_Atax 7.037000 ms
Kmeans_fp32 16.116000 ms
LinearRegressionCoeff_fp32 1108.002000 ms
MolecularDynamics 0.031000 ms
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR
llama.cpp Prompt Processing Batched 128 884.411025 token/s
llama.cpp Text Generation Batched 128 60.936898 token/s
llama.cpp Prompt Processing Batched 256 932.270033 token/s
llama.cpp Text Generation Batched 256 60.938115 token/s
llama.cpp Prompt Processing Batched 512 540.706265 token/s
llama.cpp Text Generation Batched 512 60.940467 token/s
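
The summary rows above all follow the same flattened layout: a benchmark name (which may itself contain spaces and integers), one decimal value, and a unit. As a minimal sketch, assuming that layout holds for every row, the snippet below splits a row back into its three fields. `parse_row` is a hypothetical helper written for illustration; it is not part of the benchmark scripts that produced this comment.

```python
import re

# Hypothetical helper (an assumption for illustration, not the bot's code).
# A row is "<name> <decimal value> <unit>"; the name is matched lazily so that
# integers inside it (e.g. "size 1024", "Batched 128") stay part of the name.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<value>\d+\.\d+)\s+(?P<unit>.+)$")


def parse_row(line):
    """Return (name, value, unit) for a result row, or None for header lines."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return match.group("name"), float(match.group("value")), match.group("unit")


rows = [
    "Benchmark This PR",
    "api_overhead_benchmark_l0 SubmitKernel out of order 11.813000 μs",
    "llama.cpp Prompt Processing Batched 128 884.411025 token/s",
]
parsed = [r for r in (parse_row(line) for line in rows) if r is not None]
```

Header lines such as `Benchmark This PR` or `Relative perf in group ...` contain no decimal value, so they come back as `None` and can be filtered out as shown.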

Details

Benchmark details - environment, command...
api_overhead_benchmark_l0 SubmitKernel out of order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_l0 SubmitKernel in order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel out of order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel in order

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

miscellaneous_benchmark_sycl VectorSum

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:5

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=5 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:0, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=0 --withCopyOffload=1 --immediateAppendCmdList=0

graph_api_benchmark_l0 SinKernelGraph graphs:1, numKernels:100

Command:

/home/test-user/llvm_bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_l0 --test=SinKernelGraph --csv --noHeaders --iterations=10000 --numKernels=100 --withGraphs=1 --withCopyOffload=1 --immediateAppendCmdList=0

Velocity-Bench Hashtable

Command:

/home/test-user/llvm_bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Command:

/home/test-user/llvm_bench_workdir/bitcracker/bitcracker -f /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/llvm_bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Command:

/home/test-user/llvm_bench_workdir/cudaSift/cudaSift

Velocity-Bench QuickSilver

Command:

/home/test-user/llvm_bench_workdir/QuickSilver/qs -i /home/test-user/llvm_bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Environment Variables:

QS_DEVICE=GPU

Velocity-Bench Sobel Filter

Command:

/home/test-user/llvm_bench_workdir/sobel_filter/sobel_filter -i /home/test-user/llvm_bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Velocity-Bench dl-cifar

Command:

/home/test-user/llvm_bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Command:

/home/test-user/llvm_bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0
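
Several entries in this listing pair a `Command:` line with an `Environment Variables:` block. A rough sketch of how such a pair could be turned into an argument list and an environment for a local re-run is shown below. `build_invocation` is a hypothetical helper introduced here for illustration, and the sketch assumes each variable appears as a single `NAME=value` line, as in the entry above.

```python
import os
import shlex


def build_invocation(command, env_lines):
    """Return (argv, env) for a 'Command:' line plus 'NAME=value' lines.

    Hypothetical helper: it only prepares the arguments and does not launch
    anything.
    """
    argv = shlex.split(command)
    env = dict(os.environ)  # start from a copy of the current environment
    for line in env_lines:
        name, _, value = line.partition("=")
        env[name.strip()] = value.strip()
    return argv, env


argv, env = build_invocation(
    "/home/test-user/llvm_bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO",
    ["NEOReadDebugKeys=1", "DisableScratchPages=0"],
)
# subprocess.run(argv, env=env, check=True) would then start the benchmark,
# provided the binary exists at that path on the local machine.
```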

Velocity-Bench svm

Command:

/home/test-user/llvm_bench_workdir/svm/svm_sycl /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/test-user/llvm_bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/2mm.csv --size=512

Polybench_3mm

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/3mm.csv --size=512

Polybench_Atax

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/Kmeans.csv --size=700000000

LinearRegressionCoeff_fp32

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/LinearRegressionCoeff.csv --size=1638400000

MolecularDynamics

Command:

/home/test-user/llvm_bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/test-user/llvm_bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Command:

/home/test-user/llvm_bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/test-user/llvm_bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

@lukaszstolarczuk
Contributor Author

Replaced with #17468

@lukaszstolarczuk
Contributor Author

Replaced with a newer version.
