massive mips and loongarch optimization#6662

Open
nihui wants to merge 171 commits into Tencent:master from nihui:mips-opt3

Conversation


@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 98.04736% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.93%. Comparing base (f6b75ce) to head (dc76c98).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/layer/loongarch/convolution_loongarch.cpp | 75.40% | 107 Missing ⚠️ |
| src/layer/loongarch/binaryop_loongarch.cpp | 97.87% | 11 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_bf16s.h | 99.75% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_int8.h | 98.87% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution1d_loongarch.cpp | 95.00% | 2 Missing ⚠️ |
| src/layer/loongarch/convolution_packed.h | 99.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   95.83%   95.93%   +0.09%     
==========================================
  Files         933      966      +33     
  Lines      312469   402676   +90207     
==========================================
+ Hits       299448   386297   +86849     
- Misses      13021    16379    +3358     

☔ View full report in Codecov by Sentry.

nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

nihui commented May 5, 2026

3a4000 loongnix-20.rc2
4.19.0-12-loongson-3
gcc 8.3.0

1 thread (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 50.39 43.98 47.47
squeezenet_int8 72.41 31.37 32.10
mobilenet 88.65 75.43 85.41
mobilenet_int8 167.50 94.52 94.64
mobilenet_v2 56.49 54.21 59.64
mobilenet_v3 49.06 44.91 45.72
shufflenet 32.81 28.13 35.56
shufflenet_v2 29.72 26.85 49.94
mnasnet 60.99 57.07 56.71
proxylessnasnet 73.91 70.10 60.89
efficientnet_b0 114.01 108.94 99.70
efficientnetv2_b0 121.99 110.28 116.13
regnety_400m 80.49 71.92 70.35
blazeface 11.11 8.19 10.66
googlenet 225.69 155.35 160.72
googlenet_int8 294.52 109.17 108.14
resnet18 148.09 127.78 136.00
resnet18_int8 211.71 86.76 88.26
alexnet 190.86 90.68 89.89
vgg16 789.75 629.14 614.15
vgg16_int8 1079.66 507.15 503.41
resnet50 425.48 349.52 382.68
resnet50_int8 581.01 223.40 226.90
squeezenet_ssd 130.99 101.96 107.98
squeezenet_ssd_int8 156.03 77.19 80.32
mobilenet_ssd 180.59 152.79 172.46
mobilenet_ssd_int8 324.44 177.15 178.49
mobilenet_yolo 439.16 350.95 418.24
mobilenetv2_yolov3 205.10 188.08 205.93
yolov4-tiny 264.84 227.00 236.27
nanodet_m 68.40 63.08 111.86
yolo-fastest-1.1 27.54 30.24 34.71
yolo-fastestv2 26.70 34.34 34.59
vision_transformer 15111.31 1574.79 1732.48
FastestDet 30.63 38.17 36.35
4 threads (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 14.87 13.73 13.85
squeezenet_int8 21.31 12.56 12.75
mobilenet 25.42 21.07 22.06
mobilenet_int8 42.85 26.43 26.80
mobilenet_v2 17.03 16.58 16.68
mobilenet_v3 14.99 14.21 14.73
shufflenet 11.82 10.15 12.88
shufflenet_v2 10.70 9.17 17.06
mnasnet 17.39 17.18 16.66
proxylessnasnet 20.82 20.04 17.21
efficientnet_b0 32.34 31.45 28.16
efficientnetv2_b0 35.44 33.65 34.09
regnety_400m 36.17 27.22 31.53
blazeface 3.94 2.80 3.50
googlenet 65.40 47.53 46.11
googlenet_int8 79.55 36.92 37.01
resnet18 44.34 40.81 41.23
resnet18_int8 56.32 27.14 27.72
alexnet 53.63 27.65 28.60
vgg16 258.47 217.32 211.31
vgg16_int8 293.93 168.97 167.64
resnet50 124.35 103.89 105.97
resnet50_int8 154.59 70.36 70.70
squeezenet_ssd 47.87 35.84 36.16
squeezenet_ssd_int8 49.50 31.48 31.89
mobilenet_ssd 53.55 44.46 45.25
mobilenet_ssd_int8 83.69 49.60 50.63
mobilenet_yolo 159.68 110.33 138.23
mobilenetv2_yolov3 66.82 61.07 60.27
yolov4-tiny 94.52 81.87 78.96
nanodet_m 23.56 21.38 37.49
yolo-fastest-1.1 10.15 12.44 15.84
yolo-fastestv2 10.65 15.58 15.02
vision_transformer 3950.69 452.89 489.99
FastestDet 11.37 16.08 14.94


nihui commented May 5, 2026

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

1 thread (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 21.25 13.95 13.04
squeezenet_int8 35.65 14.07 13.57
mobilenet 37.77 23.09 27.14
mobilenet_int8 75.81 25.75 26.37
mobilenet_v2 25.06 17.41 18.97
mobilenet_v3 19.97 13.61 16.68
shufflenet 12.67 9.11 10.26
shufflenet_v2 12.24 9.90 14.48
mnasnet 25.07 17.13 19.59
proxylessnasnet 30.95 20.29 22.43
efficientnet_b0 49.33 33.70 36.41
efficientnetv2_b0 55.41 35.45 38.92
regnety_400m 33.78 21.25 22.97
blazeface 5.34 3.06 3.05
googlenet 87.11 51.60 48.50
googlenet_int8 133.64 49.07 48.45
resnet18 68.85 46.54 41.45
resnet18_int8 114.22 40.68 40.48
alexnet 96.10 28.97 30.13
vgg16 360.85 202.68 188.24
vgg16_int8 631.80 215.56 215.99
resnet50 187.97 113.83 117.90
resnet50_int8 295.43 97.92 98.30
squeezenet_ssd 62.01 37.85 37.05
squeezenet_ssd_int8 81.80 37.15 37.27
mobilenet_ssd 75.75 48.12 56.19
mobilenet_ssd_int8 147.48 50.52 52.51
mobilenet_yolo 197.19 109.40 136.56
mobilenetv2_yolov3 86.50 60.73 66.07
yolov4-tiny 117.88 76.88 71.40
nanodet_m 28.51 22.75 32.76
yolo-fastest-1.1 11.88 9.27 13.16
yolo-fastestv2 10.55 11.84 10.32
vision_transformer 4582.75 1131.90 1044.43
FastestDet 11.81 13.16 11.67
4 threads (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 7.81 5.30 4.51
squeezenet_int8 10.22 5.90 5.24
mobilenet 12.95 7.45 7.22
mobilenet_int8 19.32 7.81 7.87
mobilenet_v2 8.57 6.65 5.61
mobilenet_v3 6.71 5.57 5.79
shufflenet 4.62 4.25 4.37
shufflenet_v2 4.40 4.56 5.67
mnasnet 7.72 5.73 6.00
proxylessnasnet 9.19 6.42 6.58
efficientnet_b0 14.82 11.66 10.63
efficientnetv2_b0 18.47 12.99 12.73
regnety_400m 14.93 11.38 12.31
blazeface 1.84 1.19 1.16
googlenet 28.37 20.03 17.86
googlenet_int8 36.93 18.92 17.77
resnet18 25.74 20.70 19.03
resnet18_int8 32.41 15.51 15.19
alexnet 31.40 14.62 14.01
vgg16 154.29 116.29 101.91
vgg16_int8 189.38 88.30 84.73
resnet50 66.72 43.77 40.29
resnet50_int8 78.62 37.57 33.78
squeezenet_ssd 28.95 19.96 17.67
squeezenet_ssd_int8 28.73 19.23 16.18
mobilenet_ssd 28.96 17.07 15.94
mobilenet_ssd_int8 38.48 16.18 17.90
mobilenet_yolo 96.66 39.68 49.54
mobilenetv2_yolov3 35.14 25.44 21.96
yolov4-tiny 56.07 37.11 32.84
nanodet_m 10.74 10.29 12.38
yolo-fastest-1.1 4.57 4.72 6.42
yolo-fastestv2 4.33 6.10 5.43
vision_transformer 1217.27 369.71 295.17
FastestDet 4.73 6.19 5.50

@nihui nihui closed this May 11, 2026
@nihui nihui reopened this May 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6b5905e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/layer/mips/rmsnorm_mips.cpp
Comment thread src/layer/loongarch/rmsnorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1c814c823


Comment thread src/layer/mips/layernorm_mips.cpp
Comment thread src/layer/loongarch/layernorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2119f149


Comment thread src/layer/loongarch/layernorm_loongarch.cpp
Comment thread src/layer/mips/rmsnorm_mips.cpp

4 participants