massive mips and loongarch optimization#6662

Open
nihui wants to merge 171 commits into Tencent:master from nihui:mips-opt3

Conversation


@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 98.04736% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.93%. Comparing base (f6b75ce) to head (dc76c98).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/layer/loongarch/convolution_loongarch.cpp | 75.40% | 107 Missing ⚠️ |
| src/layer/loongarch/binaryop_loongarch.cpp | 97.87% | 11 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_bf16s.h | 99.75% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_int8.h | 98.87% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution1d_loongarch.cpp | 95.00% | 2 Missing ⚠️ |
| src/layer/loongarch/convolution_packed.h | 99.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   95.83%   95.93%   +0.09%     
==========================================
  Files         933      966      +33     
  Lines      312469   402676   +90207     
==========================================
+ Hits       299448   386297   +86849     
- Misses      13021    16379    +3358     

☔ View full report in Codecov by Sentry.

nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

nihui commented May 5, 2026

3a4000 loongnix-20.rc2
4.19.0-12-loongson-3
gcc 8.3.0

1 thread (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 50.39 43.98 47.47
squeezenet_int8 72.41 31.37 32.10
mobilenet 88.65 75.43 85.41
mobilenet_int8 167.50 94.52 94.64
mobilenet_v2 56.49 54.21 59.64
mobilenet_v3 49.06 44.91 45.72
shufflenet 32.81 28.13 35.56
shufflenet_v2 29.72 26.85 49.94
mnasnet 60.99 57.07 56.71
proxylessnasnet 73.91 70.10 60.89
efficientnet_b0 114.01 108.94 99.70
efficientnetv2_b0 121.99 110.28 116.13
regnety_400m 80.49 71.92 70.35
blazeface 11.11 8.19 10.66
googlenet 225.69 155.35 160.72
googlenet_int8 294.52 109.17 108.14
resnet18 148.09 127.78 136.00
resnet18_int8 211.71 86.76 88.26
alexnet 190.86 90.68 89.89
vgg16 789.75 629.14 614.15
vgg16_int8 1079.66 507.15 503.41
resnet50 425.48 349.52 382.68
resnet50_int8 581.01 223.40 226.90
squeezenet_ssd 130.99 101.96 107.98
squeezenet_ssd_int8 156.03 77.19 80.32
mobilenet_ssd 180.59 152.79 172.46
mobilenet_ssd_int8 324.44 177.15 178.49
mobilenet_yolo 439.16 350.95 418.24
mobilenetv2_yolov3 205.10 188.08 205.93
yolov4-tiny 264.84 227.00 236.27
nanodet_m 68.40 63.08 111.86
yolo-fastest-1.1 27.54 30.24 34.71
yolo-fastestv2 26.70 34.34 34.59
vision_transformer 15111.31 1574.79 1732.48
FastestDet 30.63 38.17 36.35
4 threads (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 14.87 13.73 13.85
squeezenet_int8 21.31 12.56 12.75
mobilenet 25.42 21.07 22.06
mobilenet_int8 42.85 26.43 26.80
mobilenet_v2 17.03 16.58 16.68
mobilenet_v3 14.99 14.21 14.73
shufflenet 11.82 10.15 12.88
shufflenet_v2 10.70 9.17 17.06
mnasnet 17.39 17.18 16.66
proxylessnasnet 20.82 20.04 17.21
efficientnet_b0 32.34 31.45 28.16
efficientnetv2_b0 35.44 33.65 34.09
regnety_400m 36.17 27.22 31.53
blazeface 3.94 2.80 3.50
googlenet 65.40 47.53 46.11
googlenet_int8 79.55 36.92 37.01
resnet18 44.34 40.81 41.23
resnet18_int8 56.32 27.14 27.72
alexnet 53.63 27.65 28.60
vgg16 258.47 217.32 211.31
vgg16_int8 293.93 168.97 167.64
resnet50 124.35 103.89 105.97
resnet50_int8 154.59 70.36 70.70
squeezenet_ssd 47.87 35.84 36.16
squeezenet_ssd_int8 49.50 31.48 31.89
mobilenet_ssd 53.55 44.46 45.25
mobilenet_ssd_int8 83.69 49.60 50.63
mobilenet_yolo 159.68 110.33 138.23
mobilenetv2_yolov3 66.82 61.07 60.27
yolov4-tiny 94.52 81.87 78.96
nanodet_m 23.56 21.38 37.49
yolo-fastest-1.1 10.15 12.44 15.84
yolo-fastestv2 10.65 15.58 15.02
vision_transformer 3950.69 452.89 489.99
FastestDet 11.37 16.08 14.94


nihui commented May 5, 2026

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

1 thread (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 21.25 13.95 13.04
squeezenet_int8 35.65 14.07 13.57
mobilenet 37.77 23.09 27.14
mobilenet_int8 75.81 25.75 26.37
mobilenet_v2 25.06 17.41 18.97
mobilenet_v3 19.97 13.61 16.68
shufflenet 12.67 9.11 10.26
shufflenet_v2 12.24 9.90 14.48
mnasnet 25.07 17.13 19.59
proxylessnasnet 30.95 20.29 22.43
efficientnet_b0 49.33 33.70 36.41
efficientnetv2_b0 55.41 35.45 38.92
regnety_400m 33.78 21.25 22.97
blazeface 5.34 3.06 3.05
googlenet 87.11 51.60 48.50
googlenet_int8 133.64 49.07 48.45
resnet18 68.85 46.54 41.45
resnet18_int8 114.22 40.68 40.48
alexnet 96.10 28.97 30.13
vgg16 360.85 202.68 188.24
vgg16_int8 631.80 215.56 215.99
resnet50 187.97 113.83 117.90
resnet50_int8 295.43 97.92 98.30
squeezenet_ssd 62.01 37.85 37.05
squeezenet_ssd_int8 81.80 37.15 37.27
mobilenet_ssd 75.75 48.12 56.19
mobilenet_ssd_int8 147.48 50.52 52.51
mobilenet_yolo 197.19 109.40 136.56
mobilenetv2_yolov3 86.50 60.73 66.07
yolov4-tiny 117.88 76.88 71.40
nanodet_m 28.51 22.75 32.76
yolo-fastest-1.1 11.88 9.27 13.16
yolo-fastestv2 10.55 11.84 10.32
vision_transformer 4582.75 1131.90 1044.43
FastestDet 11.81 13.16 11.67
4 threads (ms)  baseline  pr6662  pr6662-bf16s
squeezenet 7.81 5.30 4.51
squeezenet_int8 10.22 5.90 5.24
mobilenet 12.95 7.45 7.22
mobilenet_int8 19.32 7.81 7.87
mobilenet_v2 8.57 6.65 5.61
mobilenet_v3 6.71 5.57 5.79
shufflenet 4.62 4.25 4.37
shufflenet_v2 4.40 4.56 5.67
mnasnet 7.72 5.73 6.00
proxylessnasnet 9.19 6.42 6.58
efficientnet_b0 14.82 11.66 10.63
efficientnetv2_b0 18.47 12.99 12.73
regnety_400m 14.93 11.38 12.31
blazeface 1.84 1.19 1.16
googlenet 28.37 20.03 17.86
googlenet_int8 36.93 18.92 17.77
resnet18 25.74 20.70 19.03
resnet18_int8 32.41 15.51 15.19
alexnet 31.40 14.62 14.01
vgg16 154.29 116.29 101.91
vgg16_int8 189.38 88.30 84.73
resnet50 66.72 43.77 40.29
resnet50_int8 78.62 37.57 33.78
squeezenet_ssd 28.95 19.96 17.67
squeezenet_ssd_int8 28.73 19.23 16.18
mobilenet_ssd 28.96 17.07 15.94
mobilenet_ssd_int8 38.48 16.18 17.90
mobilenet_yolo 96.66 39.68 49.54
mobilenetv2_yolov3 35.14 25.44 21.96
yolov4-tiny 56.07 37.11 32.84
nanodet_m 10.74 10.29 12.38
yolo-fastest-1.1 4.57 4.72 6.42
yolo-fastestv2 4.33 6.10 5.43
vision_transformer 1217.27 369.71 295.17
FastestDet 4.73 6.19 5.50

@nihui nihui closed this May 11, 2026
@nihui nihui reopened this May 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6b5905e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/layer/mips/rmsnorm_mips.cpp
Comment thread src/layer/loongarch/rmsnorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1c814c823


Comment thread src/layer/mips/layernorm_mips.cpp
Comment thread src/layer/loongarch/layernorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2119f149


Comment thread src/layer/loongarch/layernorm_loongarch.cpp
Comment thread src/layer/mips/rmsnorm_mips.cpp

4 participants