[libc++] Optimize ranges::{for_each, for_each_n} for segmented iterators #132896

winner245 · 2025-03-25T07:46:38Z

Previously, the segmented iterator optimization was limited to std::{for_each, for_each_n}. This patch extends it to std::ranges::for_each and std::ranges::for_each_n so that both families of algorithms are optimized consistently. The patch first generalizes the std algorithms by introducing a Projection parameter, which is set to __identity for the std algorithms. The ranges algorithms then call their std counterparts directly, forwarding the caller's projection as the __proj argument. Benchmarks show speedups of up to 21.3x for std::deque::iterator and up to 24.9x for join_view of vector<vector<char>>.

Addresses a subtask of #102817.

Summary of speedups for `deque` iterators

-------------------------------------------------------------------------------
Benchmark                        deque<char>    deque<short>    deque<int>
-------------------------------------------------------------------------------
rng::for_each                       13.1x          21.3x           4.4x
rng::for_each_n                     13.8x          15.5x           3.6x
-------------------------------------------------------------------------------

Summary of speedups for `join_view` iterators

-----------------------------------------------------------------------------------------
Benchmark          vector<vector<char>>    vector<vector<short>>    vector<vector<int>>
-----------------------------------------------------------------------------------------
rng::for_each             17.8x                   22.1x                    4.3x
rng::for_each_n           24.9x                   23.1x                    4.0x
-----------------------------------------------------------------------------------------

Benchmarks:

`{std, ranges}::for_each_n` with `deque` iterators

--------------------------------------------------------------------------
Benchmark                                    Before       After    Speedup
--------------------------------------------------------------------------
std::for_each_n(vector<char>)/8             4.26 ns     4.23 ns      1.0x
std::for_each_n(vector<char>)/32            2.68 ns     2.67 ns      1.0x
std::for_each_n(vector<char>)/50            9.49 ns     9.36 ns      1.0x
std::for_each_n(vector<char>)/1024          42.3 ns     40.1 ns      1.1x
std::for_each_n(vector<char>)/4096           163 ns      151 ns      1.1x
std::for_each_n(vector<char>)/8192           308 ns      294 ns      1.0x
std::for_each_n(vector<char>)/16384          608 ns      593 ns      1.0x
std::for_each_n(vector<char>)/65536         2435 ns     2464 ns      1.0x
std::for_each_n(vector<char>)/262144       10029 ns    10190 ns      1.0x
std::for_each_n(deque<char>)/8              6.57 ns     2.43 ns      2.7x
std::for_each_n(deque<char>)/32             24.0 ns     2.73 ns      8.8x
std::for_each_n(deque<char>)/50             33.2 ns     4.53 ns      7.3x
std::for_each_n(deque<char>)/1024            541 ns     44.9 ns     12.0x
std::for_each_n(deque<char>)/4096           2067 ns      169 ns     12.2x
std::for_each_n(deque<char>)/8192           4005 ns      305 ns     13.1x
std::for_each_n(deque<char>)/16384          7831 ns      639 ns     12.3x
std::for_each_n(deque<char>)/65536         31819 ns     2717 ns     11.7x
std::for_each_n(deque<char>)/262144       120801 ns    10674 ns     11.3x
std::for_each_n(list<char>)/8               4.97 ns     5.16 ns      1.0x
std::for_each_n(list<char>)/32              19.9 ns     20.6 ns      1.0x
std::for_each_n(list<char>)/50              40.6 ns     42.7 ns      1.0x
std::for_each_n(list<char>)/1024             996 ns     1038 ns      1.0x
std::for_each_n(list<char>)/4096            6186 ns     6341 ns      1.0x
std::for_each_n(list<char>)/8192           12522 ns    12391 ns      1.0x
std::for_each_n(list<char>)/16384          26158 ns    25739 ns      1.0x
std::for_each_n(list<char>)/65536         106410 ns   105299 ns      1.0x
std::for_each_n(list<char>)/262144        621473 ns   625741 ns      1.0x
rng::for_each_n(vector<char>)/8             3.85 ns     4.99 ns      0.8x
rng::for_each_n(vector<char>)/32            2.75 ns     2.91 ns      0.9x
rng::for_each_n(vector<char>)/50            9.67 ns     13.3 ns      0.7x
rng::for_each_n(vector<char>)/1024          41.4 ns     42.4 ns      1.0x
rng::for_each_n(vector<char>)/4096           154 ns      171 ns      0.9x
rng::for_each_n(vector<char>)/8192           308 ns      340 ns      0.9x
rng::for_each_n(vector<char>)/16384          608 ns      673 ns      0.9x
rng::for_each_n(vector<char>)/65536         2471 ns     2867 ns      0.9x
rng::for_each_n(vector<char>)/262144       10138 ns    10882 ns      0.9x
rng::for_each_n(deque<char>)/8              5.71 ns     2.32 ns      2.5x
rng::for_each_n(deque<char>)/32             24.0 ns     2.74 ns      8.8x
rng::for_each_n(deque<char>)/50             33.3 ns     5.00 ns      6.7x
rng::for_each_n(deque<char>)/1024            554 ns     42.1 ns     13.2x
rng::for_each_n(deque<char>)/4096           2194 ns      159 ns     13.8x
rng::for_each_n(deque<char>)/8192           4265 ns      337 ns     12.7x
rng::for_each_n(deque<char>)/16384          8539 ns      672 ns     12.7x
rng::for_each_n(deque<char>)/65536         33510 ns     2775 ns     12.1x
rng::for_each_n(deque<char>)/262144       136651 ns    11271 ns     12.1x
rng::for_each_n(list<char>)/8               5.37 ns     6.21 ns      0.9x
rng::for_each_n(list<char>)/32              20.3 ns     23.1 ns      0.9x
rng::for_each_n(list<char>)/50              41.3 ns     42.3 ns      1.0x
rng::for_each_n(list<char>)/1024            1036 ns     1064 ns      1.0x
rng::for_each_n(list<char>)/4096            6310 ns     6645 ns      0.9x
rng::for_each_n(list<char>)/8192           12996 ns    13245 ns      1.0x
rng::for_each_n(list<char>)/16384          24803 ns    25932 ns      1.0x
rng::for_each_n(list<char>)/65536         103587 ns   105354 ns      1.0x
rng::for_each_n(list<char>)/262144        550281 ns   753493 ns      0.7x
std::for_each_n(vector<short>)/8            4.42 ns     3.92 ns      1.1x
std::for_each_n(vector<short>)/32           1.62 ns     1.64 ns      1.0x
std::for_each_n(vector<short>)/50           2.74 ns     2.75 ns      1.0x
std::for_each_n(vector<short>)/1024         34.0 ns     33.6 ns      1.0x
std::for_each_n(vector<short>)/4096          120 ns      117 ns      1.0x
std::for_each_n(vector<short>)/8192          229 ns      267 ns      0.9x
std::for_each_n(vector<short>)/16384         452 ns      469 ns      1.0x
std::for_each_n(vector<short>)/65536        2262 ns     2265 ns      1.0x
std::for_each_n(vector<short>)/262144       9129 ns     9140 ns      1.0x
std::for_each_n(deque<short>)/8             5.28 ns     1.78 ns      3.0x
std::for_each_n(deque<short>)/32            22.8 ns     2.08 ns     11.0x
std::for_each_n(deque<short>)/50            32.3 ns     4.46 ns      7.2x
std::for_each_n(deque<short>)/1024           545 ns     35.2 ns     15.5x
std::for_each_n(deque<short>)/4096          2158 ns      128 ns     16.9x
std::for_each_n(deque<short>)/8192          4303 ns      243 ns     17.7x
std::for_each_n(deque<short>)/16384         8624 ns      516 ns     16.7x
std::for_each_n(deque<short>)/65536        34569 ns     2336 ns     14.8x
std::for_each_n(deque<short>)/262144      137820 ns     9319 ns     14.8x
std::for_each_n(list<short>)/8              4.66 ns     4.95 ns      0.9x
std::for_each_n(list<short>)/32             19.9 ns     20.4 ns      1.0x
std::for_each_n(list<short>)/50             41.3 ns     41.1 ns      1.0x
std::for_each_n(list<short>)/1024           1018 ns     1021 ns      1.0x
std::for_each_n(list<short>)/4096           6110 ns     6294 ns      1.0x
std::for_each_n(list<short>)/8192          12433 ns    12692 ns      1.0x
std::for_each_n(list<short>)/16384         24739 ns    24820 ns      1.0x
std::for_each_n(list<short>)/65536        103376 ns   102812 ns      1.0x
std::for_each_n(list<short>)/262144       538314 ns   555664 ns      1.0x
rng::for_each_n(vector<short>)/8            3.84 ns     3.90 ns      1.0x
rng::for_each_n(vector<short>)/32           1.60 ns     1.63 ns      1.0x
rng::for_each_n(vector<short>)/50           2.88 ns     2.88 ns      1.0x
rng::for_each_n(vector<short>)/1024         33.6 ns     33.8 ns      1.0x
rng::for_each_n(vector<short>)/4096          117 ns      117 ns      1.0x
rng::for_each_n(vector<short>)/8192          229 ns      233 ns      1.0x
rng::for_each_n(vector<short>)/16384         456 ns      479 ns      1.0x
rng::for_each_n(vector<short>)/65536        2256 ns     2288 ns      1.0x
rng::for_each_n(vector<short>)/262144       8966 ns     9078 ns      1.0x
rng::for_each_n(deque<short>)/8             6.52 ns     1.97 ns      3.3x
rng::for_each_n(deque<short>)/32            23.7 ns     2.10 ns     11.3x
rng::for_each_n(deque<short>)/50            34.1 ns     4.74 ns      7.2x
rng::for_each_n(deque<short>)/1024           539 ns     35.1 ns     15.4x
rng::for_each_n(deque<short>)/4096          1920 ns      131 ns     14.7x
rng::for_each_n(deque<short>)/8192          3957 ns      255 ns     15.5x
rng::for_each_n(deque<short>)/16384         7807 ns      505 ns     15.5x
rng::for_each_n(deque<short>)/65536        30293 ns     2435 ns     12.4x
rng::for_each_n(deque<short>)/262144      119499 ns     9667 ns     12.4x
rng::for_each_n(list<short>)/8              5.08 ns     5.38 ns      0.9x
rng::for_each_n(list<short>)/32             20.1 ns     20.5 ns      1.0x
rng::for_each_n(list<short>)/50             42.6 ns     41.1 ns      1.0x
rng::for_each_n(list<short>)/1024           1028 ns     1025 ns      1.0x
rng::for_each_n(list<short>)/4096           6857 ns     6311 ns      1.1x
rng::for_each_n(list<short>)/8192          13336 ns    12807 ns      1.0x
rng::for_each_n(list<short>)/16384         26031 ns    25081 ns      1.0x
rng::for_each_n(list<short>)/65536        101849 ns   109759 ns      0.9x
rng::for_each_n(list<short>)/262144       582600 ns   554157 ns      1.1x
std::for_each_n(vector<int>)/8              2.78 ns     2.73 ns      1.0x
std::for_each_n(vector<int>)/32             5.22 ns     5.26 ns      1.0x
std::for_each_n(vector<int>)/50             8.20 ns     8.65 ns      0.9x
std::for_each_n(vector<int>)/1024            156 ns      175 ns      0.9x
std::for_each_n(vector<int>)/4096            602 ns      758 ns      0.8x
std::for_each_n(vector<int>)/8192           1214 ns     1393 ns      0.9x
std::for_each_n(vector<int>)/16384          2417 ns     2690 ns      0.9x
std::for_each_n(vector<int>)/65536          9989 ns    10703 ns      0.9x
std::for_each_n(vector<int>)/262144        41512 ns    43798 ns      0.9x
std::for_each_n(deque<int>)/8               5.04 ns     2.75 ns      1.8x
std::for_each_n(deque<int>)/32              19.1 ns     5.56 ns      3.4x
std::for_each_n(deque<int>)/50              30.6 ns     8.55 ns      3.6x
std::for_each_n(deque<int>)/1024             567 ns      152 ns      3.7x
std::for_each_n(deque<int>)/4096            2241 ns      657 ns      3.4x
std::for_each_n(deque<int>)/8192            4512 ns     1334 ns      3.4x
std::for_each_n(deque<int>)/16384           9066 ns     2701 ns      3.4x
std::for_each_n(deque<int>)/65536          35955 ns    10887 ns      3.3x
std::for_each_n(deque<int>)/262144        146489 ns    44361 ns      3.3x
std::for_each_n(list<int>)/8                4.68 ns     6.05 ns      0.8x
std::for_each_n(list<int>)/32               21.0 ns     21.9 ns      1.0x
std::for_each_n(list<int>)/50               43.0 ns     42.2 ns      1.0x
std::for_each_n(list<int>)/1024             1015 ns     1035 ns      1.0x
std::for_each_n(list<int>)/4096             6373 ns     6331 ns      1.0x
std::for_each_n(list<int>)/8192            12757 ns    12836 ns      1.0x
std::for_each_n(list<int>)/16384           24879 ns    25035 ns      1.0x
std::for_each_n(list<int>)/65536          103931 ns   103773 ns      1.0x
std::for_each_n(list<int>)/262144         536841 ns   555330 ns      1.0x
rng::for_each_n(vector<int>)/8              2.76 ns     2.79 ns      1.0x
rng::for_each_n(vector<int>)/32             5.30 ns     5.22 ns      1.0x
rng::for_each_n(vector<int>)/50             8.09 ns     8.17 ns      1.0x
rng::for_each_n(vector<int>)/1024            152 ns      153 ns      1.0x
rng::for_each_n(vector<int>)/4096            612 ns      608 ns      1.0x
rng::for_each_n(vector<int>)/8192           1206 ns     1220 ns      1.0x
rng::for_each_n(vector<int>)/16384          2428 ns     2451 ns      1.0x
rng::for_each_n(vector<int>)/65536          9852 ns    10112 ns      1.0x
rng::for_each_n(vector<int>)/262144        39133 ns    42646 ns      0.9x
rng::for_each_n(deque<int>)/8               4.39 ns     2.79 ns      1.6x
rng::for_each_n(deque<int>)/32              18.3 ns     5.75 ns      3.2x
rng::for_each_n(deque<int>)/50              29.7 ns     9.29 ns      3.2x
rng::for_each_n(deque<int>)/1024             571 ns      167 ns      3.4x
rng::for_each_n(deque<int>)/4096            2297 ns      649 ns      3.5x
rng::for_each_n(deque<int>)/8192            4497 ns     1248 ns      3.6x
rng::for_each_n(deque<int>)/16384           9025 ns     2513 ns      3.6x
rng::for_each_n(deque<int>)/65536          36321 ns    10063 ns      3.6x
rng::for_each_n(deque<int>)/262144        144304 ns    40555 ns      3.6x
rng::for_each_n(list<int>)/8                6.00 ns     5.12 ns      1.2x
rng::for_each_n(list<int>)/32               22.3 ns     20.5 ns      1.1x
rng::for_each_n(list<int>)/50               41.5 ns     40.5 ns      1.0x
rng::for_each_n(list<int>)/1024             1041 ns     1004 ns      1.0x
rng::for_each_n(list<int>)/4096             6455 ns     6347 ns      1.0x
rng::for_each_n(list<int>)/8192            12870 ns    12753 ns      1.0x
rng::for_each_n(list<int>)/16384           25525 ns    25135 ns      1.0x
rng::for_each_n(list<int>)/65536          103878 ns   103348 ns      1.0x
rng::for_each_n(list<int>)/262144         576571 ns   548541 ns      1.1x
--------------------------------------------------------------------------

`{std, ranges}::for_each` with `deque` iterators

--------------------------------------------------------------------------
Benchmark                                    Before       After    Speedup
--------------------------------------------------------------------------
std::for_each(vector<char>)/8               2.36 ns     2.27 ns      1.0x
std::for_each(vector<char>)/32              2.71 ns     2.72 ns      1.0x
std::for_each(vector<char>)/50              3.93 ns     4.17 ns      0.9x
std::for_each(vector<char>)/1024            40.6 ns     41.3 ns      1.0x
std::for_each(vector<char>)/4096             150 ns      158 ns      0.9x
std::for_each(vector<char>)/8192             293 ns      304 ns      1.0x
std::for_each(vector<char>)/16384            597 ns      615 ns      1.0x
std::for_each(vector<char>)/65536           2471 ns     2478 ns      1.0x
std::for_each(vector<char>)/262144          9665 ns     9878 ns      1.0x
std::for_each(deque<char>)/8                2.33 ns     2.36 ns      1.0x
std::for_each(deque<char>)/32               2.79 ns     2.87 ns      1.0x
std::for_each(deque<char>)/50               4.13 ns     4.13 ns      1.0x
std::for_each(deque<char>)/1024             43.3 ns     42.6 ns      1.0x
std::for_each(deque<char>)/4096              171 ns      177 ns      1.0x
std::for_each(deque<char>)/8192              337 ns      336 ns      1.0x
std::for_each(deque<char>)/16384             658 ns      664 ns      1.0x
std::for_each(deque<char>)/65536            2658 ns     2727 ns      1.0x
std::for_each(deque<char>)/262144          10916 ns    11005 ns      1.0x
std::for_each(list<char>)/8                 4.19 ns     3.94 ns      1.1x
std::for_each(list<char>)/32                35.1 ns     34.6 ns      1.0x
std::for_each(list<char>)/50                57.1 ns     54.2 ns      1.1x
std::for_each(list<char>)/1024              1044 ns     1034 ns      1.0x
std::for_each(list<char>)/4096              6214 ns     6225 ns      1.0x
std::for_each(list<char>)/8192             11791 ns    11629 ns      1.0x
std::for_each(list<char>)/16384            21278 ns    21767 ns      1.0x
std::for_each(list<char>)/65536            97876 ns    97773 ns      1.0x
std::for_each(list<char>)/262144          497406 ns   498083 ns      1.0x
rng::for_each(vector<char>)/8               3.72 ns     2.40 ns      1.5x
rng::for_each(vector<char>)/32              2.94 ns     2.79 ns      1.1x
rng::for_each(vector<char>)/50              9.81 ns     4.08 ns      2.4x
rng::for_each(vector<char>)/1024            46.2 ns     42.2 ns      1.1x
rng::for_each(vector<char>)/4096             171 ns      156 ns      1.1x
rng::for_each(vector<char>)/8192             334 ns      307 ns      1.1x
rng::for_each(vector<char>)/16384            675 ns      611 ns      1.1x
rng::for_each(vector<char>)/65536           2665 ns     2449 ns      1.1x
rng::for_each(vector<char>)/262144         10656 ns     9963 ns      1.1x
rng::for_each(deque<char>)/8                5.16 ns     2.37 ns      2.2x
rng::for_each(deque<char>)/32               23.2 ns     2.80 ns      8.3x
rng::for_each(deque<char>)/50               33.1 ns     4.15 ns      8.0x
rng::for_each(deque<char>)/1024              551 ns     41.9 ns     13.1x
rng::for_each(deque<char>)/4096             2179 ns      170 ns     12.8x
rng::for_each(deque<char>)/8192             4404 ns      344 ns     12.8x
rng::for_each(deque<char>)/16384            8719 ns      666 ns     13.1x
rng::for_each(deque<char>)/65536           34988 ns     2702 ns     13.0x
rng::for_each(deque<char>)/262144         141022 ns    11098 ns     12.7x
rng::for_each(list<char>)/8                 3.86 ns     4.07 ns      0.9x
rng::for_each(list<char>)/32                22.2 ns     34.9 ns      0.6x
rng::for_each(list<char>)/50                55.6 ns     54.2 ns      1.0x
rng::for_each(list<char>)/1024              1018 ns     1025 ns      1.0x
rng::for_each(list<char>)/4096              6661 ns     6690 ns      1.0x
rng::for_each(list<char>)/8192             11840 ns    11128 ns      1.1x
rng::for_each(list<char>)/16384            21107 ns    21612 ns      1.0x
rng::for_each(list<char>)/65536            97611 ns    99755 ns      1.0x
rng::for_each(list<char>)/262144          488435 ns   484463 ns      1.0x
std::for_each(vector<short>)/8              1.56 ns     1.61 ns      1.0x
std::for_each(vector<short>)/32             1.57 ns     1.63 ns      1.0x
std::for_each(vector<short>)/50             2.83 ns     2.82 ns      1.0x
std::for_each(vector<short>)/1024           37.1 ns     33.7 ns      1.1x
std::for_each(vector<short>)/4096            134 ns      133 ns      1.0x
std::for_each(vector<short>)/8192            235 ns      232 ns      1.0x
std::for_each(vector<short>)/16384           461 ns      457 ns      1.0x
std::for_each(vector<short>)/65536          2307 ns     2486 ns      0.9x
std::for_each(vector<short>)/262144         9273 ns     9248 ns      1.0x
std::for_each(deque<short>)/8               1.59 ns     1.56 ns      1.0x
std::for_each(deque<short>)/32              1.55 ns     1.55 ns      1.0x
std::for_each(deque<short>)/50              2.79 ns     2.81 ns      1.0x
std::for_each(deque<short>)/1024            34.0 ns     37.1 ns      0.9x
std::for_each(deque<short>)/4096             122 ns      127 ns      1.0x
std::for_each(deque<short>)/8192             247 ns      236 ns      1.0x
std::for_each(deque<short>)/16384            484 ns      469 ns      1.0x
std::for_each(deque<short>)/65536           2328 ns     2272 ns      1.0x
std::for_each(deque<short>)/262144          9203 ns     9214 ns      1.0x
std::for_each(list<short>)/8                3.44 ns     3.64 ns      0.9x
std::for_each(list<short>)/32               23.7 ns     20.8 ns      1.1x
std::for_each(list<short>)/50               52.6 ns     56.3 ns      0.9x
std::for_each(list<short>)/1024             1025 ns     1031 ns      1.0x
std::for_each(list<short>)/4096             6100 ns     6250 ns      1.0x
std::for_each(list<short>)/8192            11627 ns    11765 ns      1.0x
std::for_each(list<short>)/16384           22026 ns    21348 ns      1.0x
std::for_each(list<short>)/65536          104321 ns   102664 ns      1.0x
std::for_each(list<short>)/262144         521524 ns   498252 ns      1.0x
rng::for_each(vector<short>)/8              4.56 ns     1.55 ns      2.9x
rng::for_each(vector<short>)/32             1.76 ns     1.61 ns      1.1x
rng::for_each(vector<short>)/50             2.69 ns     2.90 ns      0.9x
rng::for_each(vector<short>)/1024           33.3 ns     34.4 ns      1.0x
rng::for_each(vector<short>)/4096            121 ns      117 ns      1.0x
rng::for_each(vector<short>)/8192            231 ns      232 ns      1.0x
rng::for_each(vector<short>)/16384           461 ns      457 ns      1.0x
rng::for_each(vector<short>)/65536          2251 ns     2249 ns      1.0x
rng::for_each(vector<short>)/262144         9080 ns     9064 ns      1.0x
rng::for_each(deque<short>)/8               4.86 ns     1.59 ns      3.1x
rng::for_each(deque<short>)/32              23.9 ns     1.56 ns     15.3x
rng::for_each(deque<short>)/50              36.2 ns     2.91 ns     12.4x
rng::for_each(deque<short>)/1024             637 ns     34.4 ns     18.5x
rng::for_each(deque<short>)/4096            2486 ns      125 ns     19.9x
rng::for_each(deque<short>)/8192            5039 ns      237 ns     21.3x
rng::for_each(deque<short>)/16384           9968 ns      474 ns     21.0x
rng::for_each(deque<short>)/65536          39995 ns     2294 ns     17.4x
rng::for_each(deque<short>)/262144        161619 ns     9273 ns     17.4x
rng::for_each(list<short>)/8                3.92 ns     3.85 ns      1.0x
rng::for_each(list<short>)/32               35.6 ns     21.4 ns      1.7x
rng::for_each(list<short>)/50               53.8 ns     53.9 ns      1.0x
rng::for_each(list<short>)/1024             1026 ns     1027 ns      1.0x
rng::for_each(list<short>)/4096             6646 ns     6574 ns      1.0x
rng::for_each(list<short>)/8192            11429 ns    11104 ns      1.0x
rng::for_each(list<short>)/16384           21677 ns    21029 ns      1.0x
rng::for_each(list<short>)/65536          105132 ns   102157 ns      1.0x
rng::for_each(list<short>)/262144         483564 ns   482510 ns      1.0x
std::for_each(vector<int>)/8                2.76 ns     2.76 ns      1.0x
std::for_each(vector<int>)/32               5.28 ns     5.24 ns      1.0x
std::for_each(vector<int>)/50               7.93 ns     8.06 ns      1.0x
std::for_each(vector<int>)/1024              156 ns      155 ns      1.0x
std::for_each(vector<int>)/4096              609 ns      615 ns      1.0x
std::for_each(vector<int>)/8192             1187 ns     1217 ns      1.0x
std::for_each(vector<int>)/16384            2385 ns     2446 ns      1.0x
std::for_each(vector<int>)/65536            9613 ns     9735 ns      1.0x
std::for_each(vector<int>)/262144          38775 ns    40545 ns      1.0x
std::for_each(deque<int>)/8                 2.74 ns     2.77 ns      1.0x
std::for_each(deque<int>)/32                5.36 ns     5.32 ns      1.0x
std::for_each(deque<int>)/50                8.44 ns     7.94 ns      1.1x
std::for_each(deque<int>)/1024               178 ns      156 ns      1.1x
std::for_each(deque<int>)/4096               689 ns      644 ns      1.1x
std::for_each(deque<int>)/8192              1345 ns     1273 ns      1.1x
std::for_each(deque<int>)/16384             2877 ns     2556 ns      1.1x
std::for_each(deque<int>)/65536            11167 ns    10196 ns      1.1x
std::for_each(deque<int>)/262144           42527 ns    40692 ns      1.0x
std::for_each(list<int>)/8                  4.02 ns     3.74 ns      1.1x
std::for_each(list<int>)/32                 38.4 ns     21.0 ns      1.8x
std::for_each(list<int>)/50                 56.9 ns     54.2 ns      1.0x
std::for_each(list<int>)/1024               1018 ns     1021 ns      1.0x
std::for_each(list<int>)/4096               6570 ns     6640 ns      1.0x
std::for_each(list<int>)/8192              11447 ns    11230 ns      1.0x
std::for_each(list<int>)/16384             20943 ns    21013 ns      1.0x
std::for_each(list<int>)/65536            106761 ns   106624 ns      1.0x
std::for_each(list<int>)/262144           533213 ns   545600 ns      1.0x
rng::for_each(vector<int>)/8                2.93 ns     2.82 ns      1.0x
rng::for_each(vector<int>)/32               5.57 ns     5.42 ns      1.0x
rng::for_each(vector<int>)/50               8.27 ns     7.99 ns      1.0x
rng::for_each(vector<int>)/1024              154 ns      156 ns      1.0x
rng::for_each(vector<int>)/4096              611 ns      606 ns      1.0x
rng::for_each(vector<int>)/8192             1194 ns     1203 ns      1.0x
rng::for_each(vector<int>)/16384            2423 ns     2442 ns      1.0x
rng::for_each(vector<int>)/65536            9702 ns     9960 ns      1.0x
rng::for_each(vector<int>)/262144          39326 ns    41502 ns      0.9x
rng::for_each(deque<int>)/8                 4.64 ns     2.81 ns      1.7x
rng::for_each(deque<int>)/32                20.8 ns     5.28 ns      3.9x
rng::for_each(deque<int>)/50                35.5 ns     8.01 ns      4.4x
rng::for_each(deque<int>)/1024               640 ns      170 ns      3.8x
rng::for_each(deque<int>)/4096              2589 ns      672 ns      3.9x
rng::for_each(deque<int>)/8192              5033 ns     1340 ns      3.8x
rng::for_each(deque<int>)/16384            10136 ns     2794 ns      3.6x
rng::for_each(deque<int>)/65536            40210 ns    10524 ns      3.8x
rng::for_each(deque<int>)/262144          164145 ns    42007 ns      3.9x
rng::for_each(list<int>)/8                  4.08 ns     3.88 ns      1.1x
rng::for_each(list<int>)/32                 35.1 ns     21.5 ns      1.6x
rng::for_each(list<int>)/50                 54.1 ns     55.8 ns      1.0x
rng::for_each(list<int>)/1024               1041 ns     1094 ns      1.0x
rng::for_each(list<int>)/4096               6607 ns     6955 ns      1.0x
rng::for_each(list<int>)/8192              11412 ns    11509 ns      1.0x
rng::for_each(list<int>)/16384             21225 ns    21480 ns      1.0x
rng::for_each(list<int>)/65536            102125 ns   106719 ns      1.0x
rng::for_each(list<int>)/262144           521829 ns   521055 ns      1.0x
--------------------------------------------------------------------------

`{std, ranges}::for_each{, _n}` with `join_view` iterators

---------------------------------------------------------------------------------------------
Benchmark                                                       Before      After    Speedup
---------------------------------------------------------------------------------------------
std::for_each(join_view(vector<vector<char>>))/8               2.25 ns     2.25 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/32              2.66 ns     2.65 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/50              4.81 ns     4.89 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/1024            40.5 ns     40.3 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/4096             159 ns      160 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/8192             324 ns      324 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/16384            651 ns      639 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/65536           2645 ns     2617 ns      1.0x
std::for_each(join_view(vector<vector<char>>))/262144         10690 ns    10415 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/8              2.23 ns     2.15 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/32             2.26 ns     2.29 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/50             4.30 ns     4.60 ns      0.9x
std::for_each(join_view(vector<vector<short>>))/1024           39.4 ns     41.0 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/4096            182 ns      182 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/8192            350 ns      363 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/16384           707 ns      716 ns      1.0x
std::for_each(join_view(vector<vector<short>>))/65536          2992 ns     3164 ns      0.9x
std::for_each(join_view(vector<vector<short>>))/262144        11883 ns    12178 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/8                2.83 ns     2.92 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/32               6.01 ns     6.33 ns      0.9x
std::for_each(join_view(vector<vector<int>>))/50               9.27 ns     9.60 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/1024              172 ns      173 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/4096              695 ns      699 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/8192             1361 ns     1387 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/16384            2789 ns     2993 ns      0.9x
std::for_each(join_view(vector<vector<int>>))/65536           11228 ns    11184 ns      1.0x
std::for_each(join_view(vector<vector<int>>))/262144          44412 ns    47894 ns      0.9x
rng::for_each(join_view(vector<vector<char>>))/8               6.39 ns     2.44 ns      2.6x
rng::for_each(join_view(vector<vector<char>>))/32              32.3 ns     2.84 ns     11.4x
rng::for_each(join_view(vector<vector<char>>))/50              41.8 ns     5.14 ns      8.1x
rng::for_each(join_view(vector<vector<char>>))/1024             744 ns     44.5 ns     16.7x
rng::for_each(join_view(vector<vector<char>>))/4096            3069 ns      172 ns     17.8x
rng::for_each(join_view(vector<vector<char>>))/8192            5988 ns      345 ns     17.4x
rng::for_each(join_view(vector<vector<char>>))/16384          11820 ns      696 ns     17.0x
rng::for_each(join_view(vector<vector<char>>))/65536          48948 ns     2764 ns     17.7x
rng::for_each(join_view(vector<vector<char>>))/262144        192328 ns    10913 ns     17.6x
rng::for_each(join_view(vector<vector<short>>))/8              7.07 ns     2.42 ns      2.9x
rng::for_each(join_view(vector<vector<short>>))/32             37.1 ns     2.67 ns     13.9x
rng::for_each(join_view(vector<vector<short>>))/50             50.4 ns     4.99 ns     10.1x
rng::for_each(join_view(vector<vector<short>>))/1024            738 ns     34.5 ns     21.4x
rng::for_each(join_view(vector<vector<short>>))/4096           2943 ns      138 ns     21.3x
rng::for_each(join_view(vector<vector<short>>))/8192           5828 ns      265 ns     22.0x
rng::for_each(join_view(vector<vector<short>>))/16384         11746 ns      531 ns     22.1x
rng::for_each(join_view(vector<vector<short>>))/65536         48087 ns     2594 ns     18.5x
rng::for_each(join_view(vector<vector<short>>))/262144       188488 ns    10406 ns     18.1x
rng::for_each(join_view(vector<vector<int>>))/8                6.28 ns     2.81 ns      2.2x
rng::for_each(join_view(vector<vector<int>>))/32               28.2 ns     6.53 ns      4.3x
rng::for_each(join_view(vector<vector<int>>))/50               41.6 ns     10.1 ns      4.1x
rng::for_each(join_view(vector<vector<int>>))/1024              720 ns      178 ns      4.0x
rng::for_each(join_view(vector<vector<int>>))/4096             2772 ns      744 ns      3.7x
rng::for_each(join_view(vector<vector<int>>))/8192             5575 ns     1502 ns      3.7x
rng::for_each(join_view(vector<vector<int>>))/16384           11323 ns     2988 ns      3.8x
rng::for_each(join_view(vector<vector<int>>))/65536           44912 ns    11843 ns      3.8x
rng::for_each(join_view(vector<vector<int>>))/262144         184685 ns    47666 ns      3.9x
std::for_each_n(join_view(vector<vector<char>>))/8             5.03 ns     2.44 ns      2.1x
std::for_each_n(join_view(vector<vector<char>>))/32            22.5 ns     2.80 ns      8.0x
std::for_each_n(join_view(vector<vector<char>>))/50            30.5 ns     5.26 ns      5.8x
std::for_each_n(join_view(vector<vector<char>>))/1024           478 ns     51.9 ns      9.2x
std::for_each_n(join_view(vector<vector<char>>))/4096          1896 ns      165 ns     11.5x
std::for_each_n(join_view(vector<vector<char>>))/8192          3867 ns      346 ns     11.2x
std::for_each_n(join_view(vector<vector<char>>))/16384         7660 ns      682 ns     11.2x
std::for_each_n(join_view(vector<vector<char>>))/65536        30498 ns     4234 ns      7.2x
std::for_each_n(join_view(vector<vector<char>>))/262144      122379 ns    12491 ns      9.8x
std::for_each_n(join_view(vector<vector<short>>))/8            5.59 ns     2.42 ns      2.3x
std::for_each_n(join_view(vector<vector<short>>))/32           22.8 ns     2.50 ns      9.1x
std::for_each_n(join_view(vector<vector<short>>))/50           30.0 ns     5.05 ns      5.9x
std::for_each_n(join_view(vector<vector<short>>))/1024          481 ns     42.9 ns     11.2x
std::for_each_n(join_view(vector<vector<short>>))/4096         1943 ns      199 ns      9.8x
std::for_each_n(join_view(vector<vector<short>>))/8192         3840 ns      371 ns     10.3x
std::for_each_n(join_view(vector<vector<short>>))/16384        7638 ns      728 ns     10.5x
std::for_each_n(join_view(vector<vector<short>>))/65536       31207 ns     2920 ns     10.7x
std::for_each_n(join_view(vector<vector<short>>))/262144     125150 ns    11799 ns     10.6x
std::for_each_n(join_view(vector<vector<int>>))/8              5.40 ns     2.90 ns      1.9x
std::for_each_n(join_view(vector<vector<int>>))/32             21.6 ns     6.82 ns      3.2x
std::for_each_n(join_view(vector<vector<int>>))/50             29.0 ns     9.53 ns      3.0x
std::for_each_n(join_view(vector<vector<int>>))/1024            473 ns      173 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/4096           1890 ns      707 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/8192           3763 ns     1397 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/16384          7690 ns     2835 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/65536         30403 ns    11352 ns      2.7x
std::for_each_n(join_view(vector<vector<int>>))/262144       124215 ns    46235 ns      2.7x
rng::for_each_n(join_view(vector<vector<char>>))/8             5.93 ns     2.39 ns      2.5x
rng::for_each_n(join_view(vector<vector<char>>))/32            26.4 ns     2.84 ns      9.3x
rng::for_each_n(join_view(vector<vector<char>>))/50            38.6 ns     5.59 ns      6.9x
rng::for_each_n(join_view(vector<vector<char>>))/1024           686 ns     44.0 ns     15.6x
rng::for_each_n(join_view(vector<vector<char>>))/4096          3223 ns      172 ns     18.7x
rng::for_each_n(join_view(vector<vector<char>>))/8192          8771 ns      352 ns     24.9x
rng::for_each_n(join_view(vector<vector<char>>))/16384        15115 ns      701 ns     21.6x
rng::for_each_n(join_view(vector<vector<char>>))/65536        62153 ns     3017 ns     20.6x
rng::for_each_n(join_view(vector<vector<char>>))/262144      249936 ns    11436 ns     21.9x
rng::for_each_n(join_view(vector<vector<short>>))/8            7.30 ns     2.52 ns      2.9x
rng::for_each_n(join_view(vector<vector<short>>))/32           30.6 ns     2.47 ns     12.4x
rng::for_each_n(join_view(vector<vector<short>>))/50           37.1 ns     4.78 ns      7.8x
rng::for_each_n(join_view(vector<vector<short>>))/1024          674 ns     36.8 ns     18.3x
rng::for_each_n(join_view(vector<vector<short>>))/4096         2686 ns      141 ns     19.0x
rng::for_each_n(join_view(vector<vector<short>>))/8192         5415 ns      273 ns     19.8x
rng::for_each_n(join_view(vector<vector<short>>))/16384       12075 ns      523 ns     23.1x
rng::for_each_n(join_view(vector<vector<short>>))/65536       45979 ns     2495 ns     18.4x
rng::for_each_n(join_view(vector<vector<short>>))/262144     188528 ns    10266 ns     18.4x
rng::for_each_n(join_view(vector<vector<int>>))/8              6.71 ns     2.89 ns      2.3x
rng::for_each_n(join_view(vector<vector<int>>))/32             26.1 ns     6.48 ns      4.0x
rng::for_each_n(join_view(vector<vector<int>>))/50             37.6 ns     9.55 ns      3.9x
rng::for_each_n(join_view(vector<vector<int>>))/1024            636 ns      168 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/4096           2657 ns      697 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/8192           5082 ns     1363 ns      3.7x
rng::for_each_n(join_view(vector<vector<int>>))/16384         10629 ns     2764 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/65536         42324 ns    11006 ns      3.8x
rng::for_each_n(join_view(vector<vector<int>>))/262144       169755 ns    44317 ns      3.8x
---------------------------------------------------------------------------------------------
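
The Speedup column in these tables appears to be the Before time divided by the After time, rounded to one decimal place. As a rough check under that assumption, the small helper below (the name `speedup` is made up for this note and is not part of the benchmark suite) reproduces the 24.9x figure from the `rng::for_each_n(join_view(vector<vector<char>>))/8192` row.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of the "Before" time to the "After" time,
// rounded to one decimal place as in the tables above.
inline double speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0); // a zero or negative timing would be meaningless
  return std::round(before_ns / after_ns * 10.0) / 10.0;
}
```

For example, `speedup(8771, 352)` evaluates 8771 / 352 ≈ 24.92 and rounds it to 24.9.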

llvmbot · 2025-03-25T15:59:57Z

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

This patch extends segmented iterator optimizations, previously applied to std::for_each, to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n by forwarding to std::for_each. New tests validate these optimizations for segmented iterators (e.g., deque<int> and join_view iterators). Benchmarks demonstrate up to 3.9x performance improvement for deque<int> iterators, aligning their performance with contiguous iterators (e.g., vector<int>). The vector<int> performance serves as a baseline for contiguous iterators, representing the upper bound for segmented deque<int> inputs.

Addresses a subtask of #102817.
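
For readers unfamiliar with the idea, the following is a minimal sketch of what iterating "by segment" means. It is not libc++'s implementation, which works through internal iterator traits. Here `for_each_by_segment` and `identity_proj` are hypothetical names invented for this illustration, `identity_proj` is a C++17-compatible stand-in for `std::identity`, and a `vector<vector<T>>` stands in for a segmented range such as the one behind a `join_view`. The outer loop runs once per segment and the inner loop is a plain pointer walk, which is presumably the kind of loop the compiler can optimize well.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for std::identity so the sketch also builds as C++17.
struct identity_proj {
  template <class T>
  constexpr T&& operator()(T&& t) const noexcept {
    return std::forward<T>(t);
  }
};

// Illustration only: apply func(proj(x)) to every element, one segment at a time.
template <class T, class Func, class Proj = identity_proj>
Func for_each_by_segment(std::vector<std::vector<T>>& segments, Func func, Proj proj = {}) {
  for (auto& segment : segments) { // outer loop: one iteration per segment
    T* first = segment.data();
    T* last  = first + segment.size();
    for (; first != last; ++first) // inner loop: contiguous pointer walk
      std::invoke(func, std::invoke(proj, *first));
  }
  return func;
}

// Usage sketch: sum the doubled elements of three segments.
inline long sum_doubled() {
  std::vector<std::vector<int>> segments = {{0}, {1, 2}, {3, 4, 5}};
  long sum = 0;
  for_each_by_segment(segments, [&sum](int x) { sum += x; }, [](int x) { return 2 * x; });
  assert(sum == 30); // 2 * (0 + 1 + 2 + 3 + 4 + 5)
  return sum;
}
```

The projection is applied inside the inner loop, which mirrors the role of the Projection parameter described in the PR summary.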

`for_each_n`

--------------------------------------------------------------------------------
Benchmark                                       Before          After    Speedup
--------------------------------------------------------------------------------
std::for_each_n(deque<int>)/8                  5.31 ns         3.39 ns      1.6x
std::for_each_n(deque<int>)/32                 20.1 ns         6.89 ns      2.9x
std::for_each_n(deque<int>)/1024                612 ns          171 ns      3.6x
std::for_each_n(deque<int>)/8192               4892 ns         1350 ns      3.6x
std::for_each_n(deque<int>)/16384              9786 ns         2774 ns      3.5x
std::for_each_n(deque<int>)/65536             39026 ns        11339 ns      3.4x
std::for_each_n(deque<int>)/262144           157897 ns        45166 ns      3.5x
std::for_each_n(deque<int>)/1048576          643836 ns       184999 ns      3.5x
rng::for_each_n(deque<int>)/8                  4.85 ns         4.94 ns      1.0x
rng::for_each_n(deque<int>)/32                 18.1 ns         8.47 ns      2.1x
rng::for_each_n(deque<int>)/1024                622 ns          171 ns      3.6x
rng::for_each_n(deque<int>)/8192               5008 ns         1363 ns      3.7x
rng::for_each_n(deque<int>)/16384              9952 ns         2744 ns      3.6x
rng::for_each_n(deque<int>)/65536             40204 ns        10841 ns      3.7x
rng::for_each_n(deque<int>)/262144           157713 ns        43386 ns      3.6x
rng::for_each_n(deque<int>)/1048576          637549 ns       177042 ns      3.6x
std::for_each_n(vector<int>)/8                 2.91 ns         2.94 ns      1.0x
std::for_each_n(vector<int>)/32                5.42 ns         5.54 ns      1.0x
std::for_each_n(vector<int>)/1024               161 ns          165 ns      1.0x
std::for_each_n(vector<int>)/8192              1271 ns         1292 ns      1.0x
std::for_each_n(vector<int>)/16384             2556 ns         2619 ns      1.0x
std::for_each_n(vector<int>)/65536            10125 ns        10659 ns      1.0x
std::for_each_n(vector<int>)/262144           44572 ns        44372 ns      1.0x
std::for_each_n(vector<int>)/1048576         180804 ns       183389 ns      1.0x
rng::for_each_n(vector<int>)/8                 3.05 ns         3.05 ns      1.0x
rng::for_each_n(vector<int>)/32                5.71 ns         5.85 ns      1.0x
rng::for_each_n(vector<int>)/1024               167 ns          183 ns      0.9x
rng::for_each_n(vector<int>)/8192              1298 ns         1429 ns      0.9x
rng::for_each_n(vector<int>)/16384             2691 ns         2870 ns      0.9x
rng::for_each_n(vector<int>)/65536            10632 ns        11465 ns      0.9x
rng::for_each_n(vector<int>)/262144           53031 ns        45948 ns      1.2x
rng::for_each_n(vector<int>)/1048576         174328 ns       184270 ns      0.9x

`for_each`

--------------------------------------------------------------------------------
Benchmark                                     Before           After     Speedup
--------------------------------------------------------------------------------
std::for_each(deque<int>)/8                  3.18 ns         2.96 ns        1.1x
std::for_each(deque<int>)/32                 5.70 ns         5.54 ns        1.0x
std::for_each(deque<int>)/1024                183 ns          180 ns        1.0x
std::for_each(deque<int>)/8192               1435 ns         1422 ns        1.0x
std::for_each(deque<int>)/16384              2885 ns         2879 ns        1.0x
std::for_each(deque<int>)/65536             11423 ns        11378 ns        1.0x
std::for_each(deque<int>)/262144            45203 ns        43686 ns        1.0x
std::for_each(deque<int>)/1048576          181832 ns       173832 ns        1.0x
rng::for_each(deque<int>)/8                  5.10 ns         3.75 ns        1.4x
rng::for_each(deque<int>)/32                 23.5 ns         7.49 ns        3.1x
rng::for_each(deque<int>)/1024                693 ns          184 ns        3.8x
rng::for_each(deque<int>)/8192               5522 ns         1430 ns        3.9x
rng::for_each(deque<int>)/16384             11112 ns         2930 ns        3.8x
rng::for_each(deque<int>)/65536             44390 ns        11656 ns        3.8x
rng::for_each(deque<int>)/262144           179419 ns        46582 ns        3.9x
rng::for_each(deque<int>)/1048576          711406 ns       189658 ns        3.8x
std::for_each(vector<int>)/8                 2.96 ns         2.91 ns        1.0x
std::for_each(vector<int>)/32                5.54 ns         5.49 ns        1.0x
std::for_each(vector<int>)/1024               165 ns          162 ns        1.0x
std::for_each(vector<int>)/8192              1269 ns         1257 ns        1.0x
std::for_each(vector<int>)/16384             2636 ns         2567 ns        1.0x
std::for_each(vector<int>)/65536            10231 ns        10215 ns        1.0x
std::for_each(vector<int>)/262144           41544 ns        40719 ns        1.0x
std::for_each(vector<int>)/1048576         173667 ns       167878 ns        1.0x
rng::for_each(vector<int>)/8                 3.09 ns         3.06 ns        1.0x
rng::for_each(vector<int>)/32                5.85 ns         5.77 ns        1.0x
rng::for_each(vector<int>)/1024               179 ns          168 ns        1.1x
rng::for_each(vector<int>)/8192              1346 ns         1309 ns        1.0x
rng::for_each(vector<int>)/16384             2714 ns         2664 ns        1.0x
rng::for_each(vector<int>)/65536            10979 ns        10523 ns        1.0x
rng::for_each(vector<int>)/262144           42994 ns        42535 ns        1.0x
rng::for_each(vector<int>)/1048576         175633 ns       173933 ns        1.0x

Full diff: https://github.com/llvm/llvm-project/pull/132896.diff

8 Files Affected:

(modified) libcxx/include/__algorithm/for_each_n.h (+24-1)
(modified) libcxx/include/__algorithm/ranges_for_each.h (+11-3)
(modified) libcxx/include/__algorithm/ranges_for_each_n.h (+11-4)
(added) libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp (+57)
(modified) libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp (+1-1)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (+82-38)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp (+41-5)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp (+44-2)

diff --git a/libcxx/include/__algorithm/for_each_n.h b/libcxx/include/__algorithm/for_each_n.h
index fce380b49df3e..3d91124432f56 100644
--- a/libcxx/include/__algorithm/for_each_n.h
+++ b/libcxx/include/__algorithm/for_each_n.h
@@ -10,7 +10,11 @@
 #ifndef _LIBCPP___ALGORITHM_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__config>
+#include <__iterator/iterator_traits.h>
+#include <__iterator/segmented_iterator.h>
+#include <__type_traits/enable_if.h>
 #include <__utility/convert_to_integral.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -21,7 +25,13 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 
 #if _LIBCPP_STD_VER >= 17
 
-template <class _InputIterator, class _Size, class _Function>
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<!__is_segmented_iterator<_InputIterator>::value ||
+                            (__has_input_iterator_category<_InputIterator>::value &&
+                             !__has_random_access_iterator_category<_InputIterator>::value),
+                        int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
 for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   typedef decltype(std::__convert_to_integral(__orig_n)) _IntegralSize;
@@ -34,6 +44,19 @@ for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   return __first;
 }
 
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<__is_segmented_iterator<_InputIterator>::value &&
+                            __has_random_access_iterator_category<_InputIterator>::value,
+                        int> = 0>
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
+for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
+  _InputIterator __last = __first + __orig_n;
+  std::for_each(__first, __last, __f);
+  return __last;
+}
+
 #endif
 
 _LIBCPP_END_NAMESPACE_STD
diff --git a/libcxx/include/__algorithm/ranges_for_each.h b/libcxx/include/__algorithm/ranges_for_each.h
index de39bc5522753..475f85366188e 100644
--- a/libcxx/include/__algorithm/ranges_for_each.h
+++ b/libcxx/include/__algorithm/ranges_for_each.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -41,9 +42,16 @@ struct __for_each {
   template <class _Iter, class _Sent, class _Proj, class _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr static for_each_result<_Iter, _Func>
   __for_each_impl(_Iter __first, _Sent __last, _Func& __func, _Proj& __proj) {
-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (random_access_iterator<_Iter> && sized_sentinel_for<_Sent, _Iter>) {
+      auto __n   = __last - __first;
+      auto __end = __first + __n;
+      std::for_each(__first, __end, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__end), std::move(__func)};
+    } else {
+      for (; __first != __last; ++__first)
+        std::invoke(__func, std::invoke(__proj, *__first));
+      return {std::move(__first), std::move(__func)};
+    }
   }
 
 public:
diff --git a/libcxx/include/__algorithm/ranges_for_each_n.h b/libcxx/include/__algorithm/ranges_for_each_n.h
index 603cb723233c8..3108d66001295 100644
--- a/libcxx/include/__algorithm/ranges_for_each_n.h
+++ b/libcxx/include/__algorithm/ranges_for_each_n.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -40,11 +41,17 @@ struct __for_each_n {
   template <input_iterator _Iter, class _Proj = identity, indirectly_unary_invocable<projected<_Iter, _Proj>> _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr for_each_n_result<_Iter, _Func>
   operator()(_Iter __first, iter_difference_t<_Iter> __count, _Func __func, _Proj __proj = {}) const {
-    while (__count-- > 0) {
-      std::invoke(__func, std::invoke(__proj, *__first));
-      ++__first;
+    if constexpr (random_access_iterator<_Iter>) {
+      auto __last = __first + __count;
+      std::for_each(__first, __last, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__last), std::move(__func)};
+    } else {
+      while (__count-- > 0) {
+        std::invoke(__func, std::invoke(__proj, *__first));
+        ++__first;
+      }
+      return {std::move(__first), std::move(__func)};
     }
-    return {std::move(__first), std::move(__func)};
   }
 };
 
diff --git a/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
new file mode 100644
index 0000000000000..af46371881577
--- /dev/null
+++ b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
@@ -0,0 +1,57 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17
+
+#include <algorithm>
+#include <cstddef>
+#include <deque>
+#include <list>
+#include <string>
+#include <vector>
+
+#include <benchmark/benchmark.h>
+
+int main(int argc, char** argv) {
+  auto std_for_each_n = [](auto first, auto n, auto f) { return std::for_each_n(first, n, f); };
+
+  // {std,ranges}::for_each_n
+  {
+    auto bm = []<class Container>(std::string name, auto for_each_n) {
+      benchmark::RegisterBenchmark(
+          name,
+          [for_each_n](auto& st) {
+            std::size_t const n = st.range(0);
+            Container c(n, 1);
+            auto first = c.begin();
+
+            for ([[maybe_unused]] auto _ : st) {
+              benchmark::DoNotOptimize(c);
+              auto result = for_each_n(first, n, [](int& x) { x = std::clamp(x, 10, 100); });
+              benchmark::DoNotOptimize(result);
+            }
+          })
+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(8192)
+          ->Arg(1 << 20);
+    };
+    bm.operator()<std::vector<int>>("std::for_each_n(vector<int>)", std_for_each_n);
+    bm.operator()<std::deque<int>>("std::for_each_n(deque<int>)", std_for_each_n);
+    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
+    bm.operator()<std::deque<int>>("rng::for_each_n(deque<int>)", std::ranges::for_each_n);
+    bm.operator()<std::list<int>>("rng::for_each_n(list<int>)", std::ranges::for_each_n);
+  }
+
+  benchmark::Initialize(&argc, argv);
+  benchmark::RunSpecifiedBenchmarks();
+  benchmark::Shutdown();
+  return 0;
+}
diff --git a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
index dd026444330ea..beb4c7f675a6e 100644
--- a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
@@ -258,7 +258,7 @@ constexpr bool all_the_algorithms()
 int main(int, char**)
 {
     all_the_algorithms();
-    static_assert(all_the_algorithms());
+    // static_assert(all_the_algorithms());
 
     return 0;
 }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
index 371f6c92f1ed1..42f1a41a27096 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
@@ -13,69 +13,113 @@
 //    constexpr InputIterator      // constexpr after C++17
 //    for_each_n(InputIterator first, Size n, Function f);
 
-
 #include <algorithm>
 #include <cassert>
+#include <deque>
 #include <functional>
+#include <iterator>
+#include <ranges>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
 
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int expected[] = {3, 5, 8, 9};
-    const std::size_t N = 4;
+struct for_each_test {
+  TEST_CONSTEXPR for_each_test(int c) : count(c) {}
+  int count;
+  TEST_CONSTEXPR_CXX14 void operator()(int& i) {
+    ++i;
+    ++count;
+  }
+};
 
-    auto it = std::for_each_n(std::begin(ia), N, [](int &a) { a += 2; });
-    return it == (std::begin(ia) + N)
-        && std::equal(std::begin(ia), std::end(ia), std::begin(expected))
-        ;
-    }
-#endif
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
 
-struct for_each_test
-{
-    for_each_test(int c) : count(c) {}
-    int count;
-    void operator()(int& i) {++i; ++count;}
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
 };
 
-int main(int, char**)
-{
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
+TEST_CONSTEXPR_CXX20 bool test() {
+  {
     typedef cpp17_input_iterator<int*> Iter;
-    int ia[] = {0, 1, 2, 3, 4, 5};
-    const unsigned s = sizeof(ia)/sizeof(ia[0]);
+    int ia[]         = {0, 1, 2, 3, 4, 5};
+    const unsigned s = sizeof(ia) / sizeof(ia[0]);
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
-    assert(it == Iter(ia));
-    assert(f.count == 0);
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
+      assert(it == Iter(ia));
+      assert(f.count == 0);
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
 
-    assert(it == Iter(ia+s));
-    assert(f.count == s);
-    for (unsigned i = 0; i < s; ++i)
-        assert(ia[i] == static_cast<int>(i+1));
+      assert(it == Iter(ia + s));
+      assert(f.count == s);
+      for (unsigned i = 0; i < s; ++i)
+        assert(ia[i] == static_cast<int>(i + 1));
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
 
-    assert(it == Iter(ia+1));
-    assert(f.count == 1);
-    for (unsigned i = 0; i < 1; ++i)
-        assert(ia[i] == static_cast<int>(i+2));
+      assert(it == Iter(ia + 1));
+      assert(f.count == 1);
+      for (unsigned i = 0; i < 1; ++i)
+        assert(ia[i] == static_cast<int>(i + 2));
     }
+  }
+
+#if TEST_STD_VER > 11
+  {
+    int ia[]            = {1, 3, 6, 7};
+    int expected[]      = {3, 5, 8, 9};
+    const std::size_t N = 4;
+
+    auto it = std::for_each_n(std::begin(ia), N, [](int& a) { a += 2; });
+    assert(it == (std::begin(ia) + N) && std::equal(std::begin(ia), std::end(ia), std::begin(expected)));
+  }
+#endif
+
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+#if TEST_STD_VER >= 20
+  { // Make sure that the segmented iterator optimization works during constant evaluation
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::for_each_n(v.begin(), std::ranges::distance(v), [i = 0](int& a) mutable { assert(a == i++); });
+  }
+#endif
+
+  return true;
+}
 
+int main(int, char**) {
+  assert(test());
 #if TEST_STD_VER > 17
-    static_assert(test_constexpr());
+  static_assert(test());
 #endif
 
   return 0;
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
index 8b9b6e82cbcb2..2f4bfb9db6dba 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
@@ -20,7 +20,10 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
 #include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -30,7 +33,7 @@ struct Callable {
 };
 
 template <class Iter, class Sent = Iter>
-concept HasForEachIt = requires (Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
+concept HasForEachIt = requires(Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
 
 static_assert(HasForEachIt<int*>);
 static_assert(!HasForEachIt<InputIteratorNotDerivedFrom>);
@@ -47,7 +50,7 @@ static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotPredicate>);
 static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotCopyConstructible>);
 
 template <class Range>
-concept HasForEachR = requires (Range range) { std::ranges::for_each(range, Callable{}); };
+concept HasForEachR = requires(Range range) { std::ranges::for_each(range, Callable{}); };
 
 static_assert(HasForEachR<UncheckedRange<int*>>);
 static_assert(!HasForEachR<InputRangeNotDerivedFrom>);
@@ -68,7 +71,7 @@ constexpr void test_iterator() {
   { // simple test
     {
       auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      int a[]   = {1, 6, 3, 4};
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(Iter(a), Sent(Iter(a + 4)), func);
       assert(a[0] == 1);
@@ -81,8 +84,8 @@ constexpr void test_iterator() {
       assert(i == 4);
     }
     {
-      auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      auto func  = [i = 0](int& a) mutable { a += i++; };
+      int a[]    = {1, 6, 3, 4};
       auto range = std::ranges::subrange(Iter(a), Sent(Iter(a + 4)));
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(range, func);
@@ -110,6 +113,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each(d, deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>, sentinel_wrapper<cpp17_input_iterator<int*>>>();
   test_iterator<cpp20_input_iterator<int*>, sentinel_wrapper<cpp20_input_iterator<int*>>>();
@@ -146,6 +173,15 @@ constexpr bool test() {
     }
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each(v, [i = 0](int x) mutable { assert(x == 2 * i++); }, [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
index d4b2d053d08ce..ad1447b7348f5 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
@@ -17,7 +17,12 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
+#include <iterator>
 #include <ranges>
+#include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -27,7 +32,7 @@ struct Callable {
 };
 
 template <class Iter>
-concept HasForEachN = requires (Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
+concept HasForEachN = requires(Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
 
 static_assert(HasForEachN<int*>);
 static_assert(!HasForEachN<InputIteratorNotDerivedFrom>);
@@ -45,7 +50,7 @@ template <class Iter>
 constexpr void test_iterator() {
   { // simple test
     auto func = [i = 0](int& a) mutable { a += i++; };
-    int a[] = {1, 6, 3, 4};
+    int a[]   = {1, 6, 3, 4};
     std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> auto ret =
         std::ranges::for_each_n(Iter(a), 4, func);
     assert(a[0] == 1);
@@ -64,6 +69,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>>();
   test_iterator<cpp20_input_iterator<int*>>();
@@ -89,6 +118,19 @@ constexpr bool test() {
     assert(a[2].other == 6);
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each_n(
+        v.begin(),
+        std::ranges::distance(v),
+        [i = 0](int x) mutable { assert(x == 2 * i++); },
+        [](int x) { return 2 * x; });
+  }
+
   return true;
 }

ldionne

Thanks for the patch! I left some comments but I think this is going to be a nice optimization.

libcxx/test/libcxx/transitive_includes/cxx11.csv

libcxx/include/__algorithm/for_each.h

libcxx/include/__algorithm/ranges_for_each.h

libcxx/include/__algorithm/ranges_for_each_n.h

libcxx/include/__algorithm/for_each_n.h

libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp

libcxx/docs/ReleaseNotes/21.rst

libcxx/include/__algorithm/for_each_n_segment.h

libcxx/include/__algorithm/for_each.h

libcxx/test/benchmarks/algorithms/nonmodifying/for_each_join_view.bench.cpp

libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp

libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp

libcxx/include/__algorithm/ranges_for_each_n.h

libcxx/include/__algorithm/for_each_n.h

philnik777

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

winner245 · 2025-04-05T14:11:38Z

> I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

Thank you for your feedback! I agree that the scope of the patch has expanded beyond its original intent. Initially, the goal was only to extend the std::for_each optimization to its ranges variants, ranges::for_each{,_n}. However, as the review progressed, I also tried to address the inconsistent segmented iterator support between for_each_n and for_each, since the optimization for for_each_n covers C++03. I think back-porting the std::for_each optimization to C++03 could be useful, because we may be able to extend the optimization to other algorithms by having them simply forward to std::for_each (as per your comment in another PR).

However, I agree that this made the patch diverge from its original purpose and may complicate the review process. Following your suggestion, I will work on splitting it to make it clear what this patch focuses on.

-------------- Update --------------
As per your suggestion, I have split this into the following PRs, each focusing on an independent and self-contained subtask for the classical algorithms:

This separation allows the current PR to focus exclusively on the optimization of the ranges algorithms. I will rebase my current patch on the above split pieces once they have landed.

github-actions · 2025-05-22T21:53:28Z

✅ With the latest revision this PR passed the C/C++ code formatter.

winner245 · 2025-06-02T10:29:10Z

With std::for_each backported to C++11 in #134960 and std::for_each_n carved out into #135468, this PR is now much cleaner, focusing exclusively on std::ranges::{for_each, for_each_n}.

ldionne

LGTM once comments are addressed. Thanks a lot for this series of refactorings / optimizations!

ldionne · 2025-06-04T16:37:51Z

libcxx/docs/ReleaseNotes/21.rst

+  resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` and 24.9x for ``join_view`` of
+  ``vector<vector<char>>``.


We should report this optimization on the same line as the std::for_each optimization above -- I don't think there is much to be gained from having nearly-duplicate release notes since these algorithms are very similar. While we aim for a good level of completeness in our release notes, we also want to make them as useful to users as possible.

ldionne · 2025-06-04T16:53:56Z

libcxx/include/__algorithm/ranges_for_each.h

-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {


Suggested change

-    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {
+    // In the case where we have different iterator and sentinel types, the segmented iterator optimization
+    // in std::for_each will not kick in. Therefore, we prefer std::for_each_n in that case (whenever we can
+    // obtain the `n`).
+    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {

ldionne · 2025-06-04T16:56:36Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each.bench.cpp

+          ->Arg(1024)
+          ->Arg(4096)
          ->Arg(8192)
-          ->Arg(1 << 20);
+          ->Arg(1 << 14)
+          ->Arg(1 << 16)
+          ->Arg(1 << 18);


I believe it would be better to leave the old benchmark values in place. They are less comprehensive, but we need to balance comprehensiveness against the time it takes to run these benchmarks.

ldionne · 2025-06-04T16:56:45Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each.bench.cpp

+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(1024)
+          ->Arg(4096)
+          ->Arg(8192)
+          ->Arg(1 << 14)
+          ->Arg(1 << 16)
+          ->Arg(1 << 18);


Same here for the benchmark sizes.

ldionne · 2025-06-04T16:57:46Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp

    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);


Let's use the same numbers as for the std::for_each benchmarks.

frederick-vs-ja reviewed Mar 25, 2025

libcxx/include/__algorithm/for_each_n.h

winner245 force-pushed the for-each-segment branch from 49011aa to ba1d5d4 Compare March 25, 2025 15:31

winner245 marked this pull request as ready for review March 25, 2025 15:59

winner245 requested a review from a team as a code owner March 25, 2025 15:59

llvmbot added the libc++ label Mar 25, 2025

winner245 added the performance label Mar 25, 2025

ldionne reviewed Mar 25, 2025

winner245 mentioned this pull request Mar 26, 2025

[libc++] P3372R3: constexpr deque #128656

Open

winner245 force-pushed the for-each-segment branch from ba1d5d4 to c113266 Compare March 26, 2025 02:03

frederick-vs-ja reviewed Mar 26, 2025

libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp

winner245 force-pushed the for-each-segment branch from a7041cc to a2e451d Compare March 26, 2025 15:11

winner245 commented Mar 26, 2025

libcxx/include/__algorithm/for_each.h

winner245 force-pushed the for-each-segment branch 2 times, most recently from 16438be to 047acfd Compare March 27, 2025 01:08

ldionne requested changes Mar 27, 2025

winner245 force-pushed the for-each-segment branch 3 times, most recently from 0aad396 to 5a7b6eb Compare March 29, 2025 03:59

winner245 mentioned this pull request Mar 29, 2025

[libc++] Optimize {std,ranges}::{fill,fill_n} for segmented iterators #132665

Open

ldionne requested changes Apr 2, 2025

winner245 force-pushed the for-each-segment branch from 198fe3b to f5d13ab Compare April 3, 2025 16:28

winner245 mentioned this pull request Apr 3, 2025

[libc++] Fix __segmented_iterator_traits for implicit template instantiation in SFINAE #134304

Closed

winner245 commented Apr 3, 2025

libcxx/include/__algorithm/for_each_n.h

winner245 force-pushed the for-each-segment branch 3 times, most recently from d14bde4 to 8a5bcdc Compare April 5, 2025 02:43

philnik777 requested changes Apr 5, 2025

winner245 force-pushed the for-each-segment branch from 8a5bcdc to 5a225dd Compare May 22, 2025 21:50

winner245 force-pushed the for-each-segment branch from 5a225dd to b366e93 Compare May 22, 2025 22:36

winner245 added 11 commits June 1, 2025 21:47

Optimize ranges::{for_each, for_each_n} for segmented iterators

6f5e8ab

Address ldionne's review comments

296913e

Fix test and ADL call

b9a9606

Make for_each segmented iterator optimization valid for C++03

fb7748b

Allow transitive include of <optional> in affected headers

2a331a1

Remove unnecessary _AlgoPolicy template parameter

1ad983c

Apply optimization for join_view segmented iterators

5e5882b

Consistently extend segmented iterator optimization to ranges::for_each

f9278d2

Fix review comments

15f755a

Fix invoke call by using std::__invoke

c4fb935

Refactor to simplify logic of for_each_n_segment.h

216b957

winner245 force-pushed the for-each-segment branch from b366e93 to 216b957 Compare June 2, 2025 01:48

Merge branch 'main' into for-each-segment

275c254

ldionne approved these changes Jun 4, 2025
