Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize search_n for small values of n #5352

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Mar 22, 2025

⚙️ The optimization

Like I mentioned in #5346, both std::search_n and ranges::search_n make steps by n elements, and avoid going back for a good input (where there are few potential matches), so for large n values vectorization wouldn't be an improvement.

Still for small n, such that vector register width is larger than n, and therefore, the vector step is bigger, it is possible to vectorize in a way that would be faster even for an input with few matches. For more matches, such vectorization will have more advantage, as it would not need to go back.

The approach is to compare elements, get bit mask, and look for contiguous set of ones of proper length. @Alcaro suggested:

you can do things like

match &= match>>1
match &= match>>2
match &= match>>3

to find 7 consecutive 1s, but that probably does something ruinous to instruction parallelism and may cost more than it saves

Turns out this is efficient enough for AVX2 with the values of n up twice smaller than AVX register size in elements. Despite there seems to be indeed high cost of ruined parallelism, I cannot find anything faster.

The shift values are computed based on n. To save one variable (general purpose register), we rely on n=1 to be handled separately, and assume at least one shift to happen.

To deal with matches on vector register boundary, the bitmask is concatenated with the previous one. AVX bitmask is 32 bits for 32 bytes of AVX value, doubled it is 64 bit, still fits x64 register perfectly. The alternative to concatenation could be handling the boundary case with lzcnt/tzcnt, this turned out to be not faster.

The fallback is used for tails and too large n values. For tails it uses lzcnt with inverted carry value to have smooth transition from potential partial match in vectors to the scalar part. The fallback recreates ranges::search_n in <algorithm>, with slight variation.

🥔 Down-level architectures support

SSE4.2 version is implementable in both senses of backporting the current approach to SSE and using pcmpestri. I'd expect either to be of advantage for n values twice smaller than SSE register. Just feel like should not bother trying that.

x86 version works the same way as x64. However, unlike many other vectorization algorithms, this one relies a lot on general-purpose 64 bit integer ops. To mitigate the impact __ull_rshift is used instead of the plain shift. This intrinsic usage doesn't impact 64-bit code, but makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are of int type to match the intrinsic parameter type.

Still, the efficiency on x86 is questionable (see benchmark results below). Apart from having shifts in multiple instructions, it is apparently due to general purpose registers deficit. The compiler isn't being helpful here too, some register spills look superfluous.

For 32-bit and 64-bit elements, it is possible to use the floating point bit mask, instead of integer bit mask, like in #4987/#5092. This will save bit width. But apart from the mysterious "bypass delay" (integers and floats instructions mix potential penalty), it will also make the bit magic more complicated, more dependent on element width, and still won't reduce the bit width for 8-bit and 16-bit elements, so this doesn't seem to be worth doing.

We could just skip x86. But we don't have precedent of having vectorization for x64, but not having it for x86, so I didn't want to introduce one.

1️⃣ Special n=1 case

We need to handle this case as just find vectorization. find vectorization is more efficient than this one, plus the assumption that the shift happens at least once saves a variable/register.

The question is where we should handle this:

  1. Only in separately compiled code
  2. Only in headers
  3. Both in headers and in separately compiled code

The latter two are indistinguishable in practice, so the real question is, if we should:

  1. Handle it in separately compiled code, effectively reverting Use find for search_n when n=1 #5346 optimization
  2. Keep handling it in header

With removal n=1 case from headers we get:

  • Better throughput
  • Simpler header implementation

With keeping n=1 case in headers we get:

  • Some non-vectorization optimization for non-vector element types (I believe it is noticeable, but not like multiple times)
  • Some auto-vectorization from Clang and probably MSVC in future (Clang recognizes find pattern)
  • memchr for corresponding type and disabled vectorization mode

✅ Test coverage

To cover the variety of possibilities, the randomized test should try different input lengths, different n, and different actual matches lengths (including too long matches, too short matches, and different gap between matches). This has to have long run time, so it deserves a dedicated test.

The test coverage is not only useful for vectorization, it also compensates missing non-vectorization coverage, asked in #933.

This PR still doesn't fully address #933 as it is asked because:

  • It does not cover the forward-only iterator branches
  • It does not have features, like nice error case print, or seed parameter acceptance

I'm not sure how much these features are required, though. If they are required, further work to complete #933 would certainly need a different PR.

🏁 Benchmarks

In addition to the TwoZones case inherited from ##5346 , it has DenseSmallSequences.

These two are close to normal case and worst case respectively.

TwoZones (Zones in the table below) has half of range with mismatch character and half of rangers with match character. So the search should quickly proceed to the match part then check the first match which is successful.

DenseSmallSequences (Dense in the table below) has too short matches of random with from 0 to n-1 interrupted by a single mismatch character.

The vectorization improvement is more for DenseSmallSequences, but we should probably care about TwoZones somewhat more. If worst case is a priority, we can lift threshold for the vectorization twice.

⏱️ Benchmark results

Benchmark V. alg? x64 Before x64 After x64 ++ x86 Before x86 After x86 ++
u8/Std/Zones/3000/40 no 45.4 ns 46.7 ns 0.97 42.4 ns 66.8 ns 0.63
u8/Std/Zones/3000/18 no 63.0 ns 61.8 ns 1.02 62.1 ns 83.2 ns 0.75
u8/Std/Zones/3000/16 yes 66.0 ns 69.3 ns 0.95 77.8 ns 128 ns 0.61
u8/Std/Zones/3000/14 yes 68.7 ns 69.0 ns 1.00 71.7 ns 129 ns 0.56
u8/Std/Zones/3000/10 yes 85.8 ns 72.4 ns 1.19 93.9 ns 130 ns 0.72
u8/Std/Zones/3000/8 yes 103 ns 69.4 ns 1.48 113 ns 128 ns 0.88
u8/Std/Zones/3000/5 yes 157 ns 74.2 ns 2.12 171 ns 128 ns 1.34
u8/Std/Zones/3000/4 yes 189 ns 72.4 ns 2.61 210 ns 125 ns 1.68
u8/Std/Zones/3000/3 yes 250 ns 71.6 ns 3.49 272 ns 132 ns 2.06
u8/Std/Zones/3000/2 yes 368 ns 72.5 ns 5.08 402 ns 130 ns 3.09
u8/Std/Zones/3000/1 find 18.0 ns 18.2 ns 0.99 18.3 ns 21.8 ns 0.84
u8/Rng/Zones/3000/40 no 47.7 ns 45.8 ns 1.04 52.4 ns 66.9 ns 0.78
u8/Rng/Zones/3000/18 no 78.2 ns 60.8 ns 1.29 79.7 ns 83.7 ns 0.95
u8/Rng/Zones/3000/16 yes 84.9 ns 71.1 ns 1.19 85.6 ns 129 ns 0.66
u8/Rng/Zones/3000/14 yes 90.3 ns 71.4 ns 1.26 93.7 ns 128 ns 0.73
u8/Rng/Zones/3000/10 yes 118 ns 72.3 ns 1.63 118 ns 128 ns 0.92
u8/Rng/Zones/3000/8 yes 141 ns 71.7 ns 1.97 144 ns 128 ns 1.13
u8/Rng/Zones/3000/5 yes 215 ns 75.5 ns 2.85 212 ns 125 ns 1.70
u8/Rng/Zones/3000/4 yes 303 ns 72.9 ns 4.16 265 ns 129 ns 2.05
u8/Rng/Zones/3000/3 yes 346 ns 73.8 ns 4.69 344 ns 130 ns 2.65
u8/Rng/Zones/3000/2 yes 509 ns 74.8 ns 6.80 506 ns 129 ns 3.92
u8/Rng/Zones/3000/1 find 18.2 ns 18.4 ns 0.99 18.5 ns 18.7 ns 0.99
u8/Std/Dense/3000/40 no 818 ns 381 ns 2.15 823 ns 654 ns 1.26
u8/Std/Dense/3000/18 no 1006 ns 501 ns 2.01 1036 ns 774 ns 1.34
u8/Std/Dense/3000/16 yes 985 ns 135 ns 7.30 1022 ns 236 ns 4.33
u8/Std/Dense/3000/14 yes 987 ns 136 ns 7.26 1004 ns 244 ns 4.11
u8/Std/Dense/3000/10 yes 1071 ns 144 ns 7.44 1094 ns 245 ns 4.47
u8/Std/Dense/3000/8 yes 1140 ns 138 ns 8.26 1239 ns 246 ns 5.04
u8/Std/Dense/3000/5 yes 1301 ns 147 ns 8.85 1356 ns 279 ns 4.86
u8/Std/Dense/3000/4 yes 1303 ns 147 ns 8.86 1418 ns 243 ns 5.84
u8/Std/Dense/3000/3 yes 1300 ns 147 ns 8.84 1460 ns 248 ns 5.89
u8/Std/Dense/3000/2 yes 1191 ns 149 ns 7.99 1363 ns 244 ns 5.59
u8/Std/Dense/3000/1 find 49.4 ns 47.0 ns 1.05 48.3 ns 49.6 ns 0.97
u8/Rng/Dense/3000/40 no 830 ns 382 ns 2.17 584 ns 653 ns 0.89
u8/Rng/Dense/3000/18 no 813 ns 506 ns 1.61 622 ns 768 ns 0.81
u8/Rng/Dense/3000/16 yes 853 ns 143 ns 5.97 660 ns 237 ns 2.78
u8/Rng/Dense/3000/14 yes 843 ns 137 ns 6.15 665 ns 241 ns 2.76
u8/Rng/Dense/3000/10 yes 875 ns 138 ns 6.34 707 ns 243 ns 2.91
u8/Rng/Dense/3000/8 yes 936 ns 139 ns 6.73 771 ns 240 ns 3.21
u8/Rng/Dense/3000/5 yes 1057 ns 148 ns 7.14 858 ns 240 ns 3.58
u8/Rng/Dense/3000/4 yes 1155 ns 148 ns 7.80 876 ns 248 ns 3.53
u8/Rng/Dense/3000/3 yes 1240 ns 147 ns 8.44 889 ns 252 ns 3.53
u8/Rng/Dense/3000/2 yes 1096 ns 149 ns 7.36 1074 ns 251 ns 4.28
u8/Rng/Dense/3000/1 find 51.6 ns 49.4 ns 1.04 48.9 ns 48.6 ns 1.01
u16/Std/Zones/3000/40 no 41.2 ns 50.2 ns 0.82 46.1 ns 55.3 ns 0.83
u16/Std/Zones/3000/18 no 66.3 ns 69.2 ns 0.96 68.1 ns 76.4 ns 0.89
u16/Std/Zones/3000/16 no 71.0 ns 75.8 ns 0.94 75.3 ns 83.0 ns 0.91
u16/Std/Zones/3000/14 no 77.3 ns 83.5 ns 0.93 79.9 ns 92.0 ns 0.87
u16/Std/Zones/3000/10 no 97.1 ns 105 ns 0.92 103 ns 116 ns 0.89
u16/Std/Zones/3000/8 yes 117 ns 107 ns 1.09 126 ns 175 ns 0.72
u16/Std/Zones/3000/5 yes 166 ns 107 ns 1.55 195 ns 174 ns 1.12
u16/Std/Zones/3000/4 yes 194 ns 107 ns 1.81 231 ns 173 ns 1.34
u16/Std/Zones/3000/3 yes 270 ns 117 ns 2.31 309 ns 177 ns 1.75
u16/Std/Zones/3000/2 yes 385 ns 118 ns 3.26 438 ns 172 ns 2.55
u16/Std/Zones/3000/1 find 48.2 ns 48.9 ns 0.99 37.5 ns 52.1 ns 0.72
u16/Rng/Zones/3000/40 no 49.1 ns 49.7 ns 0.99 50.1 ns 55.0 ns 0.91
u16/Rng/Zones/3000/18 no 85.7 ns 70.1 ns 1.22 107 ns 76.6 ns 1.40
u16/Rng/Zones/3000/16 no 95.8 ns 81.1 ns 1.18 117 ns 83.9 ns 1.39
u16/Rng/Zones/3000/14 no 108 ns 84.1 ns 1.28 128 ns 91.5 ns 1.40
u16/Rng/Zones/3000/10 no 156 ns 103 ns 1.51 168 ns 115 ns 1.46
u16/Rng/Zones/3000/8 yes 185 ns 108 ns 1.71 202 ns 172 ns 1.17
u16/Rng/Zones/3000/5 yes 304 ns 108 ns 2.81 313 ns 171 ns 1.83
u16/Rng/Zones/3000/4 yes 377 ns 106 ns 3.56 394 ns 174 ns 2.26
u16/Rng/Zones/3000/3 yes 500 ns 118 ns 4.24 518 ns 172 ns 3.01
u16/Rng/Zones/3000/2 yes 734 ns 118 ns 6.22 747 ns 173 ns 4.32
u16/Rng/Zones/3000/1 find 47.3 ns 48.9 ns 0.97 37.8 ns 51.1 ns 0.74
u16/Std/Dense/3000/40 no 827 ns 385 ns 2.15 854 ns 422 ns 2.02
u16/Std/Dense/3000/18 no 964 ns 499 ns 1.93 994 ns 554 ns 1.79
u16/Std/Dense/3000/16 no 1019 ns 528 ns 1.93 1010 ns 568 ns 1.78
u16/Std/Dense/3000/14 no 1043 ns 577 ns 1.81 1064 ns 585 ns 1.82
u16/Std/Dense/3000/10 no 1202 ns 695 ns 1.73 1186 ns 708 ns 1.68
u16/Std/Dense/3000/8 yes 1308 ns 211 ns 6.20 1268 ns 337 ns 3.76
u16/Std/Dense/3000/5 yes 1514 ns 210 ns 7.21 1490 ns 340 ns 4.38
u16/Std/Dense/3000/4 yes 1494 ns 211 ns 7.08 1458 ns 355 ns 4.11
u16/Std/Dense/3000/3 yes 1438 ns 232 ns 6.20 1398 ns 365 ns 3.83
u16/Std/Dense/3000/2 yes 1136 ns 232 ns 4.90 1271 ns 346 ns 3.67
u16/Std/Dense/3000/1 find 74.3 ns 76.0 ns 0.98 87.2 ns 89.2 ns 0.98
u16/Rng/Dense/3000/40 no 423 ns 388 ns 1.09 526 ns 524 ns 1.00
u16/Rng/Dense/3000/18 no 549 ns 506 ns 1.08 643 ns 702 ns 0.92
u16/Rng/Dense/3000/16 no 576 ns 528 ns 1.09 702 ns 619 ns 1.13
u16/Rng/Dense/3000/14 no 599 ns 576 ns 1.04 701 ns 634 ns 1.11
u16/Rng/Dense/3000/10 no 699 ns 702 ns 1.00 779 ns 764 ns 1.02
u16/Rng/Dense/3000/8 yes 769 ns 216 ns 3.56 894 ns 347 ns 2.58
u16/Rng/Dense/3000/5 yes 874 ns 211 ns 4.14 1002 ns 341 ns 2.94
u16/Rng/Dense/3000/4 yes 1023 ns 210 ns 4.87 1110 ns 339 ns 3.27
u16/Rng/Dense/3000/3 yes 1320 ns 232 ns 5.69 1260 ns 344 ns 3.66
u16/Rng/Dense/3000/2 yes 1823 ns 233 ns 7.82 1769 ns 344 ns 5.14
u16/Rng/Dense/3000/1 find 75.7 ns 73.7 ns 1.03 83.0 ns 90.3 ns 0.92
u32/Std/Zones/3000/40 no 44.4 ns 43.7 ns 1.02 45.3 ns 58.1 ns 0.78
u32/Std/Zones/3000/18 no 61.9 ns 64.2 ns 0.96 71.3 ns 79.4 ns 0.90
u32/Std/Zones/3000/16 no 64.6 ns 69.5 ns 0.93 74.8 ns 84.0 ns 0.89
u32/Std/Zones/3000/14 no 72.6 ns 76.1 ns 0.95 81.1 ns 103 ns 0.79
u32/Std/Zones/3000/10 no 90.9 ns 96.3 ns 0.94 103 ns 129 ns 0.80
u32/Std/Zones/3000/8 no 113 ns 116 ns 0.97 126 ns 153 ns 0.82
u32/Std/Zones/3000/5 no 167 ns 176 ns 0.95 186 ns 237 ns 0.78
u32/Std/Zones/3000/4 yes 196 ns 162 ns 1.21 228 ns 230 ns 0.99
u32/Std/Zones/3000/3 yes 262 ns 162 ns 1.62 302 ns 232 ns 1.30
u32/Std/Zones/3000/2 yes 393 ns 163 ns 2.41 440 ns 229 ns 1.92
u32/Std/Zones/3000/1 find 80.1 ns 80.3 ns 1.00 70.5 ns 75.8 ns 0.93
u32/Rng/Zones/3000/40 no 49.2 ns 42.4 ns 1.16 52.4 ns 53.4 ns 0.98
u32/Rng/Zones/3000/18 no 101 ns 59.0 ns 1.71 100 ns 77.2 ns 1.30
u32/Rng/Zones/3000/16 no 110 ns 68.7 ns 1.60 110 ns 82.0 ns 1.34
u32/Rng/Zones/3000/14 no 125 ns 75.9 ns 1.65 122 ns 101 ns 1.21
u32/Rng/Zones/3000/10 no 159 ns 95.9 ns 1.66 162 ns 127 ns 1.28
u32/Rng/Zones/3000/8 no 194 ns 118 ns 1.64 198 ns 154 ns 1.29
u32/Rng/Zones/3000/5 no 302 ns 175 ns 1.73 313 ns 232 ns 1.35
u32/Rng/Zones/3000/4 yes 374 ns 163 ns 2.29 381 ns 231 ns 1.65
u32/Rng/Zones/3000/3 yes 494 ns 163 ns 3.03 511 ns 232 ns 2.20
u32/Rng/Zones/3000/2 yes 732 ns 162 ns 4.52 756 ns 233 ns 3.24
u32/Rng/Zones/3000/1 find 80.4 ns 80.3 ns 1.00 70.6 ns 74.3 ns 0.95
u32/Std/Dense/3000/40 no 921 ns 360 ns 2.56 821 ns 534 ns 1.54
u32/Std/Dense/3000/18 no 1171 ns 455 ns 2.57 995 ns 593 ns 1.68
u32/Std/Dense/3000/16 no 1187 ns 475 ns 2.50 978 ns 624 ns 1.57
u32/Std/Dense/3000/14 no 1212 ns 509 ns 2.38 1000 ns 659 ns 1.52
u32/Std/Dense/3000/10 no 1337 ns 605 ns 2.21 1059 ns 865 ns 1.22
u32/Std/Dense/3000/8 no 1463 ns 689 ns 2.12 1119 ns 952 ns 1.18
u32/Std/Dense/3000/5 no 1547 ns 849 ns 1.82 1268 ns 1131 ns 1.12
u32/Std/Dense/3000/4 yes 1460 ns 334 ns 4.37 1332 ns 470 ns 2.83
u32/Std/Dense/3000/3 yes 1442 ns 333 ns 4.33 1406 ns 475 ns 2.96
u32/Std/Dense/3000/2 yes 1149 ns 333 ns 3.45 1211 ns 477 ns 2.54
u32/Std/Dense/3000/1 find 163 ns 158 ns 1.03 151 ns 153 ns 0.99
u32/Rng/Dense/3000/40 no 496 ns 357 ns 1.39 553 ns 537 ns 1.03
u32/Rng/Dense/3000/18 no 600 ns 458 ns 1.31 672 ns 585 ns 1.15
u32/Rng/Dense/3000/16 no 638 ns 473 ns 1.35 665 ns 623 ns 1.07
u32/Rng/Dense/3000/14 no 665 ns 511 ns 1.30 687 ns 664 ns 1.03
u32/Rng/Dense/3000/10 no 777 ns 613 ns 1.27 784 ns 856 ns 0.92
u32/Rng/Dense/3000/8 no 857 ns 688 ns 1.25 873 ns 961 ns 0.91
u32/Rng/Dense/3000/5 no 991 ns 852 ns 1.16 1013 ns 1127 ns 0.90
u32/Rng/Dense/3000/4 yes 1105 ns 343 ns 3.22 1050 ns 470 ns 2.23
u32/Rng/Dense/3000/3 yes 1258 ns 337 ns 3.73 1275 ns 474 ns 2.69
u32/Rng/Dense/3000/2 yes 1751 ns 337 ns 5.20 1863 ns 470 ns 3.96
u32/Rng/Dense/3000/1 find 160 ns 159 ns 1.01 157 ns 152 ns 1.03
u64/Std/Zones/3000/40 no 40.9 ns 50.5 ns 0.81 48.3 ns 54.5 ns 0.89
u64/Std/Zones/3000/18 no 58.2 ns 74.7 ns 0.78 68.8 ns 76.3 ns 0.90
u64/Std/Zones/3000/16 no 63.3 ns 82.8 ns 0.76 73.9 ns 83.7 ns 0.88
u64/Std/Zones/3000/14 no 68.0 ns 90.9 ns 0.75 79.9 ns 92.9 ns 0.86
u64/Std/Zones/3000/10 no 87.0 ns 114 ns 0.76 102 ns 118 ns 0.86
u64/Std/Zones/3000/8 no 106 ns 140 ns 0.76 124 ns 143 ns 0.87
u64/Std/Zones/3000/5 no 168 ns 209 ns 0.80 187 ns 223 ns 0.84
u64/Std/Zones/3000/4 no 192 ns 259 ns 0.74 230 ns 292 ns 0.79
u64/Std/Zones/3000/3 no 266 ns 332 ns 0.80 295 ns 354 ns 0.83
u64/Std/Zones/3000/2 yes 372 ns 248 ns 1.50 434 ns 395 ns 1.10
u64/Std/Zones/3000/1 find 152 ns 151 ns 1.01 157 ns 158 ns 0.99
u64/Rng/Zones/3000/40 no 66.2 ns 49.5 ns 1.34 59.0 ns 55.9 ns 1.06
u64/Rng/Zones/3000/18 no 105 ns 73.8 ns 1.42 101 ns 74.9 ns 1.35
u64/Rng/Zones/3000/16 no 117 ns 81.2 ns 1.44 111 ns 83.4 ns 1.33
u64/Rng/Zones/3000/14 no 130 ns 90.4 ns 1.44 126 ns 91.9 ns 1.37
u64/Rng/Zones/3000/10 no 171 ns 112 ns 1.53 161 ns 118 ns 1.36
u64/Rng/Zones/3000/8 no 209 ns 137 ns 1.53 201 ns 141 ns 1.43
u64/Rng/Zones/3000/5 no 325 ns 204 ns 1.59 312 ns 211 ns 1.48
u64/Rng/Zones/3000/4 no 402 ns 251 ns 1.60 381 ns 262 ns 1.45
u64/Rng/Zones/3000/3 no 531 ns 333 ns 1.59 501 ns 346 ns 1.45
u64/Rng/Zones/3000/2 yes 796 ns 242 ns 3.29 746 ns 357 ns 2.09
u64/Rng/Zones/3000/1 find 149 ns 150 ns 0.99 152 ns 150 ns 1.01
u64/Std/Dense/3000/40 no 936 ns 384 ns 2.44 1172 ns 578 ns 2.03
u64/Std/Dense/3000/18 no 1122 ns 545 ns 2.06 1244 ns 601 ns 2.07
u64/Std/Dense/3000/16 no 1177 ns 545 ns 2.16 1239 ns 622 ns 1.99
u64/Std/Dense/3000/14 no 1208 ns 597 ns 2.02 1225 ns 632 ns 1.94
u64/Std/Dense/3000/10 no 1320 ns 766 ns 1.72 1293 ns 666 ns 1.94
u64/Std/Dense/3000/8 no 1426 ns 910 ns 1.57 1395 ns 735 ns 1.90
u64/Std/Dense/3000/5 no 1582 ns 1075 ns 1.47 1606 ns 966 ns 1.66
u64/Std/Dense/3000/4 no 1488 ns 1219 ns 1.22 1673 ns 1095 ns 1.53
u64/Std/Dense/3000/3 no 1506 ns 1296 ns 1.16 1702 ns 1126 ns 1.51
u64/Std/Dense/3000/2 yes 1464 ns 470 ns 3.11 1457 ns 689 ns 2.11
u64/Std/Dense/3000/1 find 285 ns 303 ns 0.94 288 ns 291 ns 0.99
u64/Rng/Dense/3000/40 no 576 ns 384 ns 1.50 483 ns 578 ns 0.84
u64/Rng/Dense/3000/18 no 876 ns 546 ns 1.60 521 ns 590 ns 0.88
u64/Rng/Dense/3000/16 no 866 ns 540 ns 1.60 546 ns 624 ns 0.88
u64/Rng/Dense/3000/14 no 883 ns 593 ns 1.49 559 ns 625 ns 0.89
u64/Rng/Dense/3000/10 no 944 ns 773 ns 1.22 632 ns 674 ns 0.94
u64/Rng/Dense/3000/8 no 1052 ns 914 ns 1.15 743 ns 731 ns 1.02
u64/Rng/Dense/3000/5 no 1071 ns 1079 ns 0.99 970 ns 950 ns 1.02
u64/Rng/Dense/3000/4 no 1136 ns 1224 ns 0.93 1123 ns 1096 ns 1.02
u64/Rng/Dense/3000/3 no 1303 ns 1282 ns 1.02 1317 ns 1130 ns 1.17
u64/Rng/Dense/3000/2 yes 1761 ns 474 ns 3.72 1801 ns 691 ns 2.61
u64/Rng/Dense/3000/1 find 286 ns 302 ns 0.95 290 ns 296 ns 0.98

🥈 Results interpretation

For x64 and for the vectorized n there is a certain improvement for Zones. For Dense the improvement is even greater.

The non-vectorized cases vary a lot, The fallback happen to be faster than header implementation often, but not always. Out of the header implementations, surprisingly, the ranges one is slower for Zones case.

The x86 results are not very good, but not too bad either.

The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.

@StephanTLavavej StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews Mar 22, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 22, 2025
@AlexGuteniev

This comment was marked as resolved.

@AlexGuteniev AlexGuteniev marked this pull request as ready for review March 25, 2025 05:29
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 25, 2025 05:29
# Conflicts:
#	benchmarks/src/search_n.cpp
#	stl/inc/algorithm
#	stl/src/vector_algorithms.cpp
@StephanTLavavej StephanTLavavej self-assigned this Mar 25, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Mar 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

2 participants