Vectorize search_n for small values of n #5352
Open

AlexGuteniev wants to merge 21 commits into microsoft:main from AlexGuteniev:search_n
⚙️ The optimization
Like I mentioned in #5346, both `std::search_n` and `ranges::search_n` step by n elements and avoid going back for a good input (one with few potential matches), so for large n values vectorization wouldn't be an improvement. Still, for small n, such that the vector register width is larger than n and therefore the vector step is bigger, it is possible to vectorize in a way that is faster even for an input with few matches. For inputs with more matches, such vectorization has an even bigger advantage, as it does not need to go back.
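As a rough illustration of that stepping behavior, here is a hypothetical scalar sketch (made-up name, not the actual `<algorithm>` code); note how a probe miss lets the scan jump a full n elements ahead:

```cpp
#include <cstddef>

// Hypothetical illustration only: probe every n-th element, and scan the
// neighborhood only when a probe hits, so an input with few matches touches
// few elements. Requires random-access iterators for brevity.
template <class It, class T>
It search_n_stepping_sketch(It first, It last, std::ptrdiff_t n, const T& val) {
    const std::ptrdiff_t len = last - first;
    if (n <= 0) {
        return first; // same convention as std::search_n for count <= 0
    }
    std::ptrdiff_t pos = n - 1; // first probe: last element of the first possible run
    while (pos < len) {
        if (first[pos] != val) {
            pos += n; // probe missed: no run of n can contain it, jump a full stride
            continue;
        }
        // Probe hit: extend backwards, then forwards, around the probe.
        std::ptrdiff_t lo = pos;
        while (lo > 0 && first[lo - 1] == val) {
            --lo;
        }
        std::ptrdiff_t hi = pos + 1;
        while (hi - lo < n && hi < len && first[hi] == val) {
            ++hi;
        }
        if (hi - lo >= n) {
            return first + lo; // earliest start of a run of at least n matches
        }
        pos = hi + n; // restart the stride just past the mismatch (or the end) at 'hi'
    }
    return last;
}
```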
The vectorized approach is to compare elements, get a bit mask, and look for a contiguous run of set bits of the proper length, following a suggestion from @Alcaro. It turns out this is efficient enough for AVX2 with values of n up to half the AVX register width in elements. Although there does indeed seem to be a high cost from the ruined parallelism, I cannot find anything faster.
The shift values are computed based on n. To save one variable (a general-purpose register), we rely on n=1 being handled separately and assume that at least one shift happens.
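A minimal sketch of the bit trick (made-up name; the real code precomputes the shift amounts from n rather than looping over them):

```cpp
#include <cstdint>

// Hypothetical sketch: after this reduction, bit i of the result is set only
// if bits i .. i+n-1 were all set in the input, i.e. only at positions where a
// run of n matches begins. Assumes n >= 2, since n == 1 is handled separately
// as a plain find.
std::uint64_t runs_of_n(std::uint64_t mask, int n) {
    int covered = 1; // each surviving bit certifies a run of this length so far
    while (2 * covered <= n) {
        mask &= mask >> covered; // doubles the certified run length
        covered *= 2;
    }
    if (covered < n) {
        mask &= mask >> (n - covered); // top up to exactly n
    }
    return mask;
}
```

The start of the first full run is then the index of the lowest set bit of the result (for example via `_tzcnt_u64`), and for n >= 2 at least one AND-and-shift pair always executes, which matches the assumption above.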
To deal with matches on a vector register boundary, the bit mask is concatenated with the previous one. An AVX bit mask is 32 bits for 32 bytes of AVX value; doubled, it is 64 bits, which still fits an x64 register perfectly. The alternative to concatenation could be handling the boundary case with `lzcnt`/`tzcnt`; this turned out to be not faster.

The fallback is used for tails and for too-large n values. For tails it uses `lzcnt` with an inverted carry value to have a smooth transition from a potential partial match in the vector part to the scalar part. The fallback recreates `ranges::search_n` from `<algorithm>`, with slight variation.
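To make the boundary handling concrete, here is a hypothetical sketch of a main loop for 8-bit elements, reusing the `runs_of_n` helper from the sketch above (illustrative names, not the code in this PR):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

std::uint64_t runs_of_n(std::uint64_t mask, int n); // helper from the sketch above

// Hypothetical sketch: each 32-byte block yields a 32-bit comparison mask,
// which is concatenated with the previous block's mask into 64 bits, so a run
// that straddles the register boundary stays visible. Assumes 2 <= n <= 16
// (half of the register width in elements), so a vectorizable run always fits
// in one 64-bit window.
const std::uint8_t* search_n_avx2_sketch(
    const std::uint8_t* first, std::size_t size, std::uint8_t val, int n) {
    const __m256i needle = _mm256_set1_epi8(static_cast<char>(val));
    std::uint64_t prev   = 0; // comparison mask of the previous 32-byte block
    for (std::size_t pos = 0; pos + 32 <= size; pos += 32) {
        const __m256i data =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first + pos));
        const auto cur =
            static_cast<std::uint32_t>(_mm256_movemask_epi8(_mm256_cmpeq_epi8(data, needle)));
        const std::uint64_t window = (static_cast<std::uint64_t>(cur) << 32) | prev;
        const std::uint64_t runs   = runs_of_n(window, n);
        if (runs != 0) {
            // Bit 0 of the window corresponds to the element at pos - 32; on the
            // first iteration prev == 0, so any run found starts at bit 32 or above.
            const std::size_t start = static_cast<std::size_t>(_tzcnt_u64(runs));
            return first + (pos + start - 32);
        }
        prev = cur; // the current block becomes the low half of the next window
    }
    return nullptr; // the tail (and too-large n) goes to the scalar fallback
}
```

A run that only finishes in the next block is picked up on the next iteration, when the current mask has become the low half of the window.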
🥔 Down-level architectures support
An SSE4.2 version is implementable in both senses: backporting the current approach to SSE, or using `pcmpestri`. I'd expect either to be an advantage for n values up to half the SSE register width in elements. I just feel like I should not bother trying that.

The x86 version works the same way as x64. However, unlike many other vectorization algorithms, this one relies a lot on general-purpose 64-bit integer operations. To mitigate the impact, `__ull_rshift` is used instead of the plain shift. This intrinsic doesn't affect 64-bit code, but makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are of `int` type to match the intrinsic's parameter type.
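As a sketch of that substitution (a hypothetical wrapper, only illustrating the intent):

```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical wrapper: on 64-bit targets this compiles to a plain shift,
// while on 32-bit MSVC __ull_rshift lets the compiler emit a shorter sequence
// because it doesn't have to handle large shift counts (never needed here).
inline std::uint64_t shift_right(std::uint64_t mask, int shift) {
#ifdef _MSC_VER
    return __ull_rshift(mask, shift);
#else
    return mask >> shift; // portable fallback for other compilers
#endif
}
```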
Still, the efficiency on x86 is questionable (see benchmark results below). Apart from needing multiple instructions per shift, it is apparently due to the deficit of general-purpose registers. The compiler isn't being helpful here either; some register spills look superfluous.
For 32-bit and 64-bit elements, it is possible to use the floating-point bit mask instead of the integer bit mask, like in #4987/#5092. This would reduce the mask width (one bit per element instead of one bit per byte). But apart from the mysterious "bypass delay" (a potential penalty for mixing integer and floating-point instructions), it would also make the bit magic more complicated and more dependent on the element width, and it still wouldn't reduce the mask width for 8-bit and 16-bit elements, so this doesn't seem to be worth doing.
We could just skip x86. But we don't have a precedent of having vectorization for x64 but not for x86, so I didn't want to introduce one.
1️⃣ Special n=1 case
We need to handle this case as just `find` vectorization. The `find` vectorization is more efficient than this one, plus the assumption that the shift happens at least once saves a variable/register.

The question is where we should handle this:
The latter two are indistinguishable in practice, so the real question is whether we should keep the "`find` for `search_n` when n=1" (#5346) optimization.

With removal of the n=1 case from the headers we get:

With keeping the n=1 case in the headers we get:
- the `find` pattern
- `memchr` for the corresponding type and the disabled-vectorization mode

✅ Test coverage
To cover the variety of possibilities, the randomized test should try different input lengths, different n, and different actual match lengths (including too-long matches, too-short matches, and different gaps between matches). This requires a long run time, so it deserves a dedicated test.
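A minimal sketch of what such a randomized check could look like (hypothetical, not the test in this PR), comparing `std::search_n` against a naive reference on inputs with varied run lengths and gaps:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Obviously-correct reference: try every starting position.
template <class It, class T>
It search_n_reference(It first, It last, std::ptrdiff_t n, const T& val) {
    for (It it = first; last - it >= n; ++it) {
        if (std::all_of(it, it + n, [&](const T& e) { return e == val; })) {
            return it;
        }
    }
    return last;
}

int main() {
    std::mt19937 gen(1729);
    for (int iteration = 0; iteration < 10000; ++iteration) {
        const std::ptrdiff_t n = std::uniform_int_distribution<std::ptrdiff_t>(1, 20)(gen);
        const std::size_t size = std::uniform_int_distribution<std::size_t>(0, 300)(gen);
        // Runs of matches of varying length (some too short, some long enough),
        // separated by single mismatch characters, to hit boundary cases.
        std::vector<std::uint8_t> v;
        while (v.size() < size) {
            const auto run = std::uniform_int_distribution<std::size_t>(
                0, static_cast<std::size_t>(n) + 2)(gen);
            v.insert(v.end(), run, std::uint8_t{'x'});
            v.push_back(std::uint8_t{'.'});
        }
        v.resize(size, std::uint8_t{'.'});
        const auto expected = search_n_reference(v.begin(), v.end(), n, std::uint8_t{'x'});
        const auto actual   = std::search_n(v.begin(), v.end(), n, std::uint8_t{'x'});
        assert(actual == expected);
    }
    return 0;
}
```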
The test coverage is not only useful for the vectorization; it also compensates for the missing non-vectorization coverage asked for in #933.

This PR still doesn't fully address #933 as asked because:
I'm not sure how much these features are required, though. If they are required, further work to complete #933 would certainly need a different PR.
🏁 Benchmarks
In addition to the `TwoZones` case inherited from #5346, it has `DenseSmallSequences`. These two are close to the normal case and the worst case, respectively.
`TwoZones` (Zones in the table below) has half of the range filled with the mismatch character and half of the range filled with the match character, so the search should quickly proceed to the match part and then check the first match, which is successful.

`DenseSmallSequences` (Dense in the table below) has too-short matches of random width from 0 to n-1, interrupted by single mismatch characters.

The vectorization improvement is bigger for `DenseSmallSequences`, but we should probably care about `TwoZones` somewhat more. If the worst case is a priority, we can raise the threshold for the vectorization by a factor of two.
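As a hypothetical sketch of the two input shapes (illustrative only, not the actual benchmark code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// "TwoZones": the first half is all mismatches, the second half all matches,
// so the search skips quickly to the second half and succeeds immediately.
std::vector<std::uint8_t> make_two_zones(std::size_t size) {
    std::vector<std::uint8_t> v(size, std::uint8_t{'.'});
    std::fill(v.begin() + static_cast<std::ptrdiff_t>(size / 2), v.end(), std::uint8_t{'x'});
    return v;
}

// "DenseSmallSequences": runs of matches of random width 0 .. n-1 (always too
// short), each terminated by a single mismatch, so the search keeps hitting
// near-misses all the way to the end.
std::vector<std::uint8_t> make_dense_small_sequences(std::size_t size, std::ptrdiff_t n,
                                                     std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> width(0, static_cast<std::size_t>(n - 1));
    std::vector<std::uint8_t> v;
    while (v.size() < size) {
        v.insert(v.end(), width(gen), std::uint8_t{'x'});
        v.push_back(std::uint8_t{'.'});
    }
    v.resize(size, std::uint8_t{'.'});
    return v;
}
```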
⏱️ Benchmark results
🥈 Results interpretation
For x64 and for the vectorized n values there is a certain improvement for Zones. For Dense the improvement is even greater.

The non-vectorized cases vary a lot. The fallback often happens to be faster than the header implementation, but not always. Of the header implementations, surprisingly, the ranges one is slower for the Zones case.
The x86 results are not very good, but not too bad either.
The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.