Vectorize `remove_copy` and `unique_copy` #5355

AlexGuteniev · 2025-03-23T14:39:49Z

⚙️ The optimization

`remove_copy` and `unique_copy` differ from their non-`_copy` counterparts in that they have no room they are allowed to overwrite: the destination may only receive the elements that belong in the output. This means we can't store results directly from vector registers.

The previous attempt, #5062, tried to use masked stores to bypass that limitation. Unfortunately, that doesn't perform well on some CPUs. Also, the minimum granularity of an AVX2 masked store is 32 bits, so it would not work for smaller elements.

This time, temporary storage comes to the rescue. The algorithms already use some additional memory (the tables), so why not use a bit more? I arbitrarily picked 512 bytes, which should not be too much. Each time the temporary buffer fills up, it is copied to the destination with `memcpy`, which should be fast enough for this buffer size.
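To make the buffering idea concrete, here is a minimal scalar sketch, not the code from this PR. The names `buffered_remove_copy` and `kTempBytes` are made up for illustration, and the real implementation presumably appends whole compacted vector registers to the buffer rather than single elements. The point is only that the destination is written exclusively through `memcpy` of elements that belong in the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical constant: the PR description only says "512 bytes".
constexpr std::size_t kTempBytes = 512;

// Scalar sketch: copy [first, last) to dest, skipping elements equal to val.
// Survivors are staged in a local buffer and flushed with memcpy, so dest
// never receives anything except the elements that belong in the output.
template <class T>
T* buffered_remove_copy(const T* first, const T* last, T* dest, const T val) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy needs trivially copyable elements");
    constexpr std::size_t capacity = kTempBytes / sizeof(T);
    static_assert(capacity > 0, "element too large for the staging buffer");

    T temp[capacity];
    std::size_t used = 0;

    for (; first != last; ++first) {
        if (!(*first == val)) {
            temp[used++] = *first;
            if (used == capacity) { // buffer is full: flush it to the destination
                std::memcpy(dest, temp, used * sizeof(T));
                dest += used;
                used = 0;
            }
        }
    }

    assert(used < capacity); // a full buffer is always flushed inside the loop
    if (used != 0) { // flush the partially filled tail
        std::memcpy(dest, temp, used * sizeof(T));
        dest += used;
    }
    return dest;
}
```

A vectorized version can freely write full registers into `temp`, because `temp` is scratch space; only the count of valid elements is ever copied out.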

🚫 No `find` before `remove_copy`

In #4987, it was explained that doing `find` before `remove` is good for both correctness and performance. Originally that was in the vectorization code, but during the review @StephanTLavavej observed that it is already done in the headers (#4987 (comment)).

For `remove_copy`, it is not necessary for correctness and may be harmful for performance. `find` would need to be followed by a copy of the range before the match, which makes a double pass over that part of the input. This can make performance worse for large inputs in memory-bound situations.
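The shape being avoided can be sketched as follows. `remove_copy_find_first` is a hypothetical helper written only to show the double pass; it is not code from the STL or from this PR.

```cpp
#include <algorithm>

// Hypothetical illustration of "find first, then remove_copy".
// The result is the same as plain std::remove_copy, but the range
// before the first match is read twice: once by find, once by copy.
template <class InIt, class OutIt, class T>
OutIt remove_copy_find_first(InIt first, InIt last, OutIt dest, const T& val) {
    const InIt match = std::find(first, last, val); // pass 1 over the prefix
    dest = std::copy(first, match, dest);           // pass 2 over the same prefix
    return std::remove_copy(match, last, dest, val);
}
```

If the input contains no match at all, the whole input is traversed twice, which is the worst case for a large, memory-bound range.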

We could have special handling of the range before the first match in the vectorization code. That is another story, and it would not be harmful, but I'm not doing it in the current PR. Maybe later.

So, since we have not called `find`, and so have not checked whether the value can even match the iterator value type, we need the `_Could_compare_equal_to_value_type` check here.
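To illustrate why such a check matters, here is a minimal sketch for integer types only. `could_compare_equal` and `checked_remove_copy` are hypothetical names, and the round-trip test below is one way to express the idea, not the STL's actual `_Could_compare_equal_to_value_type`. The assumption is that a vectorized routine receives the value narrowed to the element type, so a value such as 300 searched in `unsigned char` data would otherwise wrongly match 44.

```cpp
#include <algorithm>
#include <type_traits>

// Hypothetical check (integers only): can any Elem compare equal to value?
// Narrow the value to Elem and see whether it still compares equal.
// Note: before C++20 an out-of-range conversion to a signed type is
// implementation-defined; mainstream compilers wrap modulo 2^N.
template <class Elem, class Val>
constexpr bool could_compare_equal(const Val value) {
    static_assert(std::is_integral<Elem>::value && std::is_integral<Val>::value,
        "this sketch covers integer types only");
    return static_cast<Elem>(value) == value;
}

// Hypothetical wrapper: only narrow the value when a match is possible.
template <class Elem, class Val>
Elem* checked_remove_copy(const Elem* first, const Elem* last, Elem* dest, const Val value) {
    if (!could_compare_equal<Elem>(value)) {
        return std::copy(first, last, dest); // nothing can match, keep everything
    }
    return std::remove_copy(first, last, dest, static_cast<Elem>(value));
}
```

With `unsigned char` elements, `checked_remove_copy(..., 300)` copies the input unchanged, while `checked_remove_copy(..., 44)` removes every 44.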

✅ Test coverage

Shared with the non-`_copy` counterparts to save total test run time and some lines of code, at the expense of otherwise unnecessary coupling.

We check both the modified and unmodified parts of the destination, to make sure the unmodified part was indeed left untouched.
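A check of this kind might look like the sketch below. The helper name and the sentinel approach are assumptions for illustration; the actual test code in the PR may be organized differently.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical test helper: pre-fill the destination with a sentinel, run
// std::remove_copy, then verify both the written part and the untouched tail.
// The sentinel should differ from every input element for the check to be strict.
template <class T>
bool remove_copy_leaves_tail_untouched(const std::vector<T>& input, const T val, const T sentinel) {
    // Reference result from a plain loop.
    std::vector<T> expected;
    for (const T& x : input) {
        if (!(x == val)) {
            expected.push_back(x);
        }
    }

    std::vector<T> dest(input.size(), sentinel);
    const auto out_end = std::remove_copy(input.begin(), input.end(), dest.begin(), val);

    // Modified part: exactly the expected elements, in order.
    const bool modified_ok = std::equal(dest.begin(), out_end, expected.begin(), expected.end());
    // Unmodified part: every element past the returned iterator is still the sentinel.
    const bool unmodified_ok =
        std::all_of(out_end, dest.end(), [&](const T& x) { return x == sentinel; });
    return modified_ok && unmodified_ok;
}
```

The tail check is what would catch an implementation that stores a full vector register past the end of the output.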

⏱️ Benchmark results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `rc<alg_type::std_fn, std::uint8_t>` | 908 ns | 349 ns | 2.60 |
| `rc<alg_type::std_fn, std::uint16_t>` | 1850 ns | 462 ns | 4.00 |
| `rc<alg_type::std_fn, std::uint32_t>` | 901 ns | 532 ns | 1.69 |
| `rc<alg_type::std_fn, std::uint64_t>` | 1876 ns | 1018 ns | 1.84 |
| `rc<alg_type::rng, std::uint8_t>` | 1344 ns | 349 ns | 3.85 |
| `rc<alg_type::rng, std::uint16_t>` | 2094 ns | 465 ns | 4.50 |
| `rc<alg_type::rng, std::uint32_t>` | 884 ns | 460 ns | 1.92 |
| `rc<alg_type::rng, std::uint64_t>` | 1884 ns | 1079 ns | 1.75 |
| `uc<alg_type::std_fn, std::uint8_t>` | 3329 ns | 263 ns | 12.66 🤡 |
| `uc<alg_type::std_fn, std::uint16_t>` | 1145 ns | 342 ns | 3.35 |
| `uc<alg_type::std_fn, std::uint32_t>` | 1144 ns | 388 ns | 2.95 |
| `uc<alg_type::std_fn, std::uint64_t>` | 1128 ns | 754 ns | 1.50 |
| `uc<alg_type::rng, std::uint8_t>` | 1111 ns | 252 ns | 4.41 |
| `uc<alg_type::rng, std::uint16_t>` | 1328 ns | 331 ns | 4.01 |
| `uc<alg_type::rng, std::uint32_t>` | 1313 ns | 386 ns | 3.40 |
| `uc<alg_type::rng, std::uint64_t>` | 1146 ns | 758 ns | 1.51 |

🥇 Results interpretation

Good improvement!

Not as good as for the non-`_copy` counterparts though, as `memcpy` takes some noticeable time.

The usual codegen gremlins that cause result variation are observed for the non-vectorized tight loops. I've marked the most notorious one with a clown. I can't explain that anomaly.


AlexGuteniev and others added 20 commits November 16, 2024 22:48

unique vectorization

cffb1e7

no point

a0b714d

Not unique problem

cccf693

Pointed out coverage

54781db

Deduplicate

fa4ff20

Less error prone, especially if implementing _copy someday

Mention unique shuffling requirement

407897e

whitespace

4ca596b

Merge branch 'main' into unique

54b2938

Include <type_traits> for conditional_t.

a8b1f3b

Direct-init vector instead of calling resize().

144163b

Drop std::.

a2bb838

fix types in pointer test

5ea3d5a

simplify unique pointer test

c53a430

<memory> is no longer used.

b789701

Drop repeated TD alias.

2f51ed2

Value-init ptr_val_array.

8eb10d7

When is_pointer_v<T>, dis(gen) returns int.

2e9f007

Mark _Unique_fallback as noexcept.

e35ed5f

Avoid abbreviated function templates.

8928190

Vectorize remove_copy and unique_copy

7a1de3d


StephanTLavavej added the performance Must go faster label Mar 23, 2025


AlexGuteniev marked this pull request as ready for review March 25, 2025 05:28

AlexGuteniev requested a review from a team as a code owner March 25, 2025 05:28

Merge remote-tracking branch 'upstream/main' into copycats

d37fde4

# Conflicts: # benchmarks/src/unique.cpp # stl/inc/algorithm # stl/src/vector_algorithms.cpp

AlexGuteniev force-pushed the copycats branch from 5f31ec5 to d37fde4 Compare March 25, 2025 05:53

StephanTLavavej self-assigned this Mar 25, 2025

simplify

4fd4a4c

StephanTLavavej requested changes Mar 25, 2025


StephanTLavavej removed their assignment Mar 25, 2025

Fix error

c1f5899

AlexGuteniev marked this pull request as draft March 25, 2025 20:23

short circuit

b9fd6a7

AlexGuteniev marked this pull request as ready for review March 26, 2025 06:09

AlexGuteniev requested a review from StephanTLavavej March 26, 2025 06:10

benchmark

ba3403b

StephanTLavavej self-assigned this Mar 26, 2025

AlexGuteniev added 2 commits March 26, 2025 08:42

merge error fix

5268f7e

Fix typo, consistently not using quotes

182a32c
