Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<bit>: Could has_single_bit() be faster? #5359

Open
pps83 opened this issue Mar 25, 2025 · 6 comments
Open

<bit>: Could has_single_bit() be faster? #5359

pps83 opened this issue Mar 25, 2025 · 6 comments
Labels
performance Must go faster

Comments

@pps83
Copy link
Contributor

pps83 commented Mar 25, 2025

Seems like it should convert to popcount(x) == 1 if it's available on the target arch, but looking at the code, doesn't seem like compiler would do it:

_NODISCARD constexpr bool has_single_bit(const _Ty _Val) noexcept {
    return _Val != 0 && (_Val & (_Val - 1)) == 0;
}

return _Val != 0 && (_Val & (_Val - 1)) == 0;

@AlexGuteniev
Copy link
Contributor

I'm not sure.

For the current implementation I'd expect something as following (assuming x64 arch, as it is primary optimization target):

or   reg1, reg1
jz   label1
lea  reg2, [reg1-1]
test reg1, reg2
label1:

And for popcount as following;

popcnt reg2, reg1
cmp    reg2, 1

Which of these is better would highly depend on the context. I'd expect the current branchy function to win for predictable input, and the popcnt branchless function towin for unpredictable input.

In sight of this uncertainty, I would avoid making any changes and introducing unnecessary complexity.

@AlexGuteniev
Copy link
Contributor

Oh, it also seems CPU vendor specific. AMDs have cheaper popcnt, could be the winner most of the time there.

@Alcaro
Copy link
Contributor

Alcaro commented Mar 26, 2025

According to GCC codegen, popcnt is the winner if available, yes.

If not, the best answer is (v ^ (v-1)) > v-1.

https://godbolt.org/z/d61bxW4r1

@StephanTLavavej StephanTLavavej changed the title Not optimal std::has_single_bit impl <bit>: Could has_single_bit() be faster? Mar 26, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 26, 2025
@StephanTLavavej
Copy link
Member

Ideally, our compilers would do the right thing here, so we may need to report optimization bug(s) to C1XX and/or Clang.

@pps83
Copy link
Contributor Author

pps83 commented Mar 27, 2025

According to GCC codegen, popcnt is the winner if available, yes.

Yes, seems like gcc/clang use popcount for cpus starting from nehalem: https://godbolt.org/z/s4MnvcfW7
Nehalem was released in 2008.

If not, the best answer is (v ^ (v-1)) > v-1.

https://godbolt.org/z/d61bxW4r1

this one seems to be better for msvc and others for generic cpu. Looks like gcc/clang use this one as the codegen matches std::has_single_bit for x64/arm cpus

@pps83
Copy link
Contributor Author

pps83 commented Mar 27, 2025

I just edited that file in github.dev and made that PR directly from the browser. Crazy, full blown code editor in browser :)

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

4 participants