Add a donotdelete builtin #44036
This seems to be related to @vchuravy's JuliaCI/BenchmarkTools.jl#92 which included Can we have a more intuitive name like google/benchmark's
Can we use |
Yes, but that has the same optimizability challenges, and perhaps even more. I thought the volatile store might at least have some chance of not interfering with loop vectorization. |
@preames any thoughts on this? |
It's intended to be consistent with |
base/docs/basedocs.jl
Outdated
actually computed. (Otherwise DCE may see that the result of the benchmark is
unused and delete the entire benchmark code).

**Note**: `dcebarrier` does not affect constant foloding. For example, in
**Note**: `dcebarrier` does not affect constant foloding. For example, in
**Note**: `dcebarrier` does not affect constant folding. For example, in
There's some prior art on this type of thing in Java with JMH's Blackhole.consume. Naming wise, I would find something along those lines better than dcebarrier. As can already be seen in the discussion above, use of the word "barrier" gives the impression that the call has memory effects, whereas that seems not to be the intent per the draft wording.

Implementation wise, I would start by lowering to an external function call marked "inaccessiblememonly nounwind willreturn" at the LLVM level. This would have some cost (the actual call sequence) but should have minimal impact on optimization. I would be leery of the volatile store to alloca lowering. Volatiles are generally not touched, but there is precedent for removing them if the location being touched is well understood. An alloca seems like an entirely reasonable location for the compiler to assume is not memory-mapped IO.

Once implemented with the external call, we could choose to add an LLVM intrinsic with the same meaning. I think this is a broadly reusable concept, and probably wouldn't be too hard to get upstream.
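One way to check which lowering a given Julia build actually uses is to print the optimized LLVM IR for a tiny wrapper. This is only a sketch: it assumes Julia 1.8 or later, where the builtin is reachable as `Base.donotdelete`, and `keep` is a hypothetical helper name. The exact form of the call in the IR depends on the Julia and LLVM versions, so no particular output is promised here.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical one-line wrapper, used only so there is something to inspect.
keep(x) = Base.donotdelete(x)

# Capture the optimized LLVM IR for keep(::Int) as a string and print it.
ir = sprint(io -> code_llvm(io, keep, (Int,)))
println(ir)
```

Reading the printed IR shows whether the argument survives as an operand of some call or store, which is the property the comment above cares about.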
Upon discussion with @JeffBezanson and @vtjnash, they preferred a name that did not require a graduate course on the black hole information paradox in order to build the correct intuition about whether the optimizer is allowed to delete the value. We ultimately settled on |
Alright, I guess, we should merge the BenchmarkTools version first, then do a nanosoldier run here to see what the effect is (we expect regressions because it's a change in what's being benchmarked), just so we have a baseline. |
The new BenchmarkTools version also has to get deployed explicitly on Nanosoldier. |
I've tagged BenchmarkTools 1.3 and according to @vtjnash, Nanosoldier will pick up the latest registered version, so we'll wait for that to go through. I'll rebase this in the meantime, since it's accumulated conflicts. |
In #43852 we noticed that the compiler is getting good enough to completely DCE a number of our benchmarks. We need to add some sort of mechanism to prevent the compiler from doing so. This adds just such an intrinsic. The intrinsic itself doesn't do anything, but it is considered effectful by our optimizer, preventing it from being DCE'd. At the LLVM level, it turns into a volatile store to an alloca (or an llvm.sideeffect if the values passed to the `dcebarrier` do not have any actual LLVM-level representation).

The docs for the new intrinsic are as follows:

```
    dcebarrier(args...)

This function prevents dead-code elimination (DCE) of itself and any arguments
passed to it, but is otherwise the lightest barrier possible. In particular,
it is not a GC safepoint, does not model an observable heap effect, does not expand
to any code itself and may be re-ordered with respect to other side effects
(though the total number of executions may not change).

A useful model for this function is that it hashes all memory `reachable` from
args and escapes this information through some observable side-channel that does
not otherwise impact program behavior. Of course that's just a model. The
function does nothing and returns `nothing`.

This is intended for use in benchmarks that want to guarantee that `args` are
actually computed. (Otherwise DCE may see that the result of the benchmark is
unused and delete the entire benchmark code).

**Note**: `dcebarrier` does not affect constant folding. For example, in
`dcebarrier(1+1)`, no add instruction needs to be executed at runtime and
the code is semantically equivalent to `dcebarrier(2)`.

# Examples

function loop()
    for i = 1:1000
        # The compiler must guarantee that there are 1000 program points (in the correct
        # order) at which the value of `i` is in a register, but has otherwise
        # total control over the program.
        dcebarrier(i)
    end
end
```

I believe the volatile store at the LLVM level is actually somewhat stronger than what we want here. Ideally the `dcebarrier` would not end up generating any machine code at all and would also be compatible with optimizations like SROA and vectorization. However, I think this is fine for now.
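The documented behaviour can be tried out with a short sketch. It assumes Julia 1.8 or later, where the builtin is reachable as `Base.donotdelete` (the name in this PR's title; `dcebarrier` was the draft name). The helper `loop` mirrors the docstring example but takes the trip count as a parameter, which is an addition made here for illustration.

```julia
# Assumes Julia 1.8+, where the builtin is reachable as `Base.donotdelete`.

# The call accepts any number of arguments and always returns `nothing`.
r1 = Base.donotdelete(1 + 1)             # constant folding may still reduce this to donotdelete(2)
r2 = Base.donotdelete("a", [1, 2], 3.0)  # several arguments of different types

# Hypothetical variant of the docstring example: the value of `i` must be
# available at each call, but the compiler may otherwise optimize freely.
function loop(n)
    for i = 1:n
        Base.donotdelete(i)
    end
    return nothing
end

loop(1000)
```

Because the call returns `nothing`, it is used purely for its effect on the optimizer and never for its value.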
@nanosoldier |
Something went wrong when running your job:
Unfortunately, the logs could not be uploaded. |
Not too bad. None get faster (of course), but only a handful got badly affected: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/95a9e7f_vs_60f414e/report.md |
Yep, pretty much as expected. The benchmarks that got affected are the scalar ones that are essentially trivial, so it's likely that LLVM had been deleting them. Looks like this is working. Excellent. |
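The effect on trivial scalar benchmarks can be reproduced with a rough sketch like the one below. The kernel names `discarded` and `kept` are hypothetical, the sketch assumes Julia 1.8 or later for `Base.donotdelete`, and the timings depend on the machine and compiler version, so none are quoted here.

```julia
# Hypothetical micro-benchmark kernels; assumes Julia 1.8+ for Base.donotdelete.

# The result is never used, so the optimizer may delete the loop body entirely.
function discarded(n)
    for i in 1:n
        sqrt(Float64(i))
    end
    return nothing
end

# Each result is passed to donotdelete, so it has to be computed.
function kept(n)
    for i in 1:n
        Base.donotdelete(sqrt(Float64(i)))
    end
    return nothing
end

# Warm up first so that compilation time is not included in the measurement.
discarded(10); kept(10)

t_discarded = @elapsed discarded(10^7)
t_kept = @elapsed kept(10^7)
println("discarded: ", t_discarded, " s, kept: ", t_kept, " s")
```

If the two timings differ substantially, the difference is work that the unguarded version was allowed to skip, which is the kind of regression the Nanosoldier report is expected to show.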
Is it possible that this PR broke Windows CI? |
So it did. Looks like the new test failed. Will fix. |
FnAttrs.addAttribute(C, Attribute::InaccessibleMemOnly);
FnAttrs.addAttribute(C, Attribute::WillReturn);
FnAttrs.addAttribute(C, Attribute::NoUnwind);
This is not an AttrBuilder. These calls do not have any effect and will be deleted.
/Users/jameson/julia1/src/codegen.cpp:479:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
FnAttrs.addAttribute(C, Attribute::InaccessibleMemOnly);
^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/jameson/julia1/src/codegen.cpp:480:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
FnAttrs.addAttribute(C, Attribute::WillReturn);
^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~
/Users/jameson/julia1/src/codegen.cpp:481:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
FnAttrs.addAttribute(C, Attribute::NoUnwind);
^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
fixed by #44097
backport? |
In JuliaLang#43852 we noticed that the compiler is getting good enough to completely DCE a number of our benchmarks. We need to add some sort of mechanism to prevent the compiler from doing so. This adds just such an intrinsic. The intrinsic itself doesn't do anything, but it is considered effectful by our optimizer, preventing it from being DCE'd. At the LLVM level, it turns into a call to an external varargs function.

The docs for the new intrinsic are as follows:

```
    donotdelete(args...)

This function prevents dead-code elimination (DCE) of itself and any arguments
passed to it, but is otherwise the lightest barrier possible. In particular,
it is not a GC safepoint, does not model an observable heap effect, does not expand
to any code itself and may be re-ordered with respect to other side effects
(though the total number of executions may not change).

A useful model for this function is that it hashes all memory `reachable` from
args and escapes this information through some observable side-channel that does
not otherwise impact program behavior. Of course that's just a model. The
function does nothing and returns `nothing`.

This is intended for use in benchmarks that want to guarantee that `args` are
actually computed. (Otherwise DCE may see that the result of the benchmark is
unused and delete the entire benchmark code).

**Note**: `donotdelete` does not affect constant folding. For example, in
`donotdelete(1+1)`, no add instruction needs to be executed at runtime and
the code is semantically equivalent to `donotdelete(2)`.

# Examples

function loop()
    for i = 1:1000
        # The compiler must guarantee that there are 1000 program points (in the correct
        # order) at which the value of `i` is in a register, but has otherwise
        # total control over the program.
        donotdelete(i)
    end
end
```