Currently, blocks with `blockIdx.z == 0` set the output to zero here. However, since there is no synchronization between blocks, is it possible that blocks with a different `blockIdx.z` complete their calculations and update the output before it has been zeroed?
Conventional wisdom would say yes, it's possible for blocks to launch in any order. In practice (and I tested this a lot) I've never seen block (x, y, 0) launch after (x, y, 1). There's some discussion of it here and some tests of the correlation between launch time and blockIdx, which turns out to be very strong.
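For anyone who wants to reproduce that kind of launch-order test, here is a minimal sketch of one way to do it. It is not the test linked above: `record_launch_order` and `probe_launch_order` are hypothetical names, and the ticket only records the order in which blocks reached the `atomicAdd`, which is a proxy for launch order rather than a direct measurement.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block takes a ticket from a global counter as soon as it starts running.
// A lower ticket means the block reached this point earlier.
__global__ void record_launch_order(unsigned int* counter, unsigned int* ticket)
{
    if (threadIdx.x == 0) {
        unsigned int linear =
            blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z);
        ticket[linear] = atomicAdd(counter, 1u);
    }
}

// Hypothetical helper: returns one ticket per block, indexed by linear block index
// (x fastest, then y, then z).
std::vector<unsigned int> probe_launch_order(dim3 grid)
{
    size_t n = (size_t)grid.x * grid.y * grid.z;
    unsigned int* d_counter = nullptr;
    unsigned int* d_ticket = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMalloc(&d_ticket, n * sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    record_launch_order<<<grid, 32>>>(d_counter, d_ticket);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> h_ticket(n);
    cudaMemcpy(h_ticket.data(), d_ticket, n * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_ticket);
    return h_ticket;
}
```

Comparing the ticket of block (x, y, 0) with that of block (x, y, 1) across many runs and grid sizes shows how often the expected order holds on a given GPU and driver. A clean result is still only an observation, not a guarantee.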
Of course, relying on it is still a little hacky since there are no actual guarantees from NVIDIA. I definitely want to find a better solution, and I'm trying to rework the kernel to do FP32 accumulation anyway, which may make it a non-issue soon.
Actually, the concern is that modern GPUs have enough SMs for blocks (x, y, 0) and (x, y, 1) to be launched concurrently, so it is hard to guarantee that block (x, y, 0) finishes the zero-fill before block (x, y, 1) starts updating the output.
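One way to sidestep the ordering question entirely is to zero the output on the same stream before the kernel launch and let every `blockIdx.z` slice accumulate atomically. Stream ordering guarantees that the memset completes before any block of the kernel runs. The sketch below is a toy FP32 column sum, not this project's kernel; `splitk_accumulate` and `splitk_sum` are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Toy split-K reduction: out[i] = sum over k of in[k * n + i].
// Each blockIdx.z handles one contiguous slice of k and adds its partial sum.
// No block zeroes the output, so the result does not depend on block order.
__global__ void splitk_accumulate(const float* in, float* out,
                                  int n, int k, int k_per_slice)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int k_begin = blockIdx.z * k_per_slice;
    int k_end = min(k_begin + k_per_slice, k);

    float partial = 0.0f;
    for (int kk = k_begin; kk < k_end; ++kk)
        partial += in[kk * n + i];

    atomicAdd(&out[i], partial);
}

// Hypothetical host wrapper: h_in holds k rows of n floats, row-major.
std::vector<float> splitk_sum(const std::vector<float>& h_in,
                              int n, int k, int slices)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, (size_t)n * k * sizeof(float));
    cudaMalloc(&d_out, (size_t)n * sizeof(float));
    cudaMemcpyAsync(d_in, h_in.data(), (size_t)n * k * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // Zero-fill is ordered before the kernel because both use the same stream.
    cudaMemsetAsync(d_out, 0, (size_t)n * sizeof(float), stream);

    int k_per_slice = (k + slices - 1) / slices;
    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x, 1, slices);
    splitk_accumulate<<<grid, block, 0, stream>>>(d_in, d_out, n, k, k_per_slice);

    std::vector<float> h_out(n);
    cudaMemcpyAsync(h_out.data(), d_out, (size_t)n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return h_out;
}
```

The trade-off is an extra pass over the output buffer for the memset, which may or may not matter depending on the output size relative to the rest of the kernel's work.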