Usefulness of supports_atomics in its current form #594

Open
christiangnrd opened this issue May 2, 2025 · 7 comments

Comments

@christiangnrd (Member)

At the moment, supports_atomics returns a boolean, but different backends have different levels of support. For example, Metal essentially only supports 32-bit integers and floats, with 64-bit integer atomics limited to min and max.

Would it be useful/worth it to add granularity to the function?

This came up while fixing the histogram example for oneAPI and Metal, which don't support 64-bit atomics.

@vchuravy (Member)

vchuravy commented May 3, 2025

👍 these functions are post-hoc additions intended to allow these queries without breaking compatibility.

So we could add a more fine-grained interface that uses types as well.

@christiangnrd (Member, Author)

What should the API look like? Essentially extending the supports_atomics signature to supports_atomics(::Backend, ::DataType)?
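A minimal sketch of what that signature extension could look like, with placeholder backend types standing in for the real KernelAbstractions ones (none of this is the actual KA implementation, and the Metal type set is only illustrative):

```julia
# Hypothetical sketch: extending supports_atomics with an element-type argument.
abstract type Backend end
struct MetalBackend <: Backend end
struct CUDABackend <: Backend end

# Current coarse query: a single Bool per backend.
supports_atomics(::Backend) = true

# Finer-grained query: per element type, defaulting to the coarse answer.
supports_atomics(b::Backend, ::Type) = supports_atomics(b)

# Metal (illustrative): full support only for 32-bit Int/UInt/Float.
supports_atomics(::MetalBackend, ::Type{T}) where {T} =
    T in (Int32, UInt32, Float32)

supports_atomics(MetalBackend(), Float32)  # true
supports_atomics(MetalBackend(), Int64)    # false
supports_atomics(CUDABackend(), Int64)     # true (falls back to the coarse query)
```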

What about partial support? For example, Metal has some 64-bit atomic support, but only for ulong and only with the min and max modify operations. Should the operation also be part of the support check? Also, Metal only supports weak guarantees. From my (noob to atomics) understanding, that means the operations can sometimes fail spuriously? Should that factor into it? I believe @anicusan had to use a different algorithm in the AK.jl accumulate! because of the lack of strong memory guarantees.

This is kind of just me dumping my thoughts while I try to better understand atomics and how they relate to GPU programming/KA.

@vchuravy (Member)

vchuravy commented May 5, 2025

Yeah, I think granularity by DataType ought to suffice, and indeed I would say that Metal does not support 64-bit atomic semantics.

We could, of course, make it much more fine-grained with supports_atomics(backend, type, op), but that starts to become much more cumbersome to work with.
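For illustration, the op-granular variant could look like this (a hypothetical sketch; MetalBackend and the supported-type sets are placeholders, not real KA definitions):

```julia
# Hypothetical sketch: op-granular atomics query.
abstract type Backend end
struct MetalBackend <: Backend end

# Default: claim support for any (type, op) combination.
supports_atomics(::Backend, ::Type, op) = true

# Metal (illustrative): 32-bit types fully; 64-bit unsigned only for min/max.
supports_atomics(::MetalBackend, ::Type{T}, op) where {T} =
    T in (Int32, UInt32, Float32) || (T === UInt64 && op in (min, max))

supports_atomics(MetalBackend(), UInt64, min)  # true
supports_atomics(MetalBackend(), UInt64, +)    # false
```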

As for weak vs. non-weak: that ought to matter only for compare-and-swap? And the way you generally support this is with a retry loop. IIRC Atomix provides that fallback loop?

Now if that loop matches the progress requirement of the GPU is a different story.
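The fallback loop looks like this with plain Julia per-field atomics on the CPU (a sketch of the pattern, not Atomix's actual implementation; Atomix.jl provides the equivalent for array elements):

```julia
# Sketch of a CAS retry loop: apply an arbitrary `op` atomically by
# retrying the compare-and-swap until no other thread raced us.
mutable struct Cell
    @atomic x::Float64
end

function atomic_modify!(c::Cell, op, v)
    old = @atomic c.x
    while true
        new = op(old, v)
        # @atomicreplace returns (old = current_value, success::Bool);
        # on failure, `old` is refreshed and we retry with the new value.
        old, success = @atomicreplace c.x old => new
        success && return new
    end
end

c = Cell(0.0)
atomic_modify!(c, +, 2.5)    # 2.5
atomic_modify!(c, max, 1.0)  # 2.5
```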

@christiangnrd (Member, Author)

I opened #595 for more concrete feedback. I feel like that may be all that's needed on the KA side.

@anicusan (Member)

anicusan commented May 5, 2025

What is the exact use case for this? Querying whether a backend supports a datatype - be it for atomics or normal use - means you're writing type-generic code anyway. If you're exposing granular capabilities from KA, a downstream library developer would then have to write checks in case a user wanted to use an unsupported data type - and in the end just throw an error. Is there anything else to do in such a case?

Why not just have a pass at the KA level that throws a useful error for unsupported operations? E.g. right now, if you wanted to use a Float64 on Metal, you'd get a `resulted in invalid IR` error; if you used an `@atomic` on an Int64, you'd get something like `NSError: Compiler encountered an internal error (AGXMetalG15X_M1...`.

I suppose query-able capabilities only make sense if you wanted different codepaths depending on their result. We had to do that in AcceleratedKernels for Metal accumulate and Intel any, but they were rather niche bugs that I wouldn't even know how to query for (is there a definition that's correct over all backends for "strong" memory guarantees? Or not allowing simultaneous writes to the same memory location?) - and in the end, a simple keyword-argument switch to a different algorithm was sufficient.
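The kwarg switch mentioned above is simple enough to sketch (hypothetical names; both paths are stand-ins rather than real AK algorithms):

```julia
# Hypothetical sketch of a keyword-argument algorithm switch.
scan_default(op, xs) = accumulate(op, xs)  # stand-in for the fast path
scan_safe(op, xs)    = accumulate(op, xs)  # stand-in for the workaround path

# Callers opt into the workaround on backends with quirky guarantees.
scan(op, xs; alg::Symbol = :default) =
    alg === :safe ? scan_safe(op, xs) : scan_default(op, xs)

scan(+, [1, 2, 3])               # [1, 3, 6]
scan(+, [1, 2, 3]; alg = :safe)  # [1, 3, 6]
```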

@christiangnrd (Member, Author)

christiangnrd commented May 5, 2025

Thanks for weighing in @anicusan. I was hoping you would, as you have practical experience using atomics in KA code. At the moment, it was an attempt to make the role of supports_atomics somewhat useful in preparation for my GSoC project (and before the next breaking release). I think it could potentially be helpful in tests, to decide which tests to run on which backends.

However, after reading your comment, I'm doubting its general usefulness. For example, Metal has some support for atomics but not everything, we forgot to update its definition after support was added, and no one has complained yet, so maybe we should get rid of it and implement one of your suggestions.

Maybe all this needs is good documentation outlining each backend's atomics support? However, I can see that becoming stale quite quickly, so maybe it's not the best suggestion.

@anicusan (Member)

anicusan commented May 5, 2025

Thanks for @-ing me! In my mind there are two layers to expose from KA:

  1. Backend capabilities for algorithm writing.
  2. Device settings relevant to algorithm running.

For 1., I think the differences between platforms are too subtle - and often not specified well enough by the official docs themselves - to be generally useful; if we consider the 5 fundamental algorithms in AK (mapreduce, accumulate, sort, any, foreachindex) across 5 backends, only 2 of those 25 combinations had quirks that needed workarounds (8%), and even those were in extremely specific implementation details. I'm personally fine with finding them out as they come, and the workaround (via a kwarg) is simple enough that I don't think it's necessary to expose more KA surface area to maintain there. A helpful error message and/or some docs would be good, though they can also be in a "debugging" fashion rather than capability promises (e.g. AcceleratedKernels troubleshooting).

I think 2. would be much more useful, and probably low-hanging fruit - for example, it would be great to query the maximum group size of a given device (many algorithms scale with the number of blocks to process, so it'd be good to just use the maximum, but we don't have that exposed - so most algorithms in AK use block_size=256 as a conservative value that works even on old Intel oneAPI devices). Others would be the warp size or, in a backend-agnostic manner, the number of threads that are guaranteed to execute in sync (on CPUs that would be 1; on CUDA, 32). Other device-relevant constants - though of less importance in what I've been writing - would be the maximum shared memory per block and the actual number of processors available.
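As a sketch, such a query layer could look like the following (every function name and value here is hypothetical; none of this exists in KA today, and the numbers are illustrative rather than real device queries):

```julia
# Hypothetical sketch of a backend-agnostic device-query layer.
abstract type Backend end
struct CPUBackend <: Backend end
struct CUDABackend <: Backend end

max_groupsize(::CPUBackend)  = 1024  # illustrative, not a real query
max_groupsize(::CUDABackend) = 1024
sync_width(::CPUBackend)  = 1        # threads guaranteed to execute in sync
sync_width(::CUDABackend) = 32       # CUDA warp size

# A library could then derive its block size instead of hardcoding 256:
choose_blocksize(b::Backend) = min(1024, max_groupsize(b))
```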

Also, deciding which terminology to use would be good for consistency, between all the block/group/thread/item/etc. variants.
