Workaround for bad GPUDirect performance with unaligned GPU buffers #1143
Opening this as a draft for reference. I think we should wait for responses from both the Umpire developers (LLNL/Umpire#881) and HPE before deciding whether to apply a workaround and, if so, which one.

This typically, but not always, gives reasonable performance after only one warmup iteration, and the warmup iteration isn't ridiculously slow compared to the best case. However, this always allocates at least 2 MiB per allocation from Umpire and can end up wasting quite a lot of memory for small tiles.

As an example, the `gen_to_std` miniapp can look like this on current master:

and most of the time looks like this on this PR:

The best case doesn't improve, but the worst case and the variance improve significantly.
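To give a rough feel for the memory overhead mentioned above, here is a minimal sketch of the arithmetic. It is not the code in this PR: it only assumes that every request smaller than 2 MiB is padded up to 2 MiB, and the helper names (`padded_size`, `wasted_bytes`, `tile_bytes`) and the tile dimensions are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Minimum allocation size stated in the description: 2 MiB.
constexpr std::size_t min_alloc_bytes = 2 * 1024 * 1024;

// Hypothetical helper: size actually requested from the pool, assuming
// requests below the minimum are simply padded up to it.
constexpr std::size_t padded_size(std::size_t requested_bytes) {
  return std::max(requested_bytes, min_alloc_bytes);
}

// Hypothetical helper: bytes allocated but never used by the tile.
constexpr std::size_t wasted_bytes(std::size_t requested_bytes) {
  return padded_size(requested_bytes) - requested_bytes;
}

// Hypothetical helper: storage needed by a rows x cols tile of doubles.
constexpr std::size_t tile_bytes(std::size_t rows, std::size_t cols) {
  return rows * cols * sizeof(double);
}
```

Under these assumptions, and with 8-byte doubles, a 256 x 256 tile needs 512 KiB, so `wasted_bytes(tile_bytes(256, 256))` is 1.5 MiB, i.e. 75% of the 2 MiB allocation goes unused. Tiles of 2 MiB or more waste nothing in this model.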