add a CUDA device code sanity check #4692
base: 5.0.x
Conversation
…_capabilities when CUDA is used
from easybuild.tools.systemtools import get_shared_lib_ext, pick_system_specific_value, use_group
from easybuild.tools.systemtools import check_linked_shared_libs, det_parallelism, get_cuda_device_code_architectures
from easybuild.tools.systemtools import get_linked_libs_raw, get_shared_lib_ext, pick_system_specific_value, use_group
from easybuild.tools.toolchain.toolchain import TOOLCHAIN_CAPABILITY_CUDA
'easybuild.tools.toolchain.toolchain.TOOLCHAIN_CAPABILITY_CUDA' imported but unused
It's great that you looked into this; we've also been discussing it in EESSI: https://gitlab.com/eessi/support/-/issues/92
@ocaisa thanks for the link, I'll take a look. Currently, the main things I still plan to add to this PR:
I think it's a good idea to check for device code and PTX (with lack of PTX for the highest compute capability being a warning). The availability of PTX will allow you to run the application on future architectures.
easybuild/tools/systemtools.py
Outdated
""" | ||
|
||
# cudaobjdump uses the sm_XY format | ||
device_code_regex = re.compile('(?<=arch = sm_)([0-9])([0-9]+a{0,1})') |
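As a quick illustration of what this pattern extracts (the sample strings below are hand-written, not real cuobjdump output):

```python
import re

# same pattern as above: group 1 is the major CC digit, group 2 the remainder,
# including the optional architecture-specific 'a' suffix (e.g. sm_90a)
device_code_regex = re.compile('(?<=arch = sm_)([0-9])([0-9]+a{0,1})')

print(device_code_regex.search('arch = sm_80').groups())   # ('8', '0')
print(device_code_regex.search('arch = sm_90a').groups())  # ('9', '0a')
```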
It would be good to also capture whether the code can be JIT-compiled (so it can at least run on a future arch). In a script I had, I did this with:

    # Regex to find multiple PTX and ELF sections
    ptx_matches = re.findall(r'Fatbin ptx code:\n=+\narch = sm_(\d+)', result.stdout)
    elf_matches = re.findall(r'Fatbin elf code:\n=+\narch = sm_(\d+)', result.stdout)

    # Debug: show if matches were found for PTX and ELF sections
    if debug:
        print(f"PTX Matches: {ptx_matches}")
        print(f"ELF Matches: {elf_matches}")

    # Return all PTX and ELF matches; remove duplicates using set and convert to sorted lists
    return {
        "ptx": sorted(set(ptx_matches)),  # list of unique PTX capabilities
        "elf": sorted(set(elf_matches)),  # list of unique ELF capabilities
    }
In fact, `re.compile('(?<=arch = sm_)([0-9])([0-9]+a{0,1})')` is not specific enough, because it will treat the `Fatbin ptx code` and `Fatbin elf code` sections the same: it'll just extract any `arch =` string it can find.
To have a concrete example of something that has both, one can check e.g. libcusparse:
[casparl@tcn1 ~]$ cuobjdump /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1/lib64/libcusparse.so | grep -A 5 ptx | tail -n 12
================
arch = sm_80
code version = [8,1]
host = linux
compile_size = 64bit
--
Fatbin ptx code:
================
arch = sm_90
code version = [8,1]
host = linux
compile_size = 64bit
[casparl@tcn1 ~]$ cuobjdump /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software/CUDA/12.1.1/lib64/libcusparse.so | grep -A 5 elf | tail -n 12
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit
--
Fatbin elf code:
================
arch = sm_90
code version = [1,7]
host = linux
compile_size = 64bit
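Putting the section-aware regexes from the comment above into a self-contained sketch (the helper name and the abridged sample output are mine, not from the PR):

```python
import re

def get_device_code_archs(cuobjdump_output):
    """Extract compute capabilities per section type from cuobjdump output.

    Distinguishes real device code (Fatbin elf) from JIT-compilable PTX
    (Fatbin ptx), which the single 'arch = sm_XY' regex cannot do.
    """
    ptx = re.findall(r'Fatbin ptx code:\n=+\narch = sm_(\d+)', cuobjdump_output)
    elf = re.findall(r'Fatbin elf code:\n=+\narch = sm_(\d+)', cuobjdump_output)
    # deduplicate and sort so the result is stable regardless of section order
    return {'ptx': sorted(set(ptx)), 'elf': sorted(set(elf))}

# abridged stand-in for real cuobjdump output, like the libcusparse dump above
sample = '\n'.join([
    'Fatbin elf code:',
    '================',
    'arch = sm_80',
    'Fatbin ptx code:',
    '================',
    'arch = sm_90',
])
print(get_device_code_archs(sample))  # {'ptx': ['90'], 'elf': ['80']}
```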
@@ -3900,6 +3955,14 @@ def xs2str(xs):
else:
    self.log.debug("Skipping RPATH sanity check")

if get_software_root('CUDA'):
@boegel We have an EESSI-specific complication here. We drop CUDA to a build-time dependency so that we don't depend on the CUDA module at runtime. This means that we won't execute this code path, so we need to trigger the module load here.
I think you're right, but just to double-check: are the build dependencies unloaded at sanity check time?
Could we fix this through an EasyBuild hook in EESSI that loads the CUDA that was a build dependency also in the sanity_check_step (and unloads it after)? Should also work for EESSI-extend, and no changes on the framework side needed...
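A rough sketch of what such a hook could look like (assuming EasyBuild's step-hook convention where `pre_sanitycheck_hook` receives the EasyBlock instance; the dependency-dict keys and the `modules_tool.load()` call are my assumptions about the API, not verified against the framework):

```python
def pre_sanitycheck_hook(self, *args, **kwargs):
    """EESSI-style hook: load the CUDA build dependency so the CUDA sanity check can run."""
    # assumed layout: builddependencies() yields dicts with 'name' and 'short_mod_name'
    cuda_deps = [dep for dep in self.cfg.builddependencies() if dep['name'] == 'CUDA']
    if cuda_deps:
        self.modules_tool.load([cuda_deps[0]['short_mod_name']])
```

A matching `post_sanitycheck_hook` could unload the module again, keeping the runtime environment free of the CUDA dependency.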
Wait... actually, I don't think you're right. I just tried this with EESSI-extend, and it did run the CUDA sanity check...? I'm not sure why; I would have expected the problem you mentioned. So... why didn't it appear?
I wonder what happens in the --module-only case, when the sanity check step is being run without building first? Perhaps this really is expected behaviour?
self.log.info("Using default subdirectories for binaries/libraries to verify CUDA device code: %s",
              cuda_dirs)
else:
    self.log.info("Using default subdirectories for binaries/libraries to verify CUDA device code: %s",
This info message seems wrong: these are not the default subdirectories, this is a custom-defined bin_lib_subdirs
FYI: I checked with @jfgrimm on chat; he probably has little time to work on this in the near future. Since this is a very valuable feature for EESSI that we'd like to have before we start building a large amount of GPU software, I'll try to work on it myself. Note that @jfgrimm was OK with me pushing to his branch, so I'll do that rather than create my own PR - at least we can have the full discussion in one place, namely here.
I tested this as follows:
This resulted in
And many more. That's great, it means this PR is actually doing what it should. Indeed, checking manually:
So, yeah...
Note that there are many executables in CUDA-Samples that were built for the correct CC. E.g.:
Collecting some todos:
Ignore list seems to work. Adding
to the EasyConfig for CUDA-Samples results in
and note that this binary does not get listed in the failure message. So that's the intended behavior: the warning is still printed, but it doesn't result in an error.
Just to test: putting all of these files in the ignore list, the installation of CUDA-Samples now passes.
This provides a nice starting point for further tests: I can easily remove one from the exclude list and check that I get the expected result.
So... the whole thing with checking PTX codes makes me rethink what EasyBuild should do when a certain
But what does that mean? What do we expect the nvcc compiler to do here? Say we were to compile a simple hello world, and I would do
i.e. would it only build device code for 80/90, and not include PTX? And build both through the lowest common virtual architecture? Or should it do
i.e. also include the PTX code for the
i.e. the stage one compilation is executed once for each CUDA compute capability, so that the generated
so that it actually includes not only the device codes for CC80 and CC90, but also the PTX code for CC90 (for forwards compatibility)?

Honestly, from a performance perspective, I think it would be best if EasyBuild would indeed use the generalized arguments. I.e. my proposal would be that if EasyBuild is configured with

Note that it may not always be possible to convince all build systems to actually do this - e.g. some codes might really only compile for a single CUDA compute capability, or the build system doesn't make this distinction between real and virtual architectures to build for. Eventually, the most robust and generic way to get this done might just be to implement

I'm creating a CUDA hello world EasyConfig that we can use to serve as an easy example of 1) how we think

I'm not sure what the best way forward is. If I include everything in this PR, it may be a bit heavy - although honestly, at the framework level it's just about defining the options; the real implementation would have to be done in EasyBlocks and EasyConfigs that use this information... My plan is to include the options in this PR, and make an accompanying PR for my CUDA hello world that uses these options in the way described above. The rest is then up to anyone updating or creating new EasyBlocks/EasyConfigs that somehow use information on the CUDA compute capability.
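To make the real-vs-virtual distinction concrete, here is a sketch of how such nvcc flags could be generated mechanically from a configured list of compute capabilities (the helper name is made up; the `-gencode arch=compute_XY,code=sm_XY` / `code=compute_XY` syntax is standard nvcc usage, where `code=sm_XY` emits real device code and `code=compute_XY` embeds PTX):

```python
def nvcc_gencode_flags(compute_capabilities, ptx_for_highest=True):
    """Build nvcc -gencode flags: real device code (code=sm_XY) for every
    requested compute capability, plus PTX (code=compute_XY) for the highest
    one, so the binary can still be JIT-compiled on future architectures."""
    # '8.0' -> '80'; sort numerically so the highest CC ends up last
    ccs = sorted((cc.replace('.', '') for cc in compute_capabilities), key=int)
    flags = ['-gencode=arch=compute_%s,code=sm_%s' % (cc, cc) for cc in ccs]
    if ptx_for_highest and ccs:
        flags.append('-gencode=arch=compute_%s,code=compute_%s' % (ccs[-1], ccs[-1]))
    return ' '.join(flags)

print(nvcc_gencode_flags(['8.0', '9.0']))
```

For `['8.0', '9.0']` this yields device code for both architectures plus PTX for CC90 only, which is the "generalized arguments" behavior proposed above.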
Ok, change of plans. After thinking it over, this would be a massive scope creep that would delay the sanity check part that we primarily care about in this PR. Instead, in this PR, I'll focus on just that: a sanity check for the CUDA device codes. We can assume that everyone using EasyBuild expects this to be the meaning of the

I will retain the code that prints a warning for the PTX code not matching the highest architecture. Or maybe demote it to an info message. In any case, it's convenient for future reference if EasyBuild extracts this information.

I will not implement a strict option for the PTX code sanity check in this PR. It does not make sense to sanity-check behavior that we haven't clearly defined, i.e. there is no clear definition of what PTX code is expected to be included when someone sets
Everything not sanity-check related is now described in this issue, which can be used to create one or more follow-up PRs.
…nity check on surplus CUDA archs if this option is set. Otherwise, print warning
Tested by adding
To
my build succeeds whereas with
It fails with:
as intended. Only thing left to do for this PR is tests. Not my strong suit, to be honest, but let's see. I guess the tricky thing here is that a true test requires a real CUDA binary, and I'm not sure that's even feasible... To build one, I'd need a CUDA module in the test environment - I'm not sure we have that. I could try to find a CUDA binary that we could just install (maybe just include a hello-world type of CUDA binary) and test with that... Maybe that's the most feasible option. But I have no clue if we can reasonably include binaries in the repo under the test directory. I have an 800KB hello world binary; that shouldn't be too crazy, I guess.
What you can do is create a mock cuobjdump
Damn, you're good. It took me 25 more minutes of looking at other examples to figure out that even if I could ingest a binary, I'd lack the
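A minimal sketch of that mock approach (the paths and canned output are illustrative, not the PR's actual test code): write a tiny shell script named `cuobjdump` that prints fixed output, make it executable, and let the test pick it up instead of the real tool.

```python
import os
import stat
import subprocess
import tempfile

# canned cuobjdump output claiming the inspected file contains sm_80 device code
cuobjdump_txt_sm80 = '\n'.join([
    'Fatbin elf code:',
    '================',
    'arch = sm_80',
])

tmpdir = tempfile.mkdtemp()
mock_cuobjdump = os.path.join(tmpdir, 'cuobjdump')
with open(mock_cuobjdump, 'w') as fh:
    # a tiny shell script that ignores its arguments and prints the canned output
    fh.write('#!/bin/sh\ncat <<EOF\n%s\nEOF\n' % cuobjdump_txt_sm80)
os.chmod(mock_cuobjdump, os.stat(mock_cuobjdump).st_mode | stat.S_IXUSR)

# in a real test, tmpdir would be prepended to $PATH so the framework finds the mock
out = subprocess.run([mock_cuobjdump, '/some/binary'], capture_output=True, text=True).stdout
print('arch = sm_80' in out)  # True
```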
test_report_regexs=[regex])
blank line contains whitespace
regex += "device code architectures match those in cuda_compute_capabilities"
self.test_toy_build(extra_args=args, test_report=test_report_fp, raise_error=True,
                    test_report_regexs=[regex])
blank line contains whitespace
# Test single CUDA compute capability with --cuda-compute-capabilities=8.0 |
too many blank lines (6)
write_file(cuobjdump_file, cuobjdump_txt_sm80, append=True)
adjust_permissions(cuobjdump_file, stat.S_IXUSR, add=True)  # Make sure our mock cuobjdump is executable
args = ['--cuda-compute-capabilities=8.0']
test_report_fp = os.path.join(self.test_buildpath, 'full_test_report.md')
local variable 'test_report_fp' is assigned to but never used
])

# Section for cuobjdump printing output for sm_90 PTX code
cuobjdump_txt_sm90_ptx = '\n'.join([
local variable 'cuobjdump_txt_sm90_ptx' is assigned to but never used
])

# Section for cuobjdump printing output for sm_80 PTX code
cuobjdump_txt_sm80_ptx = '\n'.join([
local variable 'cuobjdump_txt_sm80_ptx' is assigned to but never used
])

# Section for cuobjdump printing output for sm_90 architecture
cuobjdump_txt_sm90 = '\n'.join([
local variable 'cuobjdump_txt_sm90' is assigned to but never used
At the moment, we do no checking that the CUDA compute capabilities that EasyBuild is configured to use are actually present in the resulting binaries/libraries.
WIP PR to introduce an extra sanity check when CUDA is present, to check for mismatches between `cuda_compute_capabilities` and what `cuobjdump` reports.