intended behavior for >=1.17.4 for distributed training workloads with CVE fixes? #934
Comments
The change in #906 implements a hook that only creates a file in the container's file system under …
got it, makes sense. re: the behavior in general -- should users be explicit about using the cuda compat libs, or is the forward compat hook from nvctk a long-term contract/solution? I guess what you mentioned is also the reason CDI was not affected?
I would say that it is not scalable for users to manage the CUDA forward compat libraries in the context of portable container images. Some mechanism to ensure that forward compatibility works in a container must be included in the NVIDIA Container Toolkit. We are aware that our changes from #877 broke this contract, and the decision to do this was not taken lightly. As a matter of interest, would you be able to test the changes from #906 in your infrastructure?
yep, I'm coordinating with folks to make sure we can test this as soon as it merges. It will take a little legwork on our side. I don't think we do source builds for nvctk right now, but I can check if that's workable pre-merge; otherwise happy to test when you folks finalize everything on main.
potentially dumb question -- I only found out about this behavior when it broke us. was mounting the compat libs unconditional in the past, or conditional on the host driver version like the new PR? a scenario where your application is built against CUDA 12.5, and includes the compat libs in the container, but runs on a mix of CUDA 12.4 and 12.6 hosts (with corresponding host driver + libcuda version), would never have worked with the behavior I described, because you either force 12.5 everywhere and end up with a lower libcuda than the host driver where 12.5 < 12.6, or you take libcuda from the host and then hit a mismatch resulting in the PTX error I shared in distributed scenarios. maybe the real question is whether we should be using PTX in a way that results in this behavior across nodes (semi rhetorical but also unsure?). I realize that might be out of scope for container stack folks to answer :)
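for reference, a minimal sketch (my own ctypes check, not anything from nvctk) that prints the driver API version reported by whatever libcuda.so.1 the loader resolves -- comparing this across nodes, and between host and container, is how the mismatch shows up:

```python
import ctypes

# Query the driver API version through whatever libcuda.so.1 the dynamic
# loader resolves for this process -- the same resolution the application's
# kernels will see.
libcuda = ctypes.CDLL("libcuda.so.1")

if libcuda.cuInit(0) != 0:
    raise RuntimeError("cuInit failed")

version = ctypes.c_int(0)
if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
    raise RuntimeError("cuDriverGetVersion failed")

# cuDriverGetVersion encodes the version as 1000 * major + 10 * minor.
print(f"driver API version: {version.value // 1000}.{(version.value % 1000) // 10}")
```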
just to be thorough -- it seems like if I swap some code paths and lean on cudnn or jax at the application layer rather than custom cuda or cutlass kernels, I don't see the same issue -- which makes me feel like there's some extra user error in the fact that we depended on this working at all. but there could be some confounding factor I'm not immediately seeing.
I/my team were impacted by the changes from #877 after upgrading to 1.17.4 while maintaining security SLAs for https://nvd.nist.gov/vuln/detail/CVE-2025-23359.
However, upgrading to 1.17.4 without using the opt-out feature flag breaks distributed training workloads that rely on PTX compatibility across nodes with different driver versions, where the cuda compat libs are baked into the container.
I cannot currently find any method using 1.17.4 to support these workloads in multi-node environments where host drivers may differ. This was possible prior to 1.17.4, or with the opt-out feature flag, as all containers across all hosts would see identical versions of libcuda. With the latest changes, no apparent amount of `LD_LIBRARY_PATH` manipulation or build-time `ldconfig` can mitigate the issue, as the runtime hook will symlink libcuda from the host with precedence over the container's copy.

#906 seems to add back the same behavior as before #877, but given the description of CVE-2025-23359, I don't understand how that doesn't reintroduce the same CVE by default if it continues to mount the compat libs from the container.
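To illustrate the precedence point above, a small diagnostic sketch (not a fix) that shows which libcuda file the dynamic loader actually maps inside the container; the exact paths are image-specific:

```python
import ctypes

# Load libcuda through the normal search order, then inspect /proc/self/maps
# to see which file actually got mapped: the copy injected from the host or
# the compat library baked into the image.
ctypes.CDLL("libcuda.so.1")

with open("/proc/self/maps") as maps:
    mapped = {line.split()[-1] for line in maps if "libcuda" in line}

for path in sorted(mapped):
    print(path)
```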
With a simple two-node distributed test (one node on 1.17.3, one on 1.17.4), even with the same host driver on both, when the container's libcuda does not match the host driver and you depend on PTX compatibility across nodes, you will get:
What is the expected behavior for this scenario? How does #906 avoid the same issues intended to be fixed in response to the CVE?
The only workable change seems to be running `ldconfig` at runtime in every application with this dependency, which is very tedious.
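For concreteness, the workaround amounts to an entrypoint wrapper along these lines -- a sketch only, and the compat directory path below is an assumption about our image layout rather than anything the toolkit defines:

```python
#!/usr/bin/env python3
# Hypothetical entrypoint wrapper sketching the runtime-ldconfig workaround.
import os
import subprocess
import sys

# Assumption: the image bakes its compat libs here; adjust to your layout.
COMPAT_DIR = "/usr/local/cuda/compat"

if len(sys.argv) < 2:
    sys.exit("usage: entrypoint.py COMMAND [ARGS...]")

# Rebuild the loader cache at container start, after the runtime hook has
# already run, so the compat directory is registered again.
subprocess.run(["ldconfig", COMPAT_DIR], check=True)

# Hand control to the real training command.
os.execvp(sys.argv[1], sys.argv[1:])
```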