intended behavior for >=1.17.4 for distributed training workloads with CVE fixes? #934
Comments
The change in #906 implements a hook that only creates a file in the container's file system under …
got it, makes sense. re: the behavior in general -- should users be explicit about using the cuda compat libs, or is the forward compat hook from nvctk a long-term contract/solution? I guess what you mentioned is also the reason CDI was not affected?
I would say that it is not scalable for users to manage the CUDA forward compat libraries in the context of portable container images. Some mechanism to ensure that forward compatibility works in a container must be included in the NVIDIA Container Toolkit. We are aware that our changes from #877 broke this contract, and the decision to do this was not taken lightly. As a matter of interest, would you be able to test the changes from #906 in your infrastructure?
yep, I'm coordinating with folks to make sure we can test this as soon as it merges. It will take a little legwork on our side. I don't think we do source builds for nvctk right now, but I can check if that's workable pre-merge; otherwise happy to test when you folks finalize everything on main.
potentially dumb question -- I only found out about this behavior when it broke us. was mounting the compat libs unconditional in the past, or conditional on the host driver version like the new PR? a scenario where your application is built against CUDA 12.5, and includes the compat libs in the container, but runs on a mix of CUDA 12.4 and 12.6 hosts (with corresponding host driver + libcuda version), would never have worked with the behavior I described, because you either force 12.5 everywhere and end up with a lower libcuda than the host driver where 12.5 < 12.6, or you take libcuda from the host and then hit a mismatch resulting in the PTX error I shared in distributed scenarios. maybe the real question is whether we should be using PTX in a way that results in this behavior across nodes (semi rhetorical but also unsure?). I realize that might be out of scope for container stack folks to answer :)
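for reference, a minimal sketch (my own ctypes check, not anything from nvctk) that prints the driver API version reported by whatever libcuda.so.1 the loader resolves -- comparing this across nodes, and between host and container, is how the mismatch shows up:

```python
import ctypes

# Query the driver API version through whatever libcuda.so.1 the dynamic
# loader resolves for this process -- the same resolution the application's
# kernels will see.
libcuda = ctypes.CDLL("libcuda.so.1")

if libcuda.cuInit(0) != 0:
    raise RuntimeError("cuInit failed")

version = ctypes.c_int(0)
if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
    raise RuntimeError("cuDriverGetVersion failed")

# cuDriverGetVersion encodes the version as 1000 * major + 10 * minor.
print(f"driver API version: {version.value // 1000}.{(version.value % 1000) // 10}")
```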
just to be thorough -- it seems like if I swap some code paths and lean on cudnn or jax at the application layer rather than custom cuda or cutlass kernels, I don't see the same issue -- which makes me feel like there's some extra user error in the fact that we depended on this working at all. but there could be some confounding factor I'm not immediately seeing.
I/my team were impacted by the changes from #877 after upgrading to 1.17.4 while maintaining security SLAs for https://nvd.nist.gov/vuln/detail/CVE-2025-23359.
However, upgrading to 1.17.4 without using the opt-out feature flag breaks distributed training workloads that rely on PTX compatibility across nodes with different driver versions, where the cuda compat libs are baked into the container.
I cannot currently find any method using 1.17.4 to support these workloads in multi-node environments where host drivers may differ. This was possible prior to 1.17.4, or with the opt-out feature flag, as all containers across all hosts would see identical versions of libcuda. With the latest changes, no apparent amount of `LD_LIBRARY_PATH` manipulation or build-time `ldconfig` can mitigate the issue, as the runtime hook will symlink libcuda from the host with precedence over the container's copy.

#906 seems to add back the same behavior as before #877, but given the description of CVE-2025-23359, I don't understand how that doesn't reintroduce the same CVE by default if it continues to mount the compat libs from the container.
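To illustrate the precedence point above, a small diagnostic sketch (not a fix) that shows which libcuda file the dynamic loader actually maps inside the container; the exact paths are image-specific:

```python
import ctypes

# Load libcuda through the normal search order, then inspect /proc/self/maps
# to see which file actually got mapped: the copy injected from the host or
# the compat library baked into the image.
ctypes.CDLL("libcuda.so.1")

with open("/proc/self/maps") as maps:
    mapped = {line.split()[-1] for line in maps if "libcuda" in line}

for path in sorted(mapped):
    print(path)
```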
With a simple two-node distributed test (one node on 1.17.3, one on 1.17.4), even with the same host driver on both, when the container's libcuda does not match the host driver and you depend on PTX compatibility across nodes, you will get:
What is the expected behavior for this scenario? How does #906 avoid the same issues intended to be fixed in response to the CVE?
The only workable change seems to be running `ldconfig` at runtime in every application with this dependency, which is very tedious.
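For concreteness, the workaround amounts to an entrypoint wrapper along these lines -- a sketch only, and the compat directory path below is an assumption about our image layout rather than anything the toolkit defines:

```python
#!/usr/bin/env python3
# Hypothetical entrypoint wrapper sketching the runtime-ldconfig workaround.
import os
import subprocess
import sys

# Assumption: the image bakes its compat libs here; adjust to your layout.
COMPAT_DIR = "/usr/local/cuda/compat"

if len(sys.argv) < 2:
    sys.exit("usage: entrypoint.py COMMAND [ARGS...]")

# Rebuild the loader cache at container start, after the runtime hook has
# already run, so the compat directory is registered again.
subprocess.run(["ldconfig", COMPAT_DIR], check=True)

# Hand control to the real training command.
os.execvp(sys.argv[1], sys.argv[1:])
```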