blog/2025-05-27-rust-cuda-update.mdx

---
title: "Rust CUDA May 2025 project update"
authors: [LegNeato]
tags: ["announcement", "cuda"]
---

import Gh from "@site/blog/src/components/UserMention";

Rust CUDA enables you to write and run [CUDA](https://developer.nvidia.com/cuda-toolkit)
kernels in Rust, executing directly on NVIDIA GPUs using [NVVM
IR](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html).

Work on the project is ongoing, and we wanted to share an update.

**To follow along or get involved, check out the [`rust-cuda` repo on GitHub](https://github.com/rust-gpu/rust-cuda).**

<!-- truncate -->
## New Docker images

Thanks to <Gh user="adamcavendish" />, we now automatically build and publish Docker
images as part of CI. These images are based on [NVIDIA's official CUDA
containers](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
and come preconfigured to build and run Rust GPU kernels.

Rust CUDA uses [NVVM](https://docs.nvidia.com/cuda/nvvm-ir-spec/) under the hood, which
is NVIDIA's LLVM-based CUDA frontend. NVVM is currently based on LLVM 7, and getting it
set up manually can be tedious and error-prone. These images solve the setup issue.
## Improved constant memory handling

### Background

CUDA exposes [distinct memory
spaces](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy),
each with different characteristics:
| Memory Space    | Scope       | Speed     | Size        | Use Case                                  |
| :-------------- | :---------- | :-------- | :---------- | :---------------------------------------- |
| Registers       | Per thread  | Fastest   | Very small  | Thread-local temporaries                  |
| Shared memory   | Per block   | Fast      | ~48 KB      | Inter-thread communication within a block |
| Constant memory | Device-wide | Fast read | 64 KB total | Read-only values broadcast to all threads |
| Global memory   | Device-wide | Slower    | GBs         | General-purpose read/write memory         |
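For readers new to Rust CUDA, a kernel that reads a small lookup table from constant memory might look roughly like the sketch below. It uses the `#[cuda_std::address_space(constant)]` attribute discussed later in this post; the kernel name, table, and body are illustrative, and building it requires the Rust CUDA toolchain rather than plain `rustc`:

```rust
use cuda_std::prelude::*;

// Small, read-only, and read by every thread: a good fit for constant memory.
#[cuda_std::address_space(constant)]
static SCALE_TABLE: [f32; 4] = [1.0, 0.5, 0.25, 0.125];

// Each thread scales one element of `data` by a value from the table.
#[kernel]
pub unsafe fn scale(data: *mut f32, len: usize) {
    let i = thread::index_1d() as usize;
    if i < len {
        let elem = &mut *data.add(i);
        *elem *= SCALE_TABLE[i % SCALE_TABLE.len()];
    }
}
```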
CUDA C++ code is often monolithic with minimal abstraction and everything in one file.
Rust CUDA brings idiomatic Rust to GPU programming and encourages modularity, traits,
generics, and reuse of third-party `no_std` crates from [crates.io](https://crates.io).
As a result, CUDA programs written in Rust tend to be more complex and depend on more
static data spread across your code and its dependencies.

A good example is
[`curve25519-dalek`](https://docs.rs/curve25519-dalek/latest/curve25519_dalek/), a
cryptographic crate that defines large static lookup tables for scalar multiplication
and point decompression. These values are immutable and read-only—ideal for constant
memory—but together they exceed the 64 KB limit. Using `curve25519-dalek` as a dependency
means your kernel's static data will never entirely fit in constant memory.
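To make the arithmetic concrete, here is a host-side sketch. The table names and sizes are invented for illustration (they are not `curve25519-dalek`'s actual tables), but they show how quickly static data blows past a 64 KB budget:

```rust
use std::mem::size_of_val;

// Hypothetical lookup tables standing in for a dependency's static data.
static MUL_TABLE: [u64; 4096] = [0; 4096]; // 4096 * 8 bytes = 32 KiB
static DECOMPRESS_TABLE: [u64; 8192] = [0; 8192]; // 8192 * 8 bytes = 64 KiB

// CUDA's total constant memory budget.
const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024;

fn main() {
    let total = size_of_val(&MUL_TABLE) + size_of_val(&DECOMPRESS_TABLE);
    println!("total static data: {total} bytes");
    // 98304 bytes: both tables together cannot fit in constant memory.
    assert!(total > CONSTANT_MEMORY_LIMIT);
}
```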
### The issue

Previously, Rust CUDA would try to place all eligible static values into constant memory
automatically. If you had too many, or one was too big, your kernel would break at
runtime and CUDA would return an `IllegalAddress` error with no clear cause.

Manual placement via `#[cuda_std::address_space(constant)]` or
`#[cuda_std::address_space(global)]` was possible, but only for code you controlled. The
annotations did not help for dependencies pulled from crates.io. This made it risky to
use larger crates or write more modular GPU programs, because at any point they might tip
over the 64 KB limit and start throwing runtime errors.

This situation had the potential to create frustrating and difficult-to-diagnose bugs.
For example:

- Adding a new `no_std` crate to a project could inadvertently push the total static data size over the constant memory limit, causing crashes. This could happen even if the new crate's functionality was never directly invoked, simply because its static data was included.
- A kernel might function correctly in one build configuration but fail in another if different features or Cargo flags changed which static variables were included in the final binary.
- If a large static variable was initially unused, the compiler might optimize it away. If subsequent code changes caused that static to be referenced, it would be included, potentially tripping the memory limit and causing runtime failures.
- Code behavior could vary unexpectedly across different versions of a dependency, or between debug and release builds.
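The build-configuration pitfall can be sketched with a hypothetical Cargo feature (the `big-tables` feature name and the sizes are invented for illustration): the same source compiles to very different static footprints depending on which flags are set.

```rust
use std::mem::size_of_val;

// With the (hypothetical) `big-tables` feature enabled, this static alone
// is 128 KiB and could never fit in CUDA's 64 KB constant memory.
#[cfg(feature = "big-tables")]
static LOOKUP: [u64; 16384] = [0; 16384];

// Without the feature, the table is a comfortable 8 KiB.
#[cfg(not(feature = "big-tables"))]
static LOOKUP: [u64; 1024] = [0; 1024];

fn main() {
    // A build that works today can start failing after a feature flag flips.
    println!("LOOKUP occupies {} bytes", size_of_val(&LOOKUP));
}
```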
### The fix

New contributor <Gh user="brandonros" /> and Rust CUDA maintainer <Gh user="LegNeato" />
[landed a
change](https://github.com/Rust-GPU/Rust-CUDA/commit/afb147ed51fbb14b758e10a0a24dbc2311a52b82)
that avoids these pitfalls with a conservative default and a safe opt-in mechanism:

1. By default, all statics are placed in global memory.

2. A new opt-in flag, `--use-constant-memory-space`, enables automatic placement in constant memory.

3. If a static is too large, it is spilled to global memory automatically, even when the flag is enabled.

4. Manual overrides with `#[cuda_std::address_space(constant)]` or `#[cuda_std::address_space(global)]` still work and take precedence.

<br />
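The decision logic amounts to something like the following host-side sketch. This is a simulation of the policy for illustration; `place_static` and `AddressSpace` are invented names, not the actual codegen internals:

```rust
// CUDA's constant memory budget.
const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024;

#[derive(Debug, PartialEq, Clone, Copy)]
enum AddressSpace {
    Global,
    Constant,
}

/// Decide where a static lands, mirroring the rules above.
fn place_static(
    size: usize,
    use_constant_memory_space: bool,        // the new opt-in flag
    manual_override: Option<AddressSpace>,  // #[cuda_std::address_space(...)]
) -> AddressSpace {
    match manual_override {
        // 4. Manual annotations always take precedence.
        Some(space) => space,
        None => {
            if !use_constant_memory_space {
                AddressSpace::Global // 1. Conservative default: global memory.
            } else if size > CONSTANT_MEMORY_LIMIT {
                AddressSpace::Global // 3. Too large: spill automatically.
            } else {
                AddressSpace::Constant // 2. Opt-in automatic placement.
            }
        }
    }
}

fn main() {
    assert_eq!(place_static(1024, false, None), AddressSpace::Global);
    assert_eq!(place_static(1024, true, None), AddressSpace::Constant);
    assert_eq!(place_static(128 * 1024, true, None), AddressSpace::Global);
    assert_eq!(
        place_static(128 * 1024, true, Some(AddressSpace::Constant)),
        AddressSpace::Constant
    );
    println!("placement policy behaves as described");
}
```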
This gives developers some level of control without the risk of unstable runtime
behavior.
### Future work

This change prevents runtime errors and hard-to-debug issues, but it may reduce
performance in some cases by not fully utilizing constant memory.

The long-term goal is to make automatic constant memory placement smart enough that we
can turn it on by default without breaking user code. To get there, we need
infrastructure to support correct and tunable placement logic.

Planned improvements include:

1. Tracking total constant memory usage across all static variables during codegen.
2. Spilling based on cumulative usage, not just individual static size.
3. Failing at compile time when the limit is exceeded, especially for manually annotated statics.
4. Compiler warnings when usage is close to the 64 KB limit, perhaps with a configurable threshold.
5. User-defined packing policies, such as prioritizing constant placement of small or large statics, or statics from a particular crate.

<br />
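Items 1 and 2 amount to tracking a running total and spilling on overflow, roughly like this sketch (a greedy first-fit simulation with invented names, not the actual codegen):

```rust
// CUDA's constant memory budget.
const CONSTANT_MEMORY_LIMIT: usize = 64 * 1024;

/// Greedily assign statics to constant memory until the cumulative budget is
/// exhausted; everything else spills to global memory.
fn pack_statics<'a>(statics: &[(&'a str, usize)]) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut used = 0;
    let mut constant = Vec::new();
    let mut global = Vec::new();
    for &(name, size) in statics {
        if used + size <= CONSTANT_MEMORY_LIMIT {
            used += size; // item 1: track cumulative usage, not just this static
            constant.push(name);
        } else {
            global.push(name); // item 2: spill on cumulative overflow
        }
    }
    (constant, global)
}

fn main() {
    let statics = [("a", 40 * 1024), ("b", 30 * 1024), ("c", 10 * 1024)];
    let (constant, global) = pack_statics(&statics);
    // `a` (40 KiB) and `c` (10 KiB) fit together; adding `b` would overflow.
    assert_eq!(constant, vec!["a", "c"]);
    assert_eq!(global, vec!["b"]);
}
```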
These should give developers control and enable using profiling data or usage frequency
to drive placement decisions for maximum performance.

**If these improvements sound interesting to you, join us in [issue
#218](https://github.com/Rust-GPU/Rust-CUDA/issues/218).** We're always looking for new
contributors!
## Updated examples and CI

<Gh user="giantcow" />, <Gh user="jorge-ortega" />, <Gh user="adamcavendish" />, and
<Gh user="LegNeato" /> fixed broken examples, cleaned up CI, and added a new
[GEMM](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3) example.
These steady improvements are important to keep the project healthy and usable.
## Cleaned up bindings

CUDA libraries ship as binary objects, typically wrapped in Rust using `-sys` crates.
With many subframeworks like [cuDNN](https://developer.nvidia.com/cudnn),
[cuBLAS](https://developer.nvidia.com/cublas), and
[OptiX](https://developer.nvidia.com/optix), maintaining these crates requires
generating bindings automatically via
[`bindgen`](https://github.com/rust-lang/rust-bindgen).

<Gh user="adamcavendish" /> and <Gh user="jorge-ortega" /> streamlined our `bindgen`
setup to simplify maintenance and make subframeworks easier to include or exclude.
## Call for contributors

We need your help to shape the future of CUDA programming in Rust. Whether you're a
maintainer, contributor, or user, there's an opportunity to [get
involved](https://github.com/rust-gpu/rust-cuda). We're especially interested in adding
maintainers to make the project sustainable.

Be aware that the process may be a bit bumpy, as we are still getting the project in
order.

If you'd prefer to focus on non-proprietary, multi-vendor platforms, check out our
related **[Rust GPU](https://rust-gpu.github.io/)** project. It is similar to Rust CUDA
but targets [SPIR-V](https://www.khronos.org/spir/) for
[Vulkan](https://www.vulkan.org/) GPUs.
