A collection of ultra-simple yet high-performance CUDA kernels.
This repository provides minimal (~20-line) implementations of essential CUDA kernels that achieve reasonable performance while staying as simple as possible. Perfect for learning, experimenting, or building lightweight custom CUDA extensions.
| Kernel | Description | PyTorch Equivalent | Example API Usage | Performance vs PyTorch |
|---|---|---|---|---|
| 🔹 GEMV | General Matrix-Vector Multiplication | `torch.mv(A, x)` | `gemv(A, x)` | ⚡ ~72–141% |
| 🔹 GEVM | General Vector-Matrix Multiplication | `torch.matmul(x, A)` | `gevm(x, A)` | ⚡ ~46–80% |
| 🔹 GEMM | General Matrix-Matrix Multiplication | `torch.mm(A, B)` | `gemm(A, B)` | ⚡ ~14–25% |
| 🔹 Batched GEMV | Batched Matrix-Vector Multiplication | `torch.matmul(A, x)` | `batched_gemv(A, x)` | ⚡ ~77–188% |
| 🔹 Batched GEVM | Batched Vector-Matrix Multiplication | `torch.matmul(x, A)` | `batched_gevm(x, A)` | ⚡ ~80–98% |
| 🔹 Batched GEMM | Batched Matrix-Matrix Multiplication | `torch.bmm(A, B)` | `batched_gemm(A, B)` | ⚡ ~14–18% |
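To give a flavor of what a ~20-line kernel in this style can look like, here is a hedged sketch of a GEMV kernel (not the repository's actual code): one warp per output row, each lane striding across the row's columns, followed by a warp-shuffle reduction. The kernel name, layout assumption (row-major `A` of shape `M × N`), and launch configuration are illustrative.

```cuda
// Sketch only: minimal warp-per-row GEMV, y = A * x.
// Assumes A is row-major (M x N), x has length N, y has length M.
__global__ void gemv_kernel(const float* A, const float* x,
                            float* y, int M, int N) {
    int row  = blockIdx.x;    // one 32-thread block (one warp) per row of A
    int lane = threadIdx.x;   // lane index within the warp
    if (row >= M) return;

    // Each lane accumulates a strided partial dot product over the row.
    float sum = 0.0f;
    for (int col = lane; col < N; col += 32)
        sum += A[row * N + col] * x[col];

    // Tree reduction across the warp via shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[row] = sum;
}

// Example launch: one warp per row.
// gemv_kernel<<<M, 32>>>(d_A, d_x, d_y, M, N);
```

Coalesced loads of `A` across lanes plus a shuffle-based reduction (no shared memory) is one common way such a short kernel can stay competitive with `torch.mv` for skinny problems.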