liangyuwang/simple_cuda_kernel

Simple CUDA Kernels

A collection of ultra-simple yet high-performance CUDA kernels.

This repository provides minimal (~20 lines) implementations of essential CUDA kernels, achieving reasonable performance while maintaining maximum simplicity. Perfect for learning, experimenting, or building lightweight custom CUDA extensions.

Currently Implemented

| Kernel | Description | PyTorch Equivalent | Example API Usage | Performance vs PyTorch |
| --- | --- | --- | --- | --- |
| 🔹 GEMV | General Matrix-Vector Multiplication | `torch.mv(A, x)` | `gemv(A, x)` | ⚡ ~72–141% |
| 🔹 GEVM | General Vector-Matrix Multiplication | `torch.matmul(x, A)` | `gevm(x, A)` | ⚡ ~46–80% |
| 🔹 GEMM | General Matrix-Matrix Multiplication | `torch.mm(A, B)` | `gemm(A, B)` | ⚡ ~14–25% |
| 🔹 Batched GEMV | Batched Matrix-Vector Multiplication | `torch.matmul(A, x)` | `batched_gemv(A, x)` | ⚡ ~77–188% |
| 🔹 Batched GEVM | Batched Vector-Matrix Multiplication | `torch.matmul(x, A)` | `batched_gevm(x, A)` | ⚡ ~80–98% |
| 🔹 Batched GEMM | Batched Matrix-Matrix Multiplication | `torch.bmm(A, B)` | `batched_gemm(A, B)` | ⚡ ~14–18% |
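
To give a feel for what a kernel of roughly this size can look like, here is a minimal GEMV sketch. It is not taken from this repository: the names `gemv_kernel` and `gemv_host` are hypothetical, and the warp-per-row layout with a shuffle reduction is one common approach, assumed here for illustration. It assumes a row-major `float` matrix `A` of shape `(M, N)` and omits CUDA error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: one warp computes one output row,
// y[row] = sum_j A[row, j] * x[j], with A stored row-major as (M, N).
__global__ void gemv_kernel(const float* A, const float* x, float* y,
                            int M, int N) {
    int warps_per_block = blockDim.x / 32;
    int row  = blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    // row is uniform within a warp, so the whole warp exits together
    // and the full-mask shuffles below remain valid.
    if (row >= M) return;

    // Each lane accumulates a strided slice of the row (coalesced loads).
    float sum = 0.0f;
    for (int j = lane; j < N; j += 32)
        sum += A[(size_t)row * N + j] * x[j];

    // Tree reduction across the 32 lanes of the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and returns y. Error checking is omitted to keep the sketch short.
std::vector<float> gemv_host(const std::vector<float>& A,
                             const std::vector<float>& x, int M, int N) {
    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * M * N);
    cudaMalloc(&dx, sizeof(float) * N);
    cudaMalloc(&dy, sizeof(float) * M);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), sizeof(float) * N, cudaMemcpyHostToDevice);

    const int threads = 128;              // 4 warps per block (multiple of 32)
    const int warps   = threads / 32;
    const int blocks  = (M + warps - 1) / warps;
    gemv_kernel<<<blocks, threads>>>(dA, dx, dy, M, N);

    std::vector<float> y(M);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), dy, sizeof(float) * M, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return y;
}
```

Calling `gemv_host(A, x, M, N)` with a flattened row-major `A` mirrors the semantics of `torch.mv(A, x)` from the table above. Assigning one warp per row keeps global loads coalesced along the row, which tends to suit memory-bound operations such as GEMV; GEMM generally needs shared-memory tiling to approach library performance.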
