CubeCL Linear Algebra Library.

The crate contains common linear algebra algorithms.

Algorithms

Tiling 2D Matrix Multiplication.

The kernel is flexible and runs on virtually any hardware.

Cooperative Matrix Multiplication.

The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions. For f32 tensors, the inputs are cast to f16, but accumulation is still performed in f32. This may cause a small loss of precision in exchange for much faster execution.
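The precision trade-off described above can be illustrated on the CPU with a minimal sketch. This is not the crate's kernel: the helper names `round_to_f16_precision` and `dot_mixed` are hypothetical, and f16 is only approximated by rounding an f32 mantissa to 10 bits (the narrower f16 exponent range, subnormals, NaN, and infinity are ignored, so inputs are assumed finite and of moderate magnitude).

```rust
/// Hypothetical helper: rounds an f32 to the 10-bit mantissa of f16
/// (round to nearest, ties to even). The f16 exponent range is NOT modeled.
fn round_to_f16_precision(x: f32) -> f32 {
    const DROP: u32 = 13; // f32 has 23 mantissa bits, f16 has 10
    let bits = x.to_bits();
    let lsb = (bits >> DROP) & 1; // lowest bit that survives the rounding
    let rounded = bits.wrapping_add((1u32 << (DROP - 1)) - 1 + lsb);
    f32::from_bits(rounded & !((1u32 << DROP) - 1))
}

/// Dot product with inputs reduced to f16 precision and f32 accumulation.
fn dot_mixed(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| round_to_f16_precision(x) * round_to_f16_precision(y))
        .sum()
}

/// Reference dot product computed entirely in f32.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

fn main() {
    let a = [0.1_f32, 0.2, 0.3, 0.4];
    let b = [1.1_f32, 1.2, 1.3, 1.4];
    let full = dot_f32(&a, &b);
    let mixed = dot_mixed(&a, &b);
    println!("f32 result:   {full}");
    println!("mixed result: {mixed}");
    println!("abs error:    {}", (full - mixed).abs());
}
```

The error comes only from rounding the inputs: the product of two values with 11 significant bits fits exactly in an f32, so keeping the accumulator in f32 avoids compounding the loss across the sum.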

Benchmarks

You can run the benchmarks from the workspace with the following:

cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda

On an RTX 3070 we get the following results:

matmul-wgpu-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        13.289ms
  Variance    28.000ns
  Median      13.271ms
  Min         12.582ms
  Max         13.768ms
―――――――――――――――――――――――――

matmul-cuda-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        12.754ms
  Variance    93.000ns
  Median      12.647ms
  Min         12.393ms
  Max         14.501ms
―――――――――――――――――――――――――

matmul-cuda-f32-cmma

―――――――― Result ―――――――――
  Samples     100
  Mean        4.996ms
  Variance    35.000ns
  Median      5.084ms
  Min         4.304ms
  Max         5.155ms
―――――――――――――――――――――――――
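
To put the mean timings above side by side, the small sketch below computes the ratios between them. The `speedup` helper is hypothetical and only divides the mean values reported in this section; on these numbers the cmma kernel comes out roughly 2.5 times faster than tiling2d on CUDA.

```rust
/// Hypothetical helper: how many times faster `candidate_ms` is than `baseline_ms`.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Mean timings copied from the benchmark results above (RTX 3070).
    let tiling2d_wgpu_ms = 13.289;
    let tiling2d_cuda_ms = 12.754;
    let cmma_cuda_ms = 4.996;

    println!(
        "cmma vs tiling2d (cuda): {:.2}x",
        speedup(tiling2d_cuda_ms, cmma_cuda_ms)
    );
    println!(
        "cuda vs wgpu (tiling2d): {:.2}x",
        speedup(tiling2d_wgpu_ms, tiling2d_cuda_ms)
    );
}
```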
