
[BUG]: ops.matmul errors when batch dimension is larger than 65535 (2^16-1) on GPU #4047

Open
gphlipot opened this issue Mar 3, 2025 · 0 comments
Labels: bug (Something isn't working), max, max-repo

gphlipot commented Mar 3, 2025

Bug description

When running a graph with an ops.matmul that operates on tensors with a batch dimension larger than 2^16-1, I get a CUDA error: CUDA call failed: CUDA_ERROR_INVALID_VALUE (invalid argument). The same graph executes fine on a CPU.

Steps to reproduce

Here is a Python script that calls ops.matmul on a Bx1x1 tensor. It is run on a CPU and on a GPU, with batch sizes B = 65535 and B = 65536.

from max.driver import Accelerator, Tensor, CPU
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import Graph, TensorType, ops


with Graph(
    "batch_matmul",
    input_types=[
        TensorType(dtype=DType.float32, shape=("batch", 1, 1)),
    ],
) as graph:
    x = graph.inputs[0]
    y = ops.matmul(x, x)  # batched matmul on a (batch, 1, 1) tensor
    graph.output(y)


for device in [CPU(), Accelerator()]:
    session = InferenceSession(devices=[device])
    model = session.load(graph)
    for batch_size in [65535, 65536]:  # 2**16 - 1 succeeds; 2**16 fails on GPU
        x = Tensor.zeros((batch_size, 1, 1), DType.float32)
        print(f"running matmul on {device.label} with batch size {batch_size}")
        result = model.execute(x)[0]

When I run this on my machine (with an RTX 3090), I get output like:

running matmul on cpu with batch size 65535
running matmul on cpu with batch size 65536
running matmul on gpu with batch size 65535
running matmul on gpu with batch size 65536
...
ValueError: At Kernels/mojo/gpu/host/device_context.mojo:1685:26: CUDA call failed: CUDA_ERROR_INVALID_VALUE (invalid argument)
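
A plausible cause, offered here only as a guess and not confirmed anywhere in this issue: CUDA limits gridDim.y and gridDim.z to 65535 (only gridDim.x may be as large as 2^31 - 1), so a batched kernel that maps the batch index onto the grid's Y or Z dimension would hit CUDA_ERROR_INVALID_VALUE exactly once the batch exceeds 65535. The snippet below only queries those limits on the current device; it uses numba purely as an illustrative dependency and is not part of the reproduction above.

# Diagnostic sketch (assumes numba is installed): print the CUDA grid-size
# limits, which cap gridDim.y and gridDim.z at 65535 on current hardware.
from numba import cuda

dev = cuda.get_current_device()
print("max grid dims (x, y, z):",
      dev.MAX_GRID_DIM_X,   # typically 2**31 - 1
      dev.MAX_GRID_DIM_Y,   # typically 65535
      dev.MAX_GRID_DIM_Z)   # typically 65535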

System information

$ magic info
     Magic version: 0.7.0
System
------------
       Pixi version: 0.41.3
           Platform: linux-64
   Virtual packages: __unix=0=0
                   : __linux=6.8.0=0
                   : __glibc=2.39=0
                   : __cuda=12.8=0
                   : __archspec=1=zen4



$ magic list max
Package     Version               Build    Size       Kind   Source
max         25.2.0.dev2025030205  release  9.7 KiB    conda  max
max-core    25.2.0.dev2025030205  release  238.3 MiB  conda  max-core
max-python  25.2.0.dev2025030205  release  117.9 MiB  conda  max-python
gphlipot added the bug and max labels on Mar 3, 2025