
[Feature Request] Single compilation unit kernels and/or improved error messages #4030

Open
ematejska opened this issue Feb 26, 2025 · 2 comments
Labels
enhancement (New feature or request), max, max-repo

Comments

@ematejska
Collaborator

Originally filed by owenhilyard in #269. With the merge of the max and mojo repos, opening here.

What is your request?

This is a two-part request, but bundled since both parts address the same UX issue.

Part one is to make the "custom op not found" error message direct users to documentation on how to include kernels. This matters because the process differs between the Python and Mojo APIs: in Python, all of the examples pass the kernels to the InferenceSession constructor, while in Mojo the only option is the custom_ops_paths kwarg when loading a graph (Python supports that too, but there it is deprecated in favor of custom_extensions, which confuses things further).
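
To illustrate the divergence, here is a minimal sketch of the two Python registration paths, based only on the description above; the package path and graph are placeholders, and exact signatures may differ across MAX versions:

```python
from pathlib import Path

from max.engine import InferenceSession

kernel_pkg = Path("operations.mojopkg")  # placeholder: kernels compiled separately
graph = ...  # stand-in for a max.graph.Graph built elsewhere

# Pattern used in the Python examples: register custom ops when
# constructing the session.
session = InferenceSession(custom_extensions=[kernel_pkg])
model = session.load(graph)

# Older pattern (deprecated in Python, but the only option in Mojo):
# register the ops when loading the graph instead.
model = session.load(graph, custom_ops_paths=[kernel_pkg])
```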

Part two is to let a user adopt a "single source" model for custom ops, ideally allowing anything in the Mojo module tree to be pulled in. I know this likely means bundling MLIR inside the binary's data section (at least for ELF), but at present it appears there is no way to have a self-contained binary that uses MAX. Ideally, this would be extended to Modular's kernels as well, both to help with deploying MAX-based programs when the user is willing to write them in Mojo and to help with the total size of a MAX program/container.

Pinging @joshpeterson as requested.

What is your motivation for this change?

My initial assumption was that, since the Mojo compiler can see that I am using MAX, and the custom op I need is right there and imported, I would be able to use it without packaging a separate .mojopkg that has to live alongside my pure Mojo binary. As it stands, iterating on a custom op is a two-step process, shown below: first compile the ops, then mojo run the program that uses them. That this didn't work was very confusing, since both SYCL and CUDA can do it. It was not helped by the error message, which did not direct me toward a solution, nor by the divergence of the Mojo and Python APIs for MAX: I would expect the two APIs to be similar, with Mojo potentially needing to be more statically typed.
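
For illustration, the iteration loop currently looks something like this (package and file names are hypothetical):

```
mojo package my_ops/ -o my_ops.mojopkg   # step 1: recompile the kernels package
mojo run main.mojo                       # step 2: run the program that loads it
```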

This also makes me concerned about the potential "minimum package size" of Mojo programs using MAX: given the lack of tree shaking, and our current inability to tree-shake .mojopkg libraries ourselves, such programs may need to ship Modular's entire library of kernels, which may become very large over time.

Any other details?

No response

ematejska added the enhancement and max labels on Feb 26, 2025
@BradLarson
Contributor

@owenhilyard - With the addition of an exposed interface for using the Mojo MAX Driver API to compile and launch GPU code defined in the same file (and examples of this in action), does that satisfy the spirit of part two of your request here? Or is there more that you'd want from that portion?

We're still working on a better compilation model and error messaging for part one of your request, more to come on that.

@owenhilyard
Contributor

@BradLarson I think it gets a good chunk of the way there. However, for someone who is, say, iterating on a MAX custom op, this doesn't do very much unless they can abstract out the MAX parts to get at something usable via this model. That is doable for most of what I write, if a bit of a UX speed bump. For working on the GPU half of the ops, this will be great, since I can prepare the inputs and expected outputs ahead of time and get a massive drop in compile time from not compiling everything else.

Ideally, I would like to be able to have a custom op and a small main function that integrates it into a graph in the same file for iteration. However, as I mentioned, that is reasonable to work around. I would also like to see CPU support for single-source custom ops, since it would remove the need for that shim layer and most of my work is actually on CPUs; adding CPU support to max.driver.accelerator.compile, by contrast, would be nice for consistency's sake but much less helpful.

For me, exposing the structure of the source code is a non-issue, since it will be open source when I'm done, so dumping the .mojopkg into a section of the ELF file is fine for me.

The general thought behind this issue is that I think MAX should be at least as convenient as CUDA and SYCL are when you are learning. For larger projects, kernels tend to move to their own part of the codebase, but when people are learning there's a lot of friction introduced by forcing multiple compilation passes on what is likely a one- or two-file project trying to sum a vector or do a naive matmul to make sure everything works properly. It adds compounding frustration when things aren't working, because you end up compiling multiple times for every mistake, which I think is a likely point for people to decide to drop MAX and go back to their preferred library.

I'd imagine it also makes it harder for the LSP to provide useful feedback, which is an area where I think MAX could use some improvement on the Mojo side since, as I mentioned to Caroline, a lot of the API doesn't feel like a strongly typed library. I understand some of that is due to the implementation details of MAX, but mitigating that feeling would help make MAX attractive to people who like strongly typed languages.
