[torch.compile] Fuse RMSNorm with quant #9138

ProExpertProg · 2024-10-07T22:51:24Z

This PR enables fusing rms_norm and quant ops in the torch.compile backend. It adds all required infrastructure and new fused rms_norm_quant kernels. Only static FP8 quantization is supported in this PR, with more formats and datatypes to be added later to keep this PR as short as possible.
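To make the fused operation concrete, here is a minimal pure-Python sketch of what RMSNorm followed by static quantization computes, next to a fused variant that never materializes the normalized row. It is an illustration under assumptions, not the kernels added in this PR: the function names, the `eps` default, and the list-based layout are hypothetical, and the final rounding to 8-bit floats is omitted (values are only scaled and clamped to the FP8 e4m3 range).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def _clamp(v):
    """Clamp a value to the representable FP8 e4m3 range."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def static_quant(x, scale):
    """Static quantization: divide by a precomputed scale, then clamp.

    Rounding to an actual 8-bit float is left out of this sketch.
    """
    return [_clamp(v / scale) for v in x]


def rms_norm_static_quant(x, weight, scale, eps=1e-6):
    """Fused variant: normalize, scale, and clamp without an intermediate row."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    inv_scale = 1.0 / scale
    return [_clamp(v * inv_rms * w * inv_scale) for v, w in zip(x, weight)]
```

The unfused path is `static_quant(rms_norm(x, weight), scale)`; the fused function should agree with it up to floating-point rounding while skipping the write and re-read of the normalized row, which is where a fused kernel would save memory traffic.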

To enable fusion, TORCH_COMPILE_LEVEL needs to be at least 3, and the RMSNorm custom op needs to be enabled (by setting TORCH_CUSTOM_OPS either to all or +rms_norm). TORCH_ENABLE_FUSION needs to be set as well (on by default).
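The three conditions above can be read as a single predicate. The sketch below is hypothetical and only restates the description: the variable names are copied verbatim from the paragraph above, while the comma-separated parsing of TORCH_CUSTOM_OPS and the use of "0" to disable TORCH_ENABLE_FUSION are assumptions, not confirmed vLLM behavior.

```python
def fusion_enabled(env):
    """Return True if the settings described above would enable fusion.

    `env` is a mapping such as os.environ. The parsing rules here are
    assumptions made for illustration only.
    """
    # Compilation level must be at least 3 (assumed default: 0).
    level = int(env.get("TORCH_COMPILE_LEVEL", "0"))

    # RMSNorm custom op must be enabled via "all" or "+rms_norm"
    # (assumed to be a comma-separated list).
    ops = [s.strip() for s in env.get("TORCH_CUSTOM_OPS", "").split(",")]
    rms_norm_on = "all" in ops or "+rms_norm" in ops

    # Fusion is on by default; "0" is assumed to turn it off.
    fusion_on = env.get("TORCH_ENABLE_FUSION", "1") != "0"

    return level >= 3 and rms_norm_on and fusion_on
```

For example, `fusion_enabled({"TORCH_COMPILE_LEVEL": "3", "TORCH_CUSTOM_OPS": "+rms_norm"})` evaluates to True under these assumptions, while dropping the custom-op setting makes it False.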

This PR gives a roughly 2% end-to-end speedup. TODO: detailed numbers.

mergify · 2024-11-01T18:51:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. @ProExpertProg please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

youkaichao · 2024-11-08T17:27:54Z

vllm/compilation/fusion.py

+                   for node in match.nodes)
+
+    def __call__(self, graph: torch.fx.Graph):
+        self.dump_graph(graph, "before_fusion")


We can use PyTorch's built-in lazy_format_graph_code; see vllm/vllm/compilation/backends.py, line 365 in f677862:

logger.debug("%s", lazy_format_graph_code("before split", self.graph))

ProExpertProg · 2024-11-08

I prefer having the graph go to a file because I can keep multiple versions around, compare them, and navigate more easily. If it's all printed to the console, I'd most likely need to copy each of the graphs to a file manually.

But maybe we can make it configurable so we could do both?
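As a rough sketch of what "do both" could look like: write the rendered graph to a file when a dump directory is configured, and otherwise fall back to lazy debug logging so the string is only built if a handler actually formats it. Everything here is hypothetical and not code from this PR: `LazyText`, `dump_graph_text`, and the `dump_dir` parameter are invented for illustration, and `render` stands in for whatever produces the graph source.

```python
import logging
import os


class LazyText:
    """Defers building an expensive string until it is actually formatted."""

    def __init__(self, fn):
        self.fn = fn

    def __str__(self):
        return self.fn()


def dump_graph_text(render, stage, dump_dir=None, logger=None):
    """Dump render() to <dump_dir>/<stage>.py, or log it lazily at DEBUG.

    Returns the file path when a file was written, otherwise None.
    """
    if dump_dir is not None:
        os.makedirs(dump_dir, exist_ok=True)
        path = os.path.join(dump_dir, f"{stage}.py")
        with open(path, "w") as f:
            f.write(render())
        return path

    logger = logger or logging.getLogger(__name__)
    # render() runs only if DEBUG is enabled and a handler formats the record.
    logger.debug("%s", LazyText(render))
    return None
```

With a directory set, each stage (for example "before_fusion") ends up in its own file that can be kept and diffed; without one, the cost of rendering is skipped entirely unless debug logging is on.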

youkaichao · 2024-11-08T17:31:42Z

vllm/compilation/inductor_pass.py

+
+            logger.info("Printing graph to %s", filepath)
+            with open(filepath, "w") as f:
+                src = graph.python_code(root_module="self", verbose=True).src


This is quite similar to what I did in https://github.com/thuml/depyf. I did quite a lot of hacking to make sure the dumped code is readable and parsable by IDEs.

I think the logging infra can be designed together with compilation cache infra.

youkaichao

LGTM. I left some comments, but we don't need to address them in this PR. Thanks for your hard work!

youkaichao added 18 commits October 3, 2024 14:13

d5c329d adapt for dynamo
12e29fe fix tpu
504bd6c add backend
6353613 add use_custom_dispatcher
77ae8e7 update wrapper
4d99a58 update envs
2b79376 update custom op
7dfddcd support llama
abd1a65 update plugins
ce1907f update model runner
e1ea867 add support
511e07b add files
3bb8950 fix not use_custom_dispatcher
c4d7189 Merge branch 'main' into compile_integration
ed573fa do not test inductor
93ef0b5 add compile context
3cd40db remove model reference
4e28930 lint

youkaichao added 11 commits October 7, 2024 16:52

2ac7274 change levels
34fe820 Merge branch 'main' into compile_integration
a3c947e add levels
1a41c57 use const
db61567 use const
275ede9 use const
d1f084d use const
326c5b4 use const
9b7b0f3 use const
9cfa70c use const
e819be7 use const

ProExpertProg added 4 commits October 31, 2024 18:56

a252997 Add redundant reshapes removal pass. (Signed-off-by: luka <luka@neuralmagic.com>)
1b9717f Fix graph dumping when TP not initialized (Signed-off-by: luka <luka@neuralmagic.com>)
daca890 Reshape add edge-cases (Signed-off-by: luka <luka@neuralmagic.com>)
429db0a Singleton pattern matcher for fusion pass (Signed-off-by: luka <luka@neuralmagic.com>)

mergify bot added the needs-rebase label Nov 1, 2024

ProExpertProg added 3 commits November 8, 2024 16:02

e0b904e singleton fusion pass (Signed-off-by: luka <luka@neuralmagic.com>)
d73933b Merge remote-tracking branch 'upstream/main' into luka/rms-norm-fusion (Signed-off-by: luka <luka@neuralmagic.com>; Conflicts: vllm/compilation/backends.py)
d9375df format (Signed-off-by: luka <luka@neuralmagic.com>)

mergify bot removed the needs-rebase label Nov 8, 2024

bnellnm approved these changes Nov 8, 2024

d0a9e37 Add print (Signed-off-by: luka <luka@neuralmagic.com>)

youkaichao reviewed Nov 8, 2024

youkaichao approved these changes Nov 8, 2024

tlrmchlsmth reviewed Nov 8, 2024

tests/compile/test_fusion.py (resolved)

tlrmchlsmth approved these changes Nov 8, 2024

mgoin enabled auto-merge (squash) November 8, 2024 20:52

mgoin approved these changes Nov 8, 2024

mgoin merged commit 4f93dfe into vllm-project:main Nov 8, 2024
72 checks passed

JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024

[torch.compile] Fuse RMSNorm with quant (vllm-project#9138)

cec4efa

Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@126.com> Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>

charlifu mentioned this pull request Dec 2, 2024

[Feature][Hardware][AMD] Enable level 3 compilation on rocm #10836

Closed

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[torch.compile] Fuse RMSNorm with quant (vllm-project#9138)

5c5676b

Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@126.com>
