Clarification on the utilization of ggml_gemm_* in llama.cpp #12423
skykongkong8 asked this question in Q&A (unanswered)
Hello everyone,
I am reaching out to seek clarification regarding the current utilization of the `ggml_gemm_q4_0_4x8_q8_0` function within the llama.cpp project. Upon examining the codebase, it appears that the `GGML_OP_MUL_MAT` operation is primarily associated with the `ggml_vec_dot_*` kernels in `ggml_compute_forward_mul_mat`, rather than with the `ggml_gemm_*` kernels.
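To make the distinction concrete, here is a minimal conceptual sketch of the two kernel shapes in plain C with plain floats and made-up names (`vec_dot_f32`, `gemm_tile_f32`). This is not ggml's actual quantized API, only an illustration: a dot-style kernel produces a single output element per call, while a gemm-style kernel produces a whole output tile per call and can therefore reuse loaded operands across several rows and columns.

```c
/* Conceptual sketch only: plain-float stand-ins for the two kernel shapes,
 * not ggml's actual quantized signatures. */
#include <stdio.h>

/* "dot"-style kernel: one output element per call
 * (one row of A against one row of B^T). */
static void vec_dot_f32(int n, float *s, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    *s = sum;
}

/* "gemm"-style kernel: a whole nr x nc tile of outputs per call,
 * so each loaded row of B can be reused across several rows of A. */
static void gemm_tile_f32(int n, float *s, int ldc,
                          const float *a, int nr,   /* nr rows of A          */
                          const float *b, int nc) { /* nc rows of B^T        */
    for (int r = 0; r < nr; ++r) {
        for (int c = 0; c < nc; ++c) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) sum += a[r*n + i] * b[c*n + i];
            s[r*ldc + c] = sum;
        }
    }
}

int main(void) {
    enum { N = 4, NR = 2, NC = 2 };
    float a[NR][N] = {{1,2,3,4},{5,6,7,8}};
    float b[NC][N] = {{1,0,1,0},{0,1,0,1}};
    float c_dot[NR][NC], c_gemm[NR][NC];

    /* dot path: one kernel call per output element */
    for (int r = 0; r < NR; ++r)
        for (int c = 0; c < NC; ++c)
            vec_dot_f32(N, &c_dot[r][c], a[r], b[c]);

    /* gemm path: one kernel call for the whole 2x2 tile */
    gemm_tile_f32(N, &c_gemm[0][0], NC, &a[0][0], NR, &b[0][0], NC);

    printf("dot:  %g %g %g %g\n", c_dot[0][0],  c_dot[0][1],  c_dot[1][0],  c_dot[1][1]);
    printf("gemm: %g %g %g %g\n", c_gemm[0][0], c_gemm[0][1], c_gemm[1][0], c_gemm[1][1]);
    return 0;
}
```

My assumption is that this tile-at-a-time shape is what lets the aarch64 `q4_0_4x8` path amortize loads and dequantization across rows, which would explain the GFLOPS gap I describe below.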
In my experiments on an aarch64 system (specifically, a Galaxy S25U, CPU only), I observed significant performance improvements when employing the `ggml_gemm` kernel over the `ggml_dot` kernel (with a single thread, but still noteworthy, I believe), as evidenced by higher GFLOPS in the now-deprecated matmul benchmark tests from an older llama.cpp branch. This result has been reported here as well. This raises the question of whether the `ggml_gemm_q4_0_4x8_q8_0` function has been deprecated or is no longer in active use.
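For context on how I read the GFLOPS figures: they follow the usual count of 2*M*N*K floating-point operations per matmul divided by wall-clock time. A minimal timing harness along those lines is sketched below; `kernel_mul_mat` is a hypothetical placeholder for whichever kernel path is under test, not the actual (now removed) llama.cpp benchmark code.

```c
/* Minimal timing sketch, assuming a hypothetical kernel_mul_mat();
 * not the actual llama.cpp benchmark. Requires POSIX clock_gettime(). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Computes C[m][n] = dot(A row m, B row n), i.e. C = A * B^T with both
 * operands stored row-major. Swap in the kernel under test here. */
static void kernel_mul_mat(int M, int N, int K,
                           float *C, const float *A, const float *B) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) sum += A[m*K + k] * B[n*K + k];
            C[m*N + n] = sum;
        }
}

int main(void) {
    const int M = 512, N = 512, K = 512, iters = 10;
    float *A = malloc((size_t)M * K * sizeof(float));
    float *B = malloc((size_t)N * K * sizeof(float));
    float *C = malloc((size_t)M * N * sizeof(float));
    for (int i = 0; i < M*K; ++i) A[i] = (float)(i % 7);
    for (int i = 0; i < N*K; ++i) B[i] = (float)(i % 5);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; ++it) kernel_mul_mat(M, N, K, C, A, B);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec    = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * M * N * K * iters / sec / 1e9;
    printf("%.3f s, %.2f GFLOPS\n", sec, gflops);

    free(A); free(B); free(C);
    return 0;
}
```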
I understand that on both NUMA and non-NUMA systems, especially when model weights are stored in a transposed format, the `ggml_dot` approach might offer practical advantages. However, given the performance disparities observed in my benchmarks, I am curious about the rationale behind always favoring `ggml_dot` over `ggml_gemm` in the current implementation.

Could you please provide insights into the following:
1. **Current status:** Is the `ggml_gemm_q4_0_4x8_q8_0` function still actively used in the llama.cpp codebase, or has it been deprecated in favor of other implementations? (Or am I the only newbie who can't find it?)
2. **Design considerations:** What are the primary reasons for associating the `GGML_OP_MUL_MAT` operation with `ggml_dot` kernels instead of `ggml_gemm` kernels? Are there specific architectural or performance considerations that influenced this decision?
I appreciate the time and effort the team invests in maintaining and enhancing llama.cpp and look forward to your insights on this matter.
Thanks!