Clarification on the utilization of ggml_gemm_* in llama.cpp #12423
skykongkong8 asked this question in Q&A (unanswered)
Hello everyone,
I am reaching out to seek clarification regarding the current utilization of the `ggml_gemm_q4_0_4x8_q8_0` function within the llama.cpp project. Upon examining the codebase, it appears that the `GGML_OP_MUL_MAT` operation is primarily associated with the `ggml_vec_dot_*` kernels in `ggml_compute_forward_mul_mat`, rather than with the `ggml_gemm_*` kernels.
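To make the distinction concrete, here is a minimal conceptual sketch of the two kernel shapes in plain C with plain floats and made-up names (`vec_dot_f32`, `gemm_tile_f32`). This is not ggml's actual quantized API, only an illustration: a dot-style kernel produces a single output element per call, while a gemm-style kernel produces a whole output tile per call and can therefore reuse loaded operands across several rows and columns.

```c
/* Conceptual sketch only: plain-float stand-ins for the two kernel shapes,
 * not ggml's actual quantized signatures. */
#include <stdio.h>

/* "dot"-style kernel: one output element per call
 * (one row of A against one row of B^T). */
static void vec_dot_f32(int n, float *s, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    *s = sum;
}

/* "gemm"-style kernel: a whole nr x nc tile of outputs per call,
 * so each loaded row of B can be reused across several rows of A. */
static void gemm_tile_f32(int n, float *s, int ldc,
                          const float *a, int nr,   /* nr rows of A          */
                          const float *b, int nc) { /* nc rows of B^T        */
    for (int r = 0; r < nr; ++r) {
        for (int c = 0; c < nc; ++c) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) sum += a[r*n + i] * b[c*n + i];
            s[r*ldc + c] = sum;
        }
    }
}

int main(void) {
    enum { N = 4, NR = 2, NC = 2 };
    float a[NR][N] = {{1,2,3,4},{5,6,7,8}};
    float b[NC][N] = {{1,0,1,0},{0,1,0,1}};
    float c_dot[NR][NC], c_gemm[NR][NC];

    /* dot path: one kernel call per output element */
    for (int r = 0; r < NR; ++r)
        for (int c = 0; c < NC; ++c)
            vec_dot_f32(N, &c_dot[r][c], a[r], b[c]);

    /* gemm path: one kernel call for the whole 2x2 tile */
    gemm_tile_f32(N, &c_gemm[0][0], NC, &a[0][0], NR, &b[0][0], NC);

    printf("dot:  %g %g %g %g\n", c_dot[0][0],  c_dot[0][1],  c_dot[1][0],  c_dot[1][1]);
    printf("gemm: %g %g %g %g\n", c_gemm[0][0], c_gemm[0][1], c_gemm[1][0], c_gemm[1][1]);
    return 0;
}
```

My assumption is that this tile-at-a-time shape is what lets the aarch64 `q4_0_4x8` path amortize loads and dequantization across rows, which would explain the GFLOPS gap I describe below.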
In my experiments on an aarch64 system (specifically, a Galaxy S25U, CPU only), I observed significant performance improvements when employing the `ggml_gemm` kernel over the `ggml_dot` kernel (with a single thread, but still noteworthy, I believe), as evidenced by higher GFLOPS in the now-deprecated matmul benchmark tests from an older llama.cpp branch. This result has been reported here as well. This raises the question of whether the `ggml_gemm_q4_0_4x8_q8_0` function has been deprecated or is no longer in active use.
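For context on how I read the GFLOPS figures: they follow the usual count of 2*M*N*K floating-point operations per matmul divided by wall-clock time. A minimal timing harness along those lines is sketched below; `kernel_mul_mat` is a hypothetical placeholder for whichever kernel path is under test, not the actual (now removed) llama.cpp benchmark code.

```c
/* Minimal timing sketch, assuming a hypothetical kernel_mul_mat();
 * not the actual llama.cpp benchmark. Requires POSIX clock_gettime(). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Computes C[m][n] = dot(A row m, B row n), i.e. C = A * B^T with both
 * operands stored row-major. Swap in the kernel under test here. */
static void kernel_mul_mat(int M, int N, int K,
                           float *C, const float *A, const float *B) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) sum += A[m*K + k] * B[n*K + k];
            C[m*N + n] = sum;
        }
}

int main(void) {
    const int M = 512, N = 512, K = 512, iters = 10;
    float *A = malloc((size_t)M * K * sizeof(float));
    float *B = malloc((size_t)N * K * sizeof(float));
    float *C = malloc((size_t)M * N * sizeof(float));
    for (int i = 0; i < M*K; ++i) A[i] = (float)(i % 7);
    for (int i = 0; i < N*K; ++i) B[i] = (float)(i % 5);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; ++it) kernel_mul_mat(M, N, K, C, A, B);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec    = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * M * N * K * iters / sec / 1e9;
    printf("%.3f s, %.2f GFLOPS\n", sec, gflops);

    free(A); free(B); free(C);
    return 0;
}
```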
I understand that on both NUMA and non-NUMA systems, especially when model weights are stored in a transposed format, the `ggml_dot` approach might offer practical advantages. However, given the performance disparities observed in my benchmarks, I am curious about the rationale behind always favoring `ggml_dot` over `ggml_gemm` in the current implementation.

Could you please provide insights into the following:
1. **Current status:** Is the `ggml_gemm_q4_0_4x8_q8_0` function still actively used in the llama.cpp codebase, or has it been deprecated in favor of other implementations? (Or am I the only newbie who can't find it?)
2. **Design considerations:** What are the primary reasons for associating the `GGML_OP_MUL_MAT` operation with `ggml_dot` kernels instead of `ggml_gemm` kernels? Are there specific architectural or performance considerations that influenced this decision?
I appreciate the time and effort the team invests in maintaining and enhancing llama.cpp and look forward to your insights on this matter.
Thanks!