ggml llama: align structs for memory optimization on 64-bit platforms: #120

Nexesenex · 2024-05-15T13:32:19Z

ggml_type_traits_t (80 -> 72 bytes)
llama_batch (72 -> 64 bytes)
llama_model_params (56 -> 48 bytes)
hash_node (32 -> 24 bytes)
ggml_compute_state (32 -> 24 bytes)
gguf_tensor_info (88 -> 80 bytes)


* Adding q8_0_r4

  We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared to 175.6 t/s for Q8_0.

* q8_0_r4: NEON

  We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max.

* q8_0_r4: Zen4 matrix-vector specialization

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Commit 2a9a84b: ggml llama: align structs for memory optimization on 64-bit platforms


Nexesenex merged commit dec7622 into Nexesenex:sidestream May 15, 2024
27 of 33 checks passed
