Commit 146b489
Merge branch 'master' into croco_hf
2 parents: 7fef8e7 + 14dec0c

107 files changed: +4463, -1177 lines

.gitignore (+2)

@@ -45,6 +45,8 @@ lcov-report/
 tags
 .build/
 build*
+release
+debug
 !build-info.cmake
 !build-info.cpp.in
 !build-info.sh

CONTRIBUTING.md (+1, -1)

@@ -39,7 +39,7 @@
 
 _(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline.)_
 
-- Try to follow the existing patterns in the code (indentation, spaces, etc.). In case of doubt use `clang-format` to format the added code
+- Try to follow the existing patterns in the code (indentation, spaces, etc.). In case of doubt use `clang-format` (from clang-tools v15+) to format the added code
 - For anything not covered in the current guidelines, refer to the [C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines)
 - Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
 - Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggml-org/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$
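The `ggml_mul_mat` convention in the last bullet can be sanity-checked with a small NumPy sketch. This illustrates only the stated identity, it is not ggml code, and the shapes are arbitrary:

```python
import numpy as np

# The guideline states that C = ggml_mul_mat(ctx, A, B) satisfies C^T = A B^T, i.e. C = B A^T.
# With A of shape (m, k) and B of shape (n, k) (same inner dimension k), C has shape (n, m).
m, n, k = 4, 5, 3                 # arbitrary illustrative sizes
A = np.random.rand(m, k)
B = np.random.rand(n, k)

C = B @ A.T                       # shape (n, m)
assert np.allclose(C.T, A @ B.T)  # both forms of the identity agree
```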

Makefile (+12)

@@ -680,6 +680,10 @@ ifdef GGML_CUDA_CCBIN
     MK_NVCCFLAGS += -ccbin $(GGML_CUDA_CCBIN)
 endif # GGML_CUDA_CCBIN
 
+ifdef GGML_CUDA_NO_FA
+    MK_NVCCFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
     MK_NVCCFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS

@@ -800,6 +804,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
     HIPFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+    HIPFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o
 OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu))
 OBJ_GGML_EXT += $(OBJ_CUDA_TMPL)

@@ -876,6 +884,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
     MUSAFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+    MUSAFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
     MUSAFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS
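Since these are plain `ifdef` checks, the option is picked up from the environment or the make command line; a CUDA build without the FlashAttention kernels would presumably be invoked as something like `make GGML_CUDA=1 GGML_CUDA_NO_FA=1 llama-cli` (the target name here is only an example), and the same `GGML_CUDA_NO_FA=1` switch feeds `HIPFLAGS` and `MUSAFLAGS` for the HIP and MUSA builds.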

README.md (+1, -1)

@@ -219,7 +219,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
 - [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
 - [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale
-
+- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
 </details>
 
 <details>

common/arg.cpp (+8, -3)

@@ -813,13 +813,18 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_env("LLAMA_ARG_FLASH_ATTN"));
     add_opt(common_arg(
         {"-p", "--prompt"}, "PROMPT",
-        ex == LLAMA_EXAMPLE_MAIN
-            ? "prompt to start generation with\nif -cnv is set, this will be used as system prompt"
-            : "prompt to start generation with",
+        "prompt to start generation with; for system message, use -sys",
         [](common_params & params, const std::string & value) {
             params.prompt = value;
         }
     ).set_excludes({LLAMA_EXAMPLE_SERVER}));
+    add_opt(common_arg(
+        {"-sys", "--system-prompt"}, "PROMPT",
+        "system prompt to use with model (if applicable, depending on chat template)",
+        [](common_params & params, const std::string & value) {
+            params.system_prompt = value;
+        }
+    ).set_examples({LLAMA_EXAMPLE_MAIN}));
     add_opt(common_arg(
         {"--no-perf"},
         string_format("disable internal libllama performance timings (default: %s)", params.no_perf ? "true" : "false"),

common/common.h (+1)

@@ -261,6 +261,7 @@ struct common_params {
     std::string hf_repo = ""; // HF repo // NOLINT
     std::string hf_file = ""; // HF file // NOLINT
     std::string prompt = ""; // NOLINT
+    std::string system_prompt = ""; // NOLINT
     std::string prompt_file = ""; // store the external prompt file name // NOLINT
     std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state // NOLINT
     std::string input_prefix = ""; // string to prefix user inputs with // NOLINT
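Together with the `common/arg.cpp` change above, this splits the user prompt (`-p`) from a dedicated system prompt (`-sys`, stored in `common_params::system_prompt` and registered only for `LLAMA_EXAMPLE_MAIN`). A hypothetical invocation might look like `./llama-cli -m model.gguf -sys "You are a helpful assistant." -p "Hello"`, where the model path and prompt text are placeholders.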

convert_hf_to_gguf.py (+8, -3)

@@ -699,6 +699,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "b3f499bb4255f8ca19fccd664443283318f2fd2414d5e0b040fbdd0cc195d6c5":
             # ref: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
             res = "deepseek-r1-qwen"
+        if chkhsh == "ccc2ef013c104be7bae2965776d611e1d7a8a2a9c547dd93a682c9a9fc80352e":
+            # ref: https://huggingface.co/Xenova/gpt-4o
+            res = "gpt-4o"
 
         if res is None:
             logger.warning("\n")

@@ -2512,7 +2515,8 @@ def set_gguf_parameters(self):
         rms_eps = self.find_hparam(["rms_norm_eps"])
         max_pos_embds = self.find_hparam(["n_positions", "max_position_embeddings"])
         orig_max_pos_embds = self.find_hparam(["original_max_position_embeddings"])
-        rope_dims = n_embd // n_head
+        rot_pct = self.hparams.get("partial_rotary_factor", 1.0)
+        rope_dims = int(rot_pct * n_embd) // n_head
 
         self.gguf_writer.add_context_length(max_pos_embds)
         self.gguf_writer.add_rope_scaling_orig_ctx_len(orig_max_pos_embds)

@@ -2536,7 +2540,8 @@ def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
         n_head = self.find_hparam(["num_attention_heads", "n_head"])
         max_pos_embds = self.find_hparam(["n_positions", "max_position_embeddings"])
         orig_max_pos_embds = self.find_hparam(["original_max_position_embeddings"])
-        rope_dims = n_embd // n_head
+        rot_pct = self.hparams.get("partial_rotary_factor", 1.0)
+        rope_dims = int(rot_pct * n_embd) // n_head
 
         # write rope scaling for long context (128k) model
         rope_scaling = self.find_hparam(['rope_scaling'], True)

@@ -2565,7 +2570,7 @@ def generate_extra_tensors(self) -> Iterable[tuple[str, Tensor]]:
             raise KeyError('Missing the required key rope_scaling.long_factor or rope_scaling_short_factor')
 
         if len(long_factors) != len(short_factors) or len(long_factors) != rope_dims / 2:
-            raise ValueError(f'The length of rope long and short factors must be {rope_dims / 2}')
+            raise ValueError(f'The length of rope long and short factors must be {rope_dims / 2}. long_factors = {len(long_factors)}, short_factors = {len(short_factors)}.')
 
         yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FACTORS_LONG), torch.tensor(long_factors, dtype=torch.float32))
         yield (self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FACTORS_SHORT), torch.tensor(short_factors, dtype=torch.float32))
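The effect of honoring `partial_rotary_factor` can be seen with a standalone version of the calculation above. The hyperparameter values below are made up for illustration and are not taken from any particular model config:

```python
# Standalone sketch of the rope_dims calculation shown in the diff above.
def rope_dims(n_embd: int, n_head: int, partial_rotary_factor: float = 1.0) -> int:
    # scale the embedding width by the rotary factor before dividing by the head count
    return int(partial_rotary_factor * n_embd) // n_head

# illustrative numbers only
print(rope_dims(3072, 24))        # 128 -> old behaviour (factor defaults to 1.0)
print(rope_dims(3072, 24, 0.75))  # 96  -> only part of each head dimension is rotated
```

With a factor below 1.0 the required length of the rope long/short factor arrays shrinks as well, which is why the expanded `ValueError` message now also reports the observed lengths.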

convert_hf_to_gguf_update.py (+5)

@@ -109,6 +109,7 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "megrez", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Infinigence/Megrez-3B-Instruct"},
     {"name": "deepseek-v3", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-V3"},
     {"name": "deepseek-r1-qwen", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"},
+    {"name": "gpt-4o", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Xenova/gpt-4o", },
 ]
 
 

@@ -131,6 +132,10 @@ def download_model(model):
 
     files = ["config.json", "tokenizer.json", "tokenizer_config.json"]
 
+    if name == "gpt-4o":
+        # Xenova/gpt-4o is tokenizer-only, it does not contain config.json
+        files = ["tokenizer.json", "tokenizer_config.json"]
+
     if tokt == TOKENIZER_TYPE.SPM:
         files.append("tokenizer.model")
 
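For context, the `chkhsh` strings that `get_vocab_base_pre()` compares against (such as the new gpt-4o entry above) are SHA-256 digests of the token IDs that the upstream tokenizer produces for a fixed test string. A simplified sketch of that computation, using a placeholder test string and a hypothetical local download path, looks roughly like this:

```python
from hashlib import sha256

from transformers import AutoTokenizer

# placeholder text; the real script uses a much longer, multilingual test string
chktxt = "Hello world \n\n 3.14 éèà 🦙"
# hypothetical path where the tokenizer files would have been downloaded
tokenizer = AutoTokenizer.from_pretrained("models/tokenizers/gpt-4o")

chktok = tokenizer.encode(chktxt)           # token IDs for the test string
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # a digest of this kind is what get_vocab_base_pre() matches on
```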

docs/backend/SYCL.md (+14, -2)

@@ -42,6 +42,16 @@ The following release is verified with good quality:
 
 ## News
 
+- 2025.2
+  - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
+    |GPU|Base tokens/s|Increased tokens/s|Percent|
+    |-|-|-|-|
+    |PVC 1550|39|73|+87%|
+    |Flex 170|39|50|+28%|
+    |Arc770|42|55|+30%|
+    |MTL|13|16|+23%|
+    |ARL-H|14|17|+21%|
+
 - 2024.11
   - Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer.
 

@@ -97,8 +107,8 @@ SYCL backend supports Intel GPU Family:
 | Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
-| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
-| Intel iGPU | Support | iGPU in 13700k, i5-1250P, i7-1260P, i7-1165G7 |
+| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
+| Intel iGPU | Support | iGPU in 13700k,iGPU in 13400, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 

@@ -660,8 +670,10 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | Name | Value | Function |
 |-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
 | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
 | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
 
+
 ## Known Issues
 
 - `Split-mode:[row]` is not supported.
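Since `GGML_SYCL_DISABLE_OPT` is a runtime environment variable, an A/B comparison would presumably amount to running the same workload twice, once normally and once prefixed with `GGML_SYCL_DISABLE_OPT=1` (for example with `llama-bench`), and comparing the reported tokens/s.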
