
Commit 5416002

llama : disable pipeline parallelism with nkvo (ggml-org#7265)
1 parent: efc8f76

File tree: 1 file changed (+5, -1 lines)


llama.cpp (+5, -1)
@@ -15849,7 +15849,11 @@ struct llama_context * llama_new_context_with_model(
     ctx->buf_compute_meta.resize(ggml_tensor_overhead()*LLAMA_MAX_NODES + ggml_graph_overhead_custom(LLAMA_MAX_NODES, false));
 
     // enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
-    bool pipeline_parallel = llama_get_device_count() > 1 && model->n_gpu_layers > (int)model->hparams.n_layer && model->split_mode == LLAMA_SPLIT_MODE_LAYER;
+    bool pipeline_parallel =
+        llama_get_device_count() > 1 &&
+        model->n_gpu_layers > (int)model->hparams.n_layer &&
+        model->split_mode == LLAMA_SPLIT_MODE_LAYER &&
+        params.offload_kqv;
 #ifndef GGML_USE_CUDA
     // pipeline parallelism requires support for async compute and events
     // currently this is only implemented in the CUDA backend
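
For context (not part of the commit): offload_kqv is the llama_context_params field that the -nkvo / --no-kv-offload CLI flag turns off, keeping the KV cache in host memory. The sketch below shows how the new condition is reached through the public C API; the model path and the n_gpu_layers value are placeholders, and this is an illustrative example rather than code from the repository.

// Minimal sketch, assuming the llama.cpp C API of this era.
// "model.gguf" and the parameter values are hypothetical.
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999;                    // fully offload the model
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split layers across GPUs

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.offload_kqv = false; // what -nkvo sets: KV cache stays in host memory

    // After this commit, pipeline_parallel evaluates to false here even on a
    // multi-GPU, fully offloaded, layer-split setup, because
    // params.offload_kqv is now part of the condition.
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}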
