Hello,
I am trying to run mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ4_XS.gguf on my Windows 10 PC with 2x 2070 Supers. If I set `-sm layer`, the model loads but crashes on the first prompt. If I set it to `row`, it works, but seems slower than it should be when using both GPUs. The first few words of output are very fast in layer mode before the system crashes. Can someone please help me understand what I am doing wrong, or why it will not work?
Thank you
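For reference, the two invocations being compared look roughly like this (a sketch of llama.cpp's `llama-cli`; the prompt, `-ngl` value, and exact argument order are assumptions, not copied from my actual command line):

```shell
# Split-mode "layer": whole transformer layers are assigned to each GPU.
# In my case this loads fine but crashes on the first prompt.
llama-cli -m mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ4_XS.gguf \
  -ngl 99 -sm layer -p "Hello"

# Split-mode "row": individual tensors are split row-wise across both GPUs.
# This works for me, but runs slower than expected.
llama-cli -m mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ4_XS.gguf \
  -ngl 99 -sm row -p "Hello"
```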
The load output is mostly the same in both modes, with a handful of differences.
For row:
.
.
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CUDA0_Split model buffer size = 5929.22 MiB
load_tensors: CUDA1_Split model buffer size = 5889.53 MiB
load_tensors: CUDA0 model buffer size = 0.82 MiB
load_tensors: CUDA1 model buffer size = 0.76 MiB
load_tensors: CPU_Mapped model buffer size = 340.00 MiB
.
.
llama_context: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: CUDA0 compute buffer size = 300.00 MiB
llama_context: CUDA1 compute buffer size = 300.00 MiB
llama_context: CUDA_Host compute buffer size = 18.01 MiB
llama_context: graph nodes = 1366
llama_context: graph splits = 3
.
.
For layer:
.
.
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CUDA0 model buffer size = 5930.04 MiB
load_tensors: CUDA1 model buffer size = 5890.29 MiB
load_tensors: CPU_Mapped model buffer size = 340.00 MiB
.
.
llama_context: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA0 compute buffer size = 364.01 MiB
llama_context: CUDA1 compute buffer size = 364.02 MiB
llama_context: CUDA_Host compute buffer size = 42.02 MiB
llama_context: graph nodes = 1366
llama_context: graph splits = 3