--force-safetensors Set use_safetensors=True while loading the model. This prevents arbitrary code execution.
--no_use_fast Set use_fast=False while loading the tokenizer (it's True by default). Use this if you have any problems related to use_fast.
--use_flash_attention_2 Set use_flash_attention_2=True while loading the model.
+ --use_eager_attention Set attn_implementation="eager" while loading the model.

bitsandbytes 4-bit:
--load-in-4bit Load the model with 4-bit precision (using bitsandbytes).
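As a rough usage sketch of the Transformers-related flags above: the `server.py` entry point and the `--model` flag are assumed from the rest of the README, and the model name is a placeholder.

```
# Hypothetical invocation: load a placeholder Transformers model in 4-bit,
# force safetensors, and use the eager attention implementation.
python server.py --model MyOrg/my-model --load-in-4bit --force-safetensors --use_eager_attention
```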
@@ -263,7 +264,7 @@ bitsandbytes 4-bit:
llama.cpp:
--flash-attn Use flash-attention.
- --tensorcores Use llama-cpp-python compiled with tensor cores support. This increases performance on RTX cards. NVIDIA only.
+ --tensorcores NVIDIA only: use llama-cpp-python compiled with tensor cores support. This may increase performance on newer cards.
--n_ctx N_CTX Size of the prompt context.
--threads THREADS Number of threads to use.
--threads-batch THREADS_BATCH Number of threads to use for batches/prompt processing.
@@ -272,7 +273,7 @@ llama.cpp:
--no-mmap Prevent mmap from being used.
--mlock Force the system to keep the model in RAM.
--n-gpu-layers N_GPU_LAYERS Number of layers to offload to the GPU.
- --tensor_split TENSOR_SPLIT Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17.
+ --tensor_split TENSOR_SPLIT Split the model across multiple GPUs. Comma-separated list of proportions. Example: 60,40.
--numa Activate NUMA task allocation for llama.cpp.
--logits_all Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower.
--no_offload_kqv Do not offload the K, Q, V to the GPU. This saves VRAM but reduces performance.
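For illustration, the llama.cpp flags above might be combined as below; the `--loader` and `--model` flags are assumed from the rest of the README, and the GGUF filename and thread counts are placeholders.

```
# Hypothetical invocation: offload 33 layers to the GPU, split the model 60/40
# across two GPUs, and use an 8192-token context with 8 CPU threads.
python server.py --model my-model.Q4_K_M.gguf --loader llama.cpp \
  --n-gpu-layers 33 --tensor_split 60,40 --n_ctx 8192 --threads 8 --threads-batch 8
```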
@@ -287,6 +288,8 @@ ExLlamaV2:
--max_seq_len MAX_SEQ_LEN Maximum sequence length.
--cfg-cache ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader.
--no_flash_attn Force flash-attention to not be used.
+ --no_xformers Force xformers to not be used.
+ --no_sdpa Force Torch SDPA to not be used.
--cache_8bit Use 8-bit cache to save VRAM.
--cache_4bit Use Q4 cache to save VRAM.
--num_experts_per_token NUM_EXPERTS_PER_TOKEN Number of experts to use for generation. Applies to MoE models like Mixtral.
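A hedged sketch of how the ExLlamaV2 flags above could be combined; the `--loader` and `--model` flags are assumed from the rest of the README, and the model directory name is a placeholder.

```
# Hypothetical invocation: ExLlamav2_HF loader with a longer context, a Q4 cache,
# a CFG cache for negative prompts, and xformers disabled.
python server.py --model MyModel-4.0bpw-exl2 --loader ExLlamav2_HF \
  --max_seq_len 8192 --cache_4bit --cfg-cache --no_xformers
```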
@@ -307,6 +310,9 @@ AutoAWQ:
HQQ:
--hqq-backend HQQ_BACKEND Backend for the HQQ loader. Valid options: PYTORCH, PYTORCH_COMPILE, ATEN.
+ TensorRT-LLM:
+ --cpp-runner Use the ModelRunnerCpp runner, which is faster than the default ModelRunner but doesn't support streaming yet.

DeepSpeed:
--deepspeed Enable the use of DeepSpeed ZeRO-3 for inference via the Transformers integration.
--nvme-offload-dir NVME_OFFLOAD_DIR DeepSpeed: Directory to use for ZeRO-3 NVME offloading.
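The DeepSpeed flags are typically used through the `deepspeed` launcher rather than plain `python`; the sketch below assumes that launcher, and the offload directory and model name are placeholders.

```
# Hypothetical invocation: single-GPU ZeRO-3 inference with NVMe offloading
# to a placeholder directory.
deepspeed --num_gpus=1 server.py --deepspeed --nvme-offload-dir /mnt/nvme/zero3 --model MyOrg/my-model
```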
@@ -327,6 +333,7 @@ Gradio:
--gradio-auth-path GRADIO_AUTH_PATH Set the Gradio authentication file path. The file should contain one or more user:password pairs in the same format as above.
--ssl-keyfile SSL_KEYFILE The path to the SSL certificate key file.
--ssl-certfile SSL_CERTFILE The path to the SSL certificate cert file.
+ --subpath SUBPATH Customize the subpath for Gradio; use with a reverse proxy.

API:
--api Enable the API extension.
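A hedged example of the Gradio and API flags above: the `--listen` flag, all file paths, and the `/textgen` subpath are assumptions or placeholders, not values from this diff.

```
# Hypothetical invocation: listen on the network, serve under a reverse-proxy
# subpath with SSL and password authentication, and enable the API extension.
python server.py --listen --subpath /textgen \
  --ssl-keyfile /etc/ssl/private/webui.key --ssl-certfile /etc/ssl/certs/webui.crt \
  --gradio-auth-path users.txt --api
```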
@@ -392,18 +399,11 @@ Run `python download-model.py --help` to see all the options.
In August 2023, [Andreessen Horowitz](https://a16z.com/) (a16z) provided a generous grant to encourage and support my independent work on this project. I am **extremely** grateful for their trust and recognition.