@@ -490,14 +490,15 @@ Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is requi
### Multi-modal Models
- `llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
- read information from both text and images.
+ `llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.
You'll first need to download one of the available multi-modal models in GGUF format:
- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
+ - [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
+ - [moondream2](https://huggingface.co/vikhyatk/moondream2)
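If you prefer to fetch the files yourself rather than use the `from_pretrained` helper shown further below, one option is `huggingface_hub.hf_hub_download`. This is only a sketch: the repository and filenames below are illustrative, so check each repository's file listing for the exact GGUF and mmproj names.

```python
from huggingface_hub import hf_hub_download

# Illustrative filenames -- confirm the exact names in the repo's file listing.
model_path = hf_hub_download(
    repo_id="mys/ggml_llava-v1.5-7b",
    filename="ggml-model-q4_k.gguf",
)
clip_path = hf_hub_download(
    repo_id="mys/ggml_llava-v1.5-7b",
    filename="mmproj-model-f16.gguf",
)
print(model_path, clip_path)
```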
Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.
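For reference, a minimal sketch of the local-file flow looks roughly like this; the paths are placeholders, and `Llava15ChatHandler` is the handler that pairs with the llava-1.5 models listed above.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths -- point these at the GGUF files downloaded above.
chat_handler = Llava15ChatHandler(clip_model_path="./path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # larger context to leave room for the image embedding
)
```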
@@ -509,7 +510,6 @@ Then you'll need to use a custom chat handler to load the clip model and process
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
-   logits_all=True, # needed to make llava work
)
>>> llm.create_chat_completion(
    messages=[
@@ -525,6 +525,37 @@ Then you'll need to use a custom chat handler to load the clip model and process
)
```
+ You can also pull the model from the Hugging Face Hub using the `from_pretrained` method.
+
+ ```python
+ >>> from llama_cpp import Llama
+ >>> from llama_cpp.llama_chat_format import MoondreamChatHandler
+ >>> chat_handler = MoondreamChatHandler.from_pretrained(
+   repo_id="vikhyatk/moondream2",
+   filename="*mmproj*",
+ )
+ >>> llm = Llama.from_pretrained(
+   repo_id="vikhyatk/moondream2",
+   filename="*text-model*",
+   chat_handler=chat_handler,
+   n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
+ )
+ >>> llm.create_chat_completion(
+     messages=[
+         {"role": "system", "content": "You are an assistant who perfectly describes images."},
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
+                 {"type": "text", "text": "Describe this image in detail please."}
+             ]
+         }
+     ]
+ )
+ ```
+
+ **Note**: Multi-modal models also support tool calling and JSON mode.
+
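As a rough illustration of that note, JSON mode can be combined with an image prompt by passing `response_format` to `create_chat_completion`. This is only a sketch: it assumes the `llm` constructed above, and the schema and field names are made up for the example.

```python
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You extract structured data from images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
                {"type": "text", "text": "List the objects you can see."},
            ],
        },
    ],
    # Constrain the output to JSON matching an illustrative schema.
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"objects": {"type": "array", "items": {"type": "string"}}},
            "required": ["objects"],
        },
    },
)
print(response["choices"][0]["message"]["content"])
```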
<details>
<summary>Loading a Local Image</summary>