
Commit fe2da09

feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) (#1147)
* Test dummy image tags in chat templates
* Format and improve types for llava_cpp.py
* Add from_pretrained support to llava chat format
* Refactor llava chat format to use a jinja2 template
* Revert chat format test
* Add moondream support (wip)
* Update moondream chat format
* Update moondream chat format
* Update moondream prompt
* Add function calling support
* Cache last image embed
* Add Llava1.6 support
* Add nanollava support
* Add obsidian support
* Remove unnecessary import
* Re-order multimodal chat formats
* Logits all no longer required for multi-modal models
* Update README.md
* Update docs
* Update README
* Fix typo
* Update README
* Fix typo
1 parent 97fb860 commit fe2da09

5 files changed: +711 -145 lines changed


README.md

Lines changed: 36 additions & 5 deletions
@@ -490,14 +490,15 @@ Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is requi

 ### Multi-modal Models

-`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
-read information from both text and images.
+`llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

 You'll first need to download one of the available multi-modal models in GGUF format:

 - [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
 - [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
 - [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
+- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
+- [moondream2](https://huggingface.co/vikhyatk/moondream2)

 Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.

@@ -509,22 +510,52 @@ Then you'll need to use a custom chat handler to load the clip model and process
     model_path="./path/to/llava/llama-model.gguf",
     chat_handler=chat_handler,
     n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
-    logits_all=True, # needed to make llava work
 )
 >>> llm.create_chat_completion(
     messages = [
         {"role": "system", "content": "You are an assistant who perfectly describes images."},
         {
             "role": "user",
             "content": [
-                {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
-                {"type": "text", "text": "Describe this image in detail please."}
+                {"type": "text", "text": "What's in this image?"},
+                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
             ]
         }
     ]
 )
 ```

+You can also pull the model from the Hugging Face Hub using the `from_pretrained` method.
+
+```python
+>>> from llama_cpp import Llama
+>>> from llama_cpp.llama_chat_format import MoondreamChatHandler
+>>> chat_handler = MoondreamChatHandler.from_pretrained(
+    repo_id="vikhyatk/moondream2",
+    filename="*mmproj*",
+)
+>>> llm = Llama.from_pretrained(
+    repo_id="vikhyatk/moondream2",
+    filename="*text-model*",
+    chat_handler=chat_handler,
+    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
+)
+>>> llm.create_chat_completion(
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What's in this image?"},
+                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
+            ]
+        }
+    ]
+)
+```
+
+**Note**: Multi-modal models also support tool calling and JSON mode.
+
 <details>
 <summary>Loading a Local Image</summary>

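The note added at the end of this hunk mentions tool calling and JSON mode but the diff itself carries no example. The sketch below is a hedged illustration of JSON mode with a multi-modal chat handler: the model and CLIP-projector paths are placeholders, and the OpenAI-style `response_format` argument of `create_chat_completion` is assumed; none of this is part of the commit's diff.

```python
# Hedged sketch (not part of this commit's diff): JSON mode with a
# multi-modal chat handler. Model and projector paths are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # leave room for the image embedding
)

# Constrain the reply to valid JSON via the OpenAI-style response_format argument.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You describe images and answer only in JSON."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List the main objects in this image."},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ],
    response_format={"type": "json_object"},
)
print(response["choices"][0]["message"]["content"])
```

Tool calling follows the same pattern, passing `tools` and `tool_choice` to `create_chat_completion`.
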
docs/server.md

Lines changed: 2 additions & 0 deletions
@@ -98,6 +98,8 @@ You'll first need to download one of the available multi-modal models in GGUF fo
 - [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
 - [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
 - [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
+- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
+- [moondream2](https://huggingface.co/vikhyatk/moondream2)

 Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` chat_format

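Once a server has been started with the clip model path and the `llava-1-5` chat format described above, clients reach it through the OpenAI-compatible API. The snippet below is a hedged sketch, assuming the server's default `http://localhost:8000/v1` endpoint and the third-party `openai` client package; neither appears in this diff.

```python
# Hedged sketch: querying a llama-cpp-python server started with a clip model
# and the llava-1-5 chat format, through its OpenAI-compatible endpoint.
# Assumes the default host/port; the model alias is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="llava",  # served locally, so the name is only a label here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```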