
Commit f14bcee

docs: update multi-modal docs (#30880)
Co-authored-by: Sydney Runkle <54324534+sydney-runkle@users.noreply.github.com>
1 parent 98c357b commit f14bcee

4 files changed: +701 -166 lines changed

docs/docs/concepts/multimodality.mdx (+65 -15)
@@ -15,7 +15,10 @@
 * [Messages](/docs/concepts/messages)
 :::
 
-Multimodal support is still relatively new and less common, model providers have not yet standardized on the "best" way to define the API. As such, LangChain's multimodal abstractions are lightweight and flexible, designed to accommodate different model providers' APIs and interaction patterns, but are **not** standardized across models.
+LangChain supports multimodal data as input to chat models:
+
+1. Following provider-specific formats
+2. Adhering to a cross-provider standard (see [how-to guides](/docs/how_to/#multimodal) for detail)
 
 ### How to use multimodal models
 

@@ -26,38 +29,85 @@ Multimodal support is still relatively new and less common, model providers have
 
 #### Inputs
 
-Some models can accept multimodal inputs, such as images, audio, video, or files. The types of multimodal inputs supported depend on the model provider. For instance, [Google's Gemini](/docs/integrations/chat/google_generative_ai/) supports documents like PDFs as inputs.
+Some models can accept multimodal inputs, such as images, audio, video, or files.
+The types of multimodal inputs supported depend on the model provider. For instance,
+[OpenAI](/docs/integrations/chat/openai/),
+[Anthropic](/docs/integrations/chat/anthropic/), and
+[Google Gemini](/docs/integrations/chat/google_generative_ai/)
+support documents like PDFs as inputs.
+
+The gist of passing multimodal inputs to a chat model is to use content blocks that
+specify a type and corresponding data. For example, to pass an image to a chat model
+as URL:
 
-Most chat models that support **multimodal inputs** also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini which support video and other bytes input, the APIs also support the native, model-specific representations.
+```python
+from langchain_core.messages import HumanMessage
+
+message = HumanMessage(
+    content=[
+        {"type": "text", "text": "Describe the weather in this image:"},
+        {
+            "type": "image",
+            "source_type": "url",
+            "url": "https://...",
+        },
+    ],
+)
+response = model.invoke([message])
+```
 
-The gist of passing multimodal inputs to a chat model is to use content blocks that specify a type and corresponding data. For example, to pass an image to a chat model:
+We can also pass the image as in-line data:
 
 ```python
 from langchain_core.messages import HumanMessage
 
 message = HumanMessage(
     content=[
-        {"type": "text", "text": "describe the weather in this image"},
-        {"type": "image_url", "image_url": {"url": image_url}},
+        {"type": "text", "text": "Describe the weather in this image:"},
+        {
+            "type": "image",
+            "source_type": "base64",
+            "data": "<base64 string>",
+            "mime_type": "image/jpeg",
+        },
     ],
 )
 response = model.invoke([message])
 ```
 
-:::caution
-The exact format of the content blocks may vary depending on the model provider. Please refer to the chat model's
-integration documentation for the correct format. Find the integration in the [chat model integration table](/docs/integrations/chat/).
-:::
+To pass a PDF file as in-line data (or URL, as supported by providers such as
+Anthropic), just change `"type"` to `"file"` and `"mime_type"` to `"application/pdf"`.
 
-#### Outputs
+See the [how-to guides](/docs/how_to/#multimodal) for more detail.
 
-Virtually no popular chat models support multimodal outputs at the time of writing (October 2024).
+Most chat models that support multimodal **image** inputs also accept those values in
+OpenAI's [Chat Completions format](https://platform.openai.com/docs/guides/images?api-mode=chat):
 
-The only exception is OpenAI's chat model ([gpt-4o-audio-preview](/docs/integrations/chat/openai/)), which can generate audio outputs.
+```python
+from langchain_core.messages import HumanMessage
+
+message = HumanMessage(
+    content=[
+        {"type": "text", "text": "Describe the weather in this image:"},
+        {"type": "image_url", "image_url": {"url": image_url}},
+    ],
+)
+response = model.invoke([message])
+```
+
+Otherwise, chat models will typically accept the native, provider-specific content
+block format. See [chat model integrations](/docs/integrations/chat/) for detail
+on specific providers.
+
+
+#### Outputs
 
-Multimodal outputs will appear as part of the [AIMessage](/docs/concepts/messages/#aimessage) response object.
+Some chat models support multimodal outputs, such as images and audio. Multimodal
+outputs will appear as part of the [AIMessage](/docs/concepts/messages/#aimessage)
+response object. See for example:
 
-Please see the [ChatOpenAI](/docs/integrations/chat/openai/) for more information on how to use multimodal outputs.
+- Generating [audio outputs](/docs/integrations/chat/openai/#audio-generation-preview) with OpenAI;
+- Generating [image outputs](/docs/integrations/chat/google_generative_ai/#image-generation) with Google Gemini.
 
 #### Tools
 
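For illustration, a minimal sketch of the PDF pattern the updated docs describe above, using the same cross-provider content-block format; the file path, prompt, and `model` object are placeholders rather than part of the diff:

```python
import base64

from langchain_core.messages import HumanMessage

# Read a local PDF and base64-encode it (path is a placeholder).
with open("example.pdf", "rb") as f:
    pdf_data = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(
    content=[
        {"type": "text", "text": "Summarize this document:"},
        {
            "type": "file",  # "file" instead of "image"
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
        },
    ],
)
response = model.invoke([message])  # any chat model that accepts PDF inputs
```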
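For comparison, a sketch of what a native, provider-specific content block can look like, here assuming Anthropic's Messages API image format (the base64 payload and `model` are placeholders):

```python
from langchain_core.messages import HumanMessage

# Anthropic's native format nests the payload under a "source" key
# instead of the flat "source_type"/"data" fields of the standard format.
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": "<base64 string>",
            },
        },
    ],
)
response = model.invoke([message])  # e.g. a ChatAnthropic instance
```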
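On the outputs side, a hedged sketch of generating and saving audio with OpenAI's `gpt-4o-audio-preview`, following the audio-generation page linked above; the parameter names come from OpenAI's preview API and may change:

```python
import base64

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-audio-preview",
    model_kwargs={
        "modalities": ["text", "audio"],
        "audio": {"voice": "alloy", "format": "wav"},
    },
)

output_message = llm.invoke("Please tell me a joke.")

# The audio comes back on the AIMessage's additional_kwargs as base64 data.
audio_bytes = base64.b64decode(output_message.additional_kwargs["audio"]["data"])
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```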

docs/docs/how_to/index.mdx (+2)
@@ -50,6 +50,7 @@ See [supported integrations](/docs/integrations/chat/) for details on getting st
 - [How to: force a specific tool call](/docs/how_to/tool_choice)
 - [How to: work with local models](/docs/how_to/local_llms)
 - [How to: init any model in one line](/docs/how_to/chat_models_universal_init/)
+- [How to: pass multimodal data directly to models](/docs/how_to/multimodal_inputs/)
 
 ### Messages
 
@@ -67,6 +68,7 @@ See [supported integrations](/docs/integrations/chat/) for details on getting st
 - [How to: use few shot examples in chat models](/docs/how_to/few_shot_examples_chat/)
 - [How to: partially format prompt templates](/docs/how_to/prompts_partial)
 - [How to: compose prompts together](/docs/how_to/prompts_composition)
+- [How to: use multimodal prompts](/docs/how_to/multimodal_prompts/)
 
 ### Example selectors
 