docs/docs/concepts/multimodality.mdx (+65 −15)
* [Messages](/docs/concepts/messages)
:::

LangChain supports multimodal data as input to chat models:

1. Following provider-specific formats
2. Adhering to a cross-provider standard (see [how-to guides](/docs/how_to/#multimodal) for details)
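
As a minimal sketch of the difference, assuming a multimodal chat model such as `ChatOpenAI` (the model name and image URL below are placeholders), the same image can be passed in either form:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")  # placeholder model; any multimodal chat model works
image_url = "https://example.com/weather.jpg"  # placeholder URL

# 1. Provider-specific format: OpenAI-style "image_url" content block
provider_specific = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ]
)

# 2. Cross-provider standard format: LangChain content block with a source_type
standard = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image", "source_type": "url", "url": image_url},
    ]
)

# Either message can be invoked; the standard block assumes a recent langchain-openai.
response = model.invoke([standard])
```
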
### How to use multimodal models
#### Inputs

Some models can accept multimodal inputs, such as images, audio, video, or files.
The types of multimodal inputs supported depend on the model provider. For instance,
[OpenAI](/docs/integrations/chat/openai/),
[Anthropic](/docs/integrations/chat/anthropic/), and
[Google Gemini](/docs/integrations/chat/google_generative_ai/) support documents like PDFs as inputs.

The gist of passing multimodal inputs to a chat model is to use content blocks that
specify a type and corresponding data. For example, to pass an image to a chat model
as a URL:

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source_type": "url",
            "url": "https://...",
        },
    ],
)
response = model.invoke([message])
```

We can also pass the image as in-line data:

```python
from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe the weather in this image:"},
        # assumed fields for in-line (base64) data; image_data is produced in the snippet below
        {"type": "image", "source_type": "base64", "mime_type": "image/jpeg", "data": image_data},
    ],
)
response = model.invoke([message])
```