
add multimodal insert processing #1424


Draft
drahnreb wants to merge 2 commits into main

Conversation

drahnreb
Contributor

@drahnreb drahnreb commented Apr 21, 2025

Description

Create caption embeddings (textual, strictly not multimodal embeddings) and extract entities from images in \<img\> tags if they are present in markdown text during insert. This multimodality allows retrieval of the content in the image, which is currently ignored.

This would essentially build a powerful semi-multimodal knowledge graph that allows for retrieval based on image content.

Related Issues

#1418

Changes Made

  • add a new extract_images_from_content() util to extract images from content if \<img\> tags are present (see the sketch after this list).
  • add a new optional global_config["multimodal_llm_model_func"] param: a multimodal model that supports image captioning and triggers both a) text-embedding the caption (instead of the image itself) and b) KG extraction from the image/caption.
  • add a new prompt template for image captioning.
  • adapt process_document to embed images (possibly linked to #1379, [Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function): we need to take care not to split tags, prompt the LLM to produce captions for the extracted images, and add those captions to chunks so they are text-embedded by feeding them into chunks_vdb.upsert(chunks).
  • adapt _process_single_content() to add use_multimodal_llm_func, extracting maybe_nodes and maybe_edges from images and/or image captions.
  • add kwargs handling to use_llm_func to allow better reuse of use_llm_func_with_cache for use_multimodal_llm_func.
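A minimal sketch of the intended flow, to make the list above concrete. extract_images_from_content() and global_config["multimodal_llm_model_func"] are the names proposed in this PR; the regex, the caption_images() helper, and the multimodal call signature are assumptions for illustration only:

import re

# Matches markdown images ![alt](src) and HTML <img ...> tags so that image
# references can be pulled out of the content without cutting a tag in half.
_IMG_PATTERN = re.compile(
    r"!\[(?P<alt>[^\]]*)\]\((?P<md_src>[^)\s]+)[^)]*\)"              # markdown image
    r"|<img\b[^>]*\bsrc=['\"](?P<html_src>[^'\"]+)['\"][^>]*>",      # HTML <img> tag
    re.IGNORECASE,
)

def extract_images_from_content(content: str) -> list[dict]:
    """Return all image references found in markdown/HTML content."""
    return [
        {
            "src": m.group("md_src") or m.group("html_src"),
            "alt": m.group("alt") or "",
            "span": m.span(),  # position, so chunking can avoid splitting the tag
        }
        for m in _IMG_PATTERN.finditer(content)
    ]

async def caption_images(content: str, global_config: dict) -> list[str]:
    """Caption extracted images with the optional multimodal model, if configured."""
    multimodal_llm = global_config.get("multimodal_llm_model_func")
    if multimodal_llm is None:
        return []  # feature is opt-in; without the model func, images stay ignored
    captions = []
    for image in extract_images_from_content(content):
        # the call signature is an assumption; the prompt would come from the
        # image-captioning template proposed in this PR
        caption = await multimodal_llm(
            "Describe this image so its content can be retrieved later.",
            image=image["src"],
        )
        captions.append(caption)
    return captions

The captions produced this way would then be appended to the corresponding chunks before chunks_vdb.upsert(chunks), so the text embedding also covers image content.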

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

@danielaskdd it is a draft, but please comment if I missed something. Advice is especially needed on correct chunking: do we treat an image as its own chunk, or associate it with the text before and after it?

@danielaskdd
Collaborator

@LarFii Take a look at this PR. I am aware that you are also working on multimodal topics.

@LarFii
Collaborator

LarFii commented Apr 21, 2025

@drahnreb Thank you so much for sharing!

In fact, we’re also actively exploring multimodal processing. Under the multimodalprocessor branch, we’ve developed an initial approach. Our idea is to leverage tools like MinerU to split documents into different modalities (e.g., text, tables, formulas, images, etc.). Text content is handled using the standard LightRAG pipeline, while other modalities are processed through the logic in the multimodalprocessor branch.

This setup allows us to uniformly process any type of content, including image-based elements, and build connections between these modal elements and the corresponding nodes in the original graph.

You're very welcome to take a look at the multimodalprocessor branch and share your thoughts! Here's a quick example of how it's used:

rag = await initialize_rag()

processor = MultiModalProcessor(
    modal_caption_func=modal_caption_func,
    text_chunks_db=rag.text_chunks,
    chunks_vdb=rag.chunks_vdb,
    entities_vdb=rag.entities_vdb,
    relationships_vdb=rag.relationships_vdb,
    knowledge_graph_inst=rag.chunk_entity_relation_graph,
    embedding_func=rag.embedding_func,
    llm_model_func=rag.llm_model_func,
    global_config=asdict(rag),
    hashing_kv=rag.llm_response_cache,
)

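# xxxx is a placeholder for the raw content of the modal element
# (here, a markdown table string)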
modal_content = xxxx 
content_type = "markdown_table"

enhanced_caption, entity_info = await processor.process_multimodal_content(
    modal_content,
    content_type,
    entity_name="LightRAG Experiment Results Table",
    top_k=5, 
    better_than_threshold=0.7,
)

@drahnreb
Contributor Author

This is great @LarFii

I had seen the branch before, but thanks for sharing the latest concepts!
I like the two-stage approach for KG extraction.

I guess one could run the MultiModalProcessor independently for e.g. video or audio data, as you demonstrated.

We could also integrate the processor into .ainsert() to enable mixed modal data. If the user wants, this could even happen automatically: handling a single modality on its own is straightforward, but a mix gets more complicated, and also more powerful. A mix could be supported via e.g. markdown with formatted text, images, diagrams (like mermaid), code, formulas, and tables.

I could help with preparing a couple of things on that side:

  • prepare more advanced "modal-aware" chunking (see the sketch below)
  • identify the modal type (entity types) per chunk
  • control behavior (prompts should be user-facing...)
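To make the first point concrete, here is a rough, purely illustrative sketch (modal_aware_split and the regex are hypothetical, not existing code): split on modal boundaries first and treat each \<img\> tag or fenced code block as an atomic segment, so tags are never cut in half.

import re

# Segments we never want to split through: HTML <img> tags and fenced code
# blocks. Tables, formulas, mermaid diagrams, etc. could be added the same way.
_ATOMIC_PATTERN = re.compile(r"(<img\b[^>]*>|(?s:```.*?```))", re.IGNORECASE)

def modal_aware_split(content: str, max_chars: int = 1200) -> list[dict]:
    """Split content so that atomic modal elements stay in one piece."""
    chunks = []
    for segment in _ATOMIC_PATTERN.split(content):
        if not segment.strip():
            continue
        if _ATOMIC_PATTERN.fullmatch(segment):
            # image or code block: keep it whole and tag its modality
            modality = "image" if segment.lstrip().lower().startswith("<img") else "code"
            chunks.append({"content": segment, "modality": modality})
        else:
            # plain text: naive size-based splitting here; a real implementation
            # would reuse LightRAG's token-based chunking with overlap
            for i in range(0, len(segment), max_chars):
                chunks.append({"content": segment[i:i + max_chars], "modality": "text"})
    return chunks

Each chunk carries a modality label that could later drive the per-modality prompts and entity types mentioned above.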

Do you already have an interface for the modal_llm_func?
Could it share the caching of use_llm_func_with_cache, or does it need adaptation for multimodality?

Let me know if support is needed; if you have a roadmap, please share it so we can coordinate.

@LarFii
Collaborator

LarFii commented Apr 22, 2025

Currently, our plan is to offer the special multimodal processing as an optional feature, so users can choose whether they need it, since splitting the document by modality requires certain hardware capabilities. The structure of modal_llm_func has not been finalized yet. Thank you very much for offering your help!
