
add multimodal insert processing #1424


Draft
drahnreb wants to merge 2 commits into main

Conversation

drahnreb
Contributor

@drahnreb drahnreb commented Apr 21, 2025

Description

Create caption embeddings (textual, strictly not multimodal embeddings) and extract entities from images in \<img\> tags if they are present in markdown text during insert. This multimodality allows retrieval of the content in the image, which is currently ignored.

This would essentially build a powerful semi-multimodal knowledge graph that allows for retrieval based on image content.

Related Issues

#1418

Changes Made

  • add a new extract_images_from_content() util to extract images from content if \<img\> tags are present (see the sketch after this list).
  • add a new optional global_config["multimodal_llm_model_func"] param: a multimodal model that supports image captioning and triggers both a) text-embedding the caption (instead of the image itself) and b) KG extraction from the image/caption.
  • add a new prompt template for image captioning.
  • adapt process_document to embed images (possibly linked to #1379, [Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function): we need to take care not to split tags, prompt the LLM to produce captions for the extracted images, and add those captions to chunks so they are text-embedded by feeding them into chunks_vdb.upsert(chunks).
  • adapt _process_single_content() to add use_multimodal_llm_func, extracting maybe_nodes and maybe_edges from images and/or image captions.
  • add kwargs handling to use_llm_func to allow better reuse of use_llm_func_with_cache for use_multimodal_llm_func.
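A minimal sketch of the intended flow, to make the list above concrete. extract_images_from_content() and global_config["multimodal_llm_model_func"] are the names proposed in this PR; the regex, the caption_images() helper, and the multimodal call signature are assumptions for illustration only:

import re

# Matches markdown images ![alt](src) and HTML <img ...> tags so that image
# references can be pulled out of the content without cutting a tag in half.
_IMG_PATTERN = re.compile(
    r"!\[(?P<alt>[^\]]*)\]\((?P<md_src>[^)\s]+)[^)]*\)"              # markdown image
    r"|<img\b[^>]*\bsrc=['\"](?P<html_src>[^'\"]+)['\"][^>]*>",      # HTML <img> tag
    re.IGNORECASE,
)

def extract_images_from_content(content: str) -> list[dict]:
    """Return all image references found in markdown/HTML content."""
    return [
        {
            "src": m.group("md_src") or m.group("html_src"),
            "alt": m.group("alt") or "",
            "span": m.span(),  # position, so chunking can avoid splitting the tag
        }
        for m in _IMG_PATTERN.finditer(content)
    ]

async def caption_images(content: str, global_config: dict) -> list[str]:
    """Caption extracted images with the optional multimodal model, if configured."""
    multimodal_llm = global_config.get("multimodal_llm_model_func")
    if multimodal_llm is None:
        return []  # feature is opt-in; without the model func, images stay ignored
    captions = []
    for image in extract_images_from_content(content):
        # the call signature is an assumption; the prompt would come from the
        # image-captioning template proposed in this PR
        caption = await multimodal_llm(
            "Describe this image so its content can be retrieved later.",
            image=image["src"],
        )
        captions.append(caption)
    return captions

The captions produced this way would then be appended to the corresponding chunks before chunks_vdb.upsert(chunks), so the text embedding also covers image content.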

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

@danielaskdd it is a draft, but please comment if I missed something. Advice is especially needed on correct chunking: do we treat an image as its own chunk, or associate it with the text before and after it?

@danielaskdd
Collaborator

@LarFii Take a look at this PR. I am aware that you are also working on multimodal topics.

@LarFii
Collaborator

LarFii commented Apr 21, 2025

@drahnreb Thank you so much for sharing!

In fact, we’re also actively exploring multimodal processing. Under the multimodalprocessor branch, we’ve developed an initial approach. Our idea is to leverage tools like MinerU to split documents into different modalities (e.g., text, tables, formulas, images, etc.). Text content is handled using the standard LightRAG pipeline, while other modalities are processed through the logic in the multimodalprocessor branch.

This setup allows us to uniformly process any type of content, including image-based elements, and build connections between these modal elements and the corresponding nodes in the original graph.

You're very welcome to take a look at the multimodalprocessor branch and share your thoughts! Here's a quick example of how it's used:

rag = await initialize_rag()

processor = MultiModalProcessor(
    modal_caption_func=modal_caption_func,
    text_chunks_db=rag.text_chunks,
    chunks_vdb=rag.chunks_vdb,
    entities_vdb=rag.entities_vdb,
    relationships_vdb=rag.relationships_vdb,
    knowledge_graph_inst=rag.chunk_entity_relation_graph,
    embedding_func=rag.embedding_func,
    llm_model_func=rag.llm_model_func,
    global_config=asdict(rag),
    hashing_kv=rag.llm_response_cache,
)

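# xxxx is a placeholder for the raw content of the modal element
# (here, a markdown table string)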
modal_content = xxxx 
content_type = "markdown_table"

enhanced_caption, entity_info = await processor.process_multimodal_content(
    modal_content,
    content_type,
    entity_name="LightRAG Experiment Results Table",
    top_k=5, 
    better_than_threshold=0.7,
)

@drahnreb
Contributor Author

This is great @LarFii

I had seen the branch before, but thanks for sharing the latest concepts!
I like the two-stage approach for KG extraction.

I guess one could run the MultiModalProcessor independently for e.g. video or audio data, as you demonstrated.

We could also integrate the processor into .ainsert() to enable mixed modal data. If the user wants, this could even happen automatically: handling a single modality on its own is straightforward, but a mix gets more complicated, and also more powerful. A mix could be supported via e.g. markdown with formatted text, images, diagrams (like mermaid), code, formulas, and tables.

I could help with preparing a couple of things on that side:

  • prepare more advanced "modal-aware" chunking (see the sketch below)
  • identify the modal type (entity types) per chunk
  • control behavior (prompts should be user-facing...)
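To make the first point concrete, here is a rough, purely illustrative sketch (modal_aware_split and the regex are hypothetical, not existing code): split on modal boundaries first and treat each \<img\> tag or fenced code block as an atomic segment, so tags are never cut in half.

import re

# Segments we never want to split through: HTML <img> tags and fenced code
# blocks. Tables, formulas, mermaid diagrams, etc. could be added the same way.
_ATOMIC_PATTERN = re.compile(r"(<img\b[^>]*>|(?s:```.*?```))", re.IGNORECASE)

def modal_aware_split(content: str, max_chars: int = 1200) -> list[dict]:
    """Split content so that atomic modal elements stay in one piece."""
    chunks = []
    for segment in _ATOMIC_PATTERN.split(content):
        if not segment.strip():
            continue
        if _ATOMIC_PATTERN.fullmatch(segment):
            # image or code block: keep it whole and tag its modality
            modality = "image" if segment.lstrip().lower().startswith("<img") else "code"
            chunks.append({"content": segment, "modality": modality})
        else:
            # plain text: naive size-based splitting here; a real implementation
            # would reuse LightRAG's token-based chunking with overlap
            for i in range(0, len(segment), max_chars):
                chunks.append({"content": segment[i:i + max_chars], "modality": "text"})
    return chunks

Each chunk carries a modality label that could later drive the per-modality prompts and entity types mentioned above.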

Do you already have an interface for the modal_llm_func?
Could it share the caching of use_llm_func_with_cache, or does it need adaptation for multimodality?

Let me know if support is needed; if you have a roadmap, please share it so we can coordinate.

@LarFii
Collaborator

LarFii commented Apr 22, 2025

Currently, our plan is to offer the special multimodal processing as an optional feature, so users can choose whether they need it, since splitting the document by modality requires certain hardware capabilities. The structure of modal_llm_func has not been finalized yet. Thank you very much for offering your help!
