add multimodal insert processing #1424
Conversation
@LarFii Take a look at this PR. I am aware that you are also working on multimodal topics.
@drahnreb Thank you so much for sharing! In fact, we’re also actively exploring multimodal processing. This setup allows us to uniformly process any type of content, including image-based elements, and build connections between these modal elements and the corresponding nodes in the original graph. You're very welcome to take a look at the multimodalprocessor branch and share your thoughts! Here's a quick example of how it's used:

```python
rag = await initialize_rag()

processor = MultiModalProcessor(
    modal_caption_func=modal_caption_func,
    text_chunks_db=rag.text_chunks,
    chunks_vdb=rag.chunks_vdb,
    entities_vdb=rag.entities_vdb,
    relationships_vdb=rag.relationships_vdb,
    knowledge_graph_inst=rag.chunk_entity_relation_graph,
    embedding_func=rag.embedding_func,
    llm_model_func=rag.llm_model_func,
    global_config=asdict(rag),
    hashing_kv=rag.llm_response_cache,
)

modal_content = xxxx
content_type = "markdown_table"

enhanced_caption, entity_info = await processor.process_multimodal_content(
    modal_content,
    content_type,
    entity_name="LightRAG Experiment Results Table",
    top_k=5,
    better_than_threshold=0.7,
)
```
This is great @LarFii, I have seen the branch before, but thanks for sharing the newest concepts! I guess one could run … And if we assume we would also integrate the … I could help with preparing a couple of things on that side:
Do you already have an interface for the …? Let me know if support is needed; if you have a roadmap, please kindly share it so we can coordinate a bit.
Currently, our plan is to offer the special multimodal processing as an optional feature, so users can choose whether they need it, as splitting the document by modality requires certain hardware capabilities. The structure of …
Description
Create caption embeddings (textual, strictly not multimodal embeddings) and extract entities from images in `<img>` tags if present in markdown text during insert. This multimodality allows for retrieval of content in the image, which is currently ignored. This would essentially build a powerful semi-multimodal knowledge graph that allows for retrieval based on image content.
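For illustration, here is a minimal sketch of that flow, assuming a `multimodal_llm(prompt, image_path=...)` captioning callable and an `embed(text)` embedding function; the helper names (`find_image_refs`, `caption_and_embed`) are placeholders for illustration, not the actual functions in this PR:

```python
import re

# Regex for the src attribute of <img> tags embedded in markdown text.
IMG_TAG_RE = re.compile(r'<img[^>]*src="([^"]+)"[^>]*>', re.IGNORECASE)

def find_image_refs(markdown: str) -> list[str]:
    """Return the src paths of all <img> tags found in the markdown."""
    return IMG_TAG_RE.findall(markdown)

async def caption_and_embed(markdown: str, multimodal_llm, embed):
    """For every referenced image, ask the multimodal model for a textual caption
    and embed that caption (textual embeddings only, not image embeddings)."""
    results = []
    for src in find_image_refs(markdown):
        caption = await multimodal_llm(
            f"Describe the content of the image at {src} in detail.", image_path=src
        )
        vector = await embed(caption)
        results.append({"src": src, "caption": caption, "embedding": vector})
    return results
```

The point is that only the caption text enters the vector store, so an ordinary text embedding model suffices.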
Related Issues
#1418
Changes Made
- `extract_images_from_content()` util to extract images from content if `<img>` tags are present.
- `global_config["multimodal_llm_model_func"]` optional param that is a multimodal model that supports image captioning and triggers both a) text-embedding the caption (instead of the image itself) and b) KG extraction from the image/caption.
- `process_document` to embed images: possibly linked to [Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function #1379; we need to be careful not to split `chunks_vdb.upsert(chunks)`.
- `_process_single_content()` to add `use_multimodal_llm_func` to extract `maybe_nodes`, `maybe_edges` from the image and/or image captions (see the sketch after this list).
- `use_llm_func` to allow for better reuse of `use_llm_func_with_cache` for `use_multimodal_llm_func`.
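As a rough illustration of the extraction step, here is a hedged sketch, assuming a per-chunk pass where `multimodal_llm_func` captions each image and `extract_entities_from_caption` stands in for the existing entity/relation extraction path; both names are hypothetical placeholders, not code from this PR:

```python
import re
from collections import defaultdict

IMG_SRC_RE = re.compile(r'<img[^>]*src="([^"]+)"', re.IGNORECASE)

async def extract_from_images(chunk_text, chunk_key, multimodal_llm_func, extract_entities_from_caption):
    """Extra extraction pass per chunk: caption every embedded image, then run the
    usual entity/relation extraction over the caption and collect results in the
    same maybe_nodes / maybe_edges shape used by the text pipeline."""
    maybe_nodes, maybe_edges = defaultdict(list), defaultdict(list)
    for src in IMG_SRC_RE.findall(chunk_text):
        # Caption the image with the multimodal model; the caption is what gets processed.
        caption = await multimodal_llm_func(
            "Write a dense, factual caption for this image.", image_path=src
        )
        # Run entity/relation extraction over the caption text only.
        nodes, edges = await extract_entities_from_caption(caption, source_id=chunk_key)
        for name, node in nodes.items():
            maybe_nodes[name].append(node)
        for pair, edge in edges.items():
            maybe_edges[pair].append(edge)
    return dict(maybe_nodes), dict(maybe_edges)
```

Merging the results back into the chunk's existing `maybe_nodes` / `maybe_edges` would then let image-derived entities link to the same graph nodes as the surrounding text.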
Checklist
Additional Notes
@danielaskdd it is a draft, but please comment if I missed something. Advice is especially needed on correct chunking: do we treat an image as its own chunk, or associate it with the text before and after it?