This repository contains the reference code for the paper Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering.
🎯 Project web page | Paper | 🤗 HuggingFace Model | 🤗 HuggingFace Dataset
Please cite this work with the following BibTeX:
@inproceedings{cocchi2024augmenting,
  title={{Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering}},
  author={Cocchi, Federico and Moratelli, Nicholas and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention for their ability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), uses reflective tokens to dynamically determine the need for external knowledge and to predict the relevance of information retrieved from an external database, ultimately enabling the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
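The sketch below illustrates, at a high level, how reflective tokens can gate retrieval at inference time. It is an illustration only: the token names (`<RET>`, `<NORET>`, `<REL>`) and the helper methods are placeholders and do not correspond to the exact identifiers used in this codebase; see the inference scripts below for the actual pipeline.

```python
# Illustrative pseudocode only: token names and helper methods are placeholders,
# not the actual identifiers used in this repository.

def answer_with_reflective_tokens(model, retriever, image, question, top_k=5):
    # 1) Let the model emit a reflective token saying whether retrieval is needed.
    retrieval_token = model.predict_retrieval_token(image, question)  # e.g. "<RET>" / "<NORET>"
    if retrieval_token == "<NORET>":
        # Answer directly, preserving fluency on questions that need no external knowledge.
        return model.generate(image, question)

    # 2) Query the external knowledge base for candidate passages.
    passages = retriever.search(image, question, top_k=top_k)

    # 3) Use reflective tokens again to mark each retrieved passage as relevant or not.
    relevant = [p for p in passages
                if model.predict_relevance_token(image, question, p) == "<REL>"]

    # 4) Generate the final answer conditioned on the relevant passages.
    return model.generate(image, question, context=relevant or passages[:1])
```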
To create the conda environment named reflectiva, use the following instructions. This environment provides all the packages needed to run the code in this repo.
conda create -n reflectiva python==3.8.16
conda activate reflectiva
pip install -r requirements.txt
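Once the environment is active, a quick sanity check such as the one below can confirm that the core dependencies import correctly (the package names are assumptions based on typical LLaVA-style dependencies; adjust them to whatever requirements.txt pins):

```python
# Quick sanity check for the reflectiva environment.
# Package names are assumptions based on typical LLaVA-style dependencies.
import torch
import transformers

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"Transformers {transformers.__version__}")
```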
You can access the official ReflectiVA model weights on 🤗 Hugging Face.
The official training dataset can be accessed on 🤗 Hugging Face.
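If you prefer to download the weights and the dataset programmatically, a sketch with `huggingface_hub` could look like the following (the repository IDs are placeholders; use the ones linked above):

```python
# Sketch: download the model weights and training data from the Hugging Face Hub.
# The repo_id values are placeholders; replace them with the repositories linked above.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="<org>/ReflectiVA")                           # model weights
data_dir = snapshot_download(repo_id="<org>/ReflectiVA-data", repo_type="dataset")  # training data
print(model_dir, data_dir)
```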
You can use this link to download the evaluation data for Infoseek.
You can find the evaluation data for Encyclopedic-VQA at this link. Additionally, the images used for evaluation can be extracted from this zip file.
Our work relies on two main knowledge bases. To improve the reproducibility of our approach, we provide access to both knowledge bases and the FAISS indexes built on them for the best configuration presented in the paper. Specifically, the embeddings are generated with the EVA-CLIP model.
For Infoseek, you can find the index and json file inside this zip file. Similarly, the index and json file for Encyclopedic-VQA are available here.
Please refer to the paper for more information about the knowledge bases.
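For reference, a minimal sketch of loading and querying one of the provided FAISS indexes is shown below. The file names are placeholders for the files in the downloaded zips, and a real query embedding must come from the same EVA-CLIP model used to build the index:

```python
# Minimal sketch of querying a provided FAISS index.
# File names are placeholders; the query vector must be an EVA-CLIP embedding
# produced with the same model used to build the index.
import json
import faiss
import numpy as np

index = faiss.read_index("infoseek_kb.index")      # placeholder path from the downloaded zip
with open("infoseek_kb.json") as f:                # placeholder path for the KB entries
    kb_entries = json.load(f)

query = np.random.rand(1, index.d).astype("float32")  # stand-in for a real EVA-CLIP embedding
faiss.normalize_L2(query)                              # only if the index uses inner-product/cosine similarity
scores, ids = index.search(query, k=5)
print(scores, ids)                                     # map ids back to kb_entries to get the passages
```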
Before running inference, unzip the data and modify the paths in the .sh files to match your local cluster setup and the files downloaded in the previous steps.
Inference code for Infoseek:
sbatch scripts/ReflectiVA_infoseek.sh
Inference code for Encyclopedic-VQA:
sbatch scripts/ReflectiVA_evqa.sh
We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support. This work has been conducted under a research grant co-funded by Altilia s.r.l., and supported by the PNRR-M4C2 project “FAIR - Future Artificial Intelligence Research”, funded by the European Commission, and by the PNRR project “Italian Strengthening of Esfri RI Resilience” (ITSERR) funded by the European Union - NextGenerationEU (CUP B53C22001770006).
We are thankful to the authors of LLaVA and lmms-eval for releasing their models and code as open-source contributions.
Finally, we would also like to thank Davide Caffagni and Sara Sarto for their valuable support and insights.