Reflective LLaVA (ReflectiVA)

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

(CVPR 2025)




This repository contains the reference code for the paper Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering.

🎯 Project web page | Paper | 🤗 HuggingFace Model | 🤗 HuggingFace Dataset

Table of Contents

  1. Citation
  2. Overview
  3. Installation
  4. Model
  5. Dataset
  6. Knowledge Bases and Reproducibility
  7. Inference
  8. Acknowledgements

Citation

Please cite this work with the following BibTeX:

@inproceedings{cocchi2024augmenting,
  title={{Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering}},
  author={Cocchi, Federico and Moratelli, Nicholas and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Overview

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and to predict the relevance of information retrieved from an external database, ultimately enabling the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
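As a rough illustration of this decision flow (not the authors' implementation), the sketch below shows how reflective tokens could gate retrieval and filter retrieved passages at inference time. The token strings, the mllm and retriever objects, and their methods are all hypothetical placeholders.

# Illustrative sketch of the reflective-token decision flow described above.
# Token names, `mllm`, `retriever`, and their methods are hypothetical placeholders.

RETRIEVAL_TOKENS = ("<RET>", "<NORET>")   # "retrieval needed" vs. "not needed" (assumed names)
RELEVANCE_TOKENS = ("<REL>", "<NOREL>")   # "passage relevant" vs. "not relevant" (assumed names)

def answer(question, image, mllm, retriever, k=5):
    # 1) Predict whether external knowledge is required for this sample.
    need = mllm.predict_token(image, question, choices=RETRIEVAL_TOKENS)
    if need == "<NORET>":
        # Standard MLLM behaviour: answer directly, without retrieval.
        return mllm.generate(image, question)

    # 2) Retrieve candidate passages from the external knowledge base and
    #    keep only those the model marks as relevant.
    passages = retriever.search(image, question, k=k)
    relevant = [p for p in passages
                if mllm.predict_token(image, question, context=p,
                                      choices=RELEVANCE_TOKENS) == "<REL>"]

    # 3) Generate the final answer conditioned on the relevant passages.
    return mllm.generate(image, question, context=relevant)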

Installation

To create the conda environment named reflectiva, use the following commands. The environment provides all the packages needed to run the code in this repository.

conda create -n reflectiva python==3.8.16
conda activate reflectiva
pip install -r requirements.txt

Model

You can access the official ReflectiVA model weights on 🤗 Hugging Face.
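For instance, the checkpoint can be fetched programmatically with huggingface_hub; the repository id below is an assumption, so double-check it against the Hugging Face link above.

# Minimal sketch: download the released checkpoint with huggingface_hub.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(repo_id="aimagelab/ReflectiVA")  # assumed repo id
print("Checkpoint files downloaded to:", weights_dir)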

Dataset

The official training dataset can be accessed on 🤗 Hugging Face.

Data Infoseek

You can use this link to download the evaluation data for Infoseek.

Data Encyclopedic-VQA

You can find the evaluation data for Encyclopedic-VQA at this link. Additionally, the images used for evaluation can be extracted from this zip file.

Knowledge Bases and Reproducibility

Our work relies on two main knowledge bases. To improve the reproducibility of our approach, we provide access to both knowledge bases and to the FAISS indexes built on them for the best configuration presented in the paper. Specifically, the embeddings are generated with the EVA-CLIP model.

For Infoseek, you can find the index and json file inside this zip file. Similarly, the index and json file for Encyclopedic-VQA are available here.

Please refer to the paper for more information about the knowledge bases.
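As a minimal usage sketch (file names are assumptions about the zip contents, and a random vector stands in for the real EVA-CLIP query embedding), a released index can be loaded and queried as follows.

# Minimal sketch: load a released FAISS index and its JSON knowledge base,
# then run a top-k search. File names are assumed; in practice the query
# embedding should come from EVA-CLIP, the random vector only shows the call.
import json
import numpy as np
import faiss

index = faiss.read_index("infoseek/index.faiss")   # assumed file name inside the zip
with open("infoseek/kb.json") as f:                # assumed file name inside the zip
    kb = json.load(f)

query = np.random.rand(1, index.d).astype("float32")
faiss.normalize_L2(query)                          # assuming inner-product / cosine similarity
scores, ids = index.search(query, 5)               # top-5 entries of the knowledge base
print("ids:", ids[0], "scores:", scores[0])        # map ids back to entries in kb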

Inference

Before running inference, unzip the data and update the paths in the .sh files to match your local cluster setup and the files downloaded in the previous steps.

Inference code for Infoseek:

sbatch scripts/ReflectiVA_infoseek.sh

Inference code for Encyclopedic-VQA:

sbatch scripts/ReflectiVA_evqa.sh

Acknowledgements

We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support. This work has been conducted under a research grant co-funded by Altilia s.r.l., and supported by the PNRR-M4C2 project “FAIR - Future Artificial Intelligence Research”, funded by the European Commission, and by the PNRR project “Italian Strengthening of Esfri RI Resilience” (ITSERR) funded by the European Union - NextGenerationEU (CUP B53C22001770006).

We are thankful to LLaVA and lmms-eval for releasing their models and code as open-source contributions.

Finally, we would also like to thank Davide Caffagni and Sara Sarto for their valuable support and insights.
