Please cite with the following BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
- Create the Python environment.
```shell
conda create -n ret -y --no-default-packages python==3.10.16
conda activate ret
```
- Install PyTorch.
```shell
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
```
- Install faiss-gpu.
```shell
conda install -n ret -y -c conda-forge faiss-gpu==1.7.4
```
- Clone the repo and install other dependencies.
```shell
git clone https://github.com/aimagelab/ReT.git
cd ReT
pip install -r requirements.txt
```
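After installation, an optional sanity check (a minimal sketch, assuming the pinned versions above) can confirm that PyTorch sees the GPU and that the GPU build of FAISS is usable:

```python
# Optional sanity check for the "ret" environment (assumes the pinned versions above).
import torch
import faiss

print(torch.__version__)          # expected: 2.2.2+cu118
print(torch.cuda.is_available())  # True on a machine with a CUDA 11.8-compatible driver
print(faiss.get_num_gpus())       # >= 1 if faiss-gpu detects the GPU
```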
ReT model checkpoints are available on Hugging Face. You can use these checkpoints directly for retrieval tasks or fine-tune them to suit your specific retrieval needs.
Model | WIT Recall@10 | IGLUE Recall@1 | KVQA Recall@5 | OVEN Recall@5 | LLaVA Recall@1 | InfoSeek Recall@5 | InfoSeek Pseudo Recall@5 | EVQA Recall@5 | EVQA Pseudo Recall@5 | OKVQA Recall@5 | OKVQA Pseudo Recall@5 |
---|---|---|---|---|---|---|---|---|---|---|---|
ReT-CLIP-ViT-L-14🤗 | 0.734 | 0.818 | 0.635 | 0.820 | 0.799 | 0.470 | 0.605 | 0.445 | 0.579 | 0.202 | 0.662 |
ReT-OpenCLIP-ViT-H-14🤗 | 0.714 | 0.800 | 0.593 | 0.830 | 0.798 | 0.473 | 0.607 | 0.448 | 0.578 | 0.182 | 0.634 |
ReT-OpenCLIP-ViT-G-14🤗 | 0.751 | 0.822 | 0.606 | 0.840 | 0.792 | 0.520 | 0.625 | 0.486 | 0.602 | 0.190 | 0.638 |
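If you prefer to cache a checkpoint locally before loading it (optional; the usage example below loads checkpoints directly by their Hub ID), a minimal sketch with `huggingface_hub` is:

```python
# Optional: pre-download a checkpoint from the Hugging Face Hub.
# Any of the three repo IDs listed above can be used.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="aimagelab/ReT-CLIP-ViT-L-14")
print(local_dir)  # local snapshot directory (typically also accepted by from_pretrained)
```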
You can download the ReT-M2KR benchmark by following the instructions provided here.
This dataset is used for training and evaluating ReT in multimodal information retrieval and includes images (coming soon) and JSONL files.
The ReT-M2KR benchmark is an extended version of the M2KR dataset, with the following modifications:
- MSMARCO data is excluded, as it does not contain query images
- Passage images have been added to the OVEN, InfoSeek, E-VQA, and OKVQA datasets
For further details, please refer to the associated research paper.
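Once the JSONL files are downloaded, their records can be inspected with plain Python; the file name below is a placeholder for any of the benchmark splits:

```python
# Peek at the first record of a ReT-M2KR annotation file.
# "path/to/split.jsonl" is a placeholder; point it at any downloaded JSONL file.
import json

with open("path/to/split.jsonl", "r", encoding="utf-8") as f:
    first = json.loads(next(f))

print(sorted(first.keys()))  # list the fields available for this split
```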
The snippet below shows how to load a pretrained checkpoint and extract multi-vector features for a query and a passage:

```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()

q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])

# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''  # this passage has no image

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
```
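Both queries and passages are represented by 32 fine-grained vectors of size 128. As an illustration (not necessarily the exact scoring used in the released evaluation code), such multi-vector features can be compared with a ColBERT-style late-interaction score, where each query vector is matched to its most similar passage vector:

```python
import torch

def late_interaction_score(q_feats: torch.Tensor, p_feats: torch.Tensor) -> torch.Tensor:
    """Illustrative max-sim relevance between multi-vector representations.
    q_feats: (Bq, Nq, D) query features, p_feats: (Bp, Np, D) passage features."""
    # Token-to-token similarities for every query/passage pair: (Bq, Bp, Nq, Np)
    sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)
    # For each query vector keep its best-matching passage vector, then sum: (Bq, Bp)
    return sim.max(dim=-1).values.sum(dim=-1)

# e.g., scores = late_interaction_score(query_feats, passage_feats)
# where query_feats and passage_feats are tensors returned by get_ret_features above
```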
To evaluate ReT on the M2KR benchmark, we provide example SLURM scripts here. These scripts handle both the indexing and searching steps.
Make sure to set `JSONL_ROOT_PATH` and `IMAGE_ROOT_PATH` to the directories where the JSONL files and images have been downloaded, respectively.
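For example (placeholder paths; depending on the script, these may be exported as environment variables or set directly at the top of the SLURM files):

```shell
export JSONL_ROOT_PATH=/path/to/ReT-M2KR/jsonl
export IMAGE_ROOT_PATH=/path/to/ReT-M2KR/images
```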
If the inference script gets stuck while indexing, try clearing the PyTorch extensions cache and re-running:
```shell
rm -rf ~/.cache/torch_extensions
```
We thank the teams behind ColBERT, PreFLMR, and UniIR for open-sourcing their models, datasets, and code.