ReT: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval (CVPR 2025)

Paper · ReT-M2KR Dataset

(Figure: overview of the ReT architecture.)

Please cite with the following BibTeX:

@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Installation

  1. Create the Python environment.
conda create -n ret -y --no-default-packages python==3.10.16
conda activate ret
  2. Install PyTorch.
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
  3. Install faiss-gpu.
conda install -n ret -y -c conda-forge faiss-gpu==1.7.4
  4. Clone the repo and install the remaining dependencies.
git clone https://github.com/aimagelab/ReT.git
cd ReT
pip install -r requirements.txt
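
As an optional sanity check (not part of the official instructions), you can verify that PyTorch sees the GPU and that FAISS imports correctly before moving on:

# Optional environment sanity check (not part of the official setup).
import torch
import faiss  # provided by the conda-forge faiss-gpu package

print("PyTorch:", torch.__version__)        # expected: 2.2.2
print("CUDA available:", torch.cuda.is_available())
print("FAISS:", faiss.__version__)          # expected: 1.7.4
print("FAISS GPUs:", faiss.get_num_gpus())  # > 0 if the GPU build is active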

Pre-trained models 🤗

ReT model checkpoints are available on Hugging Face. You can use these checkpoints directly for retrieval tasks or fine-tune them to suit your specific retrieval needs.

Available Checkpoints and Benchmark Results

| Model | WIT Recall@10 | IGLUE Recall@1 | KVQA Recall@5 | OVEN Recall@5 | LLaVA Recall@1 | InfoSeek Recall@5 | InfoSeek Pseudo Recall@5 | EVQA Recall@5 | EVQA Pseudo Recall@5 | OKVQA Recall@5 | OKVQA Pseudo Recall@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ReT-CLIP-ViT-L-14 🤗 | 0.734 | 0.818 | 0.635 | 0.820 | 0.799 | 0.470 | 0.605 | 0.445 | 0.579 | 0.202 | 0.662 |
| ReT-OpenCLIP-ViT-H-14 🤗 | 0.714 | 0.800 | 0.593 | 0.830 | 0.798 | 0.473 | 0.607 | 0.448 | 0.578 | 0.182 | 0.634 |
| ReT-OpenCLIP-ViT-G-14 🤗 | 0.751 | 0.822 | 0.606 | 0.840 | 0.792 | 0.520 | 0.625 | 0.486 | 0.602 | 0.190 | 0.638 |

ReT-M2KR Dataset 🤗

You can download the ReT-M2KR benchmark by following the instructions provided here. This dataset is used for training and evaluating ReT in multimodal information retrieval and includes images (coming soon) and JSONL files.

The ReT-M2KR benchmark is an extended version of the M2KR dataset, with the following modifications:

  • MSMARCO data is excluded, as it does not contain query images
  • Passage images have been added to the OVEN, InfoSeek, E-VQA, and OKVQA datasets

For further details, please refer to the associated research paper.
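
Each JSONL file contains one JSON record per line. As a minimal sketch (not an official utility), you can inspect which fields a given split provides as follows; the file name below is a placeholder, and the available fields vary across splits:

# Minimal sketch: inspect a ReT-M2KR JSONL annotation file.
# 'ret_m2kr_split.jsonl' is a hypothetical name; point it to a downloaded file.
import json

jsonl_path = 'path/to/ret_m2kr_split.jsonl'
with open(jsonl_path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))  # field names provided by this split
        if i == 2:  # look at the first few records only
            break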

Use with Transformers

from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])


# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
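
ReT represents each query and passage as 32 retrieval tokens of dimension 128. A common way to collapse two such token sets into a single relevance score is a ColBERT-style late-interaction (MaxSim) reduction; the sketch below illustrates that idea with random placeholders standing in for the get_ret_features() outputs, and is not necessarily the exact scoring used by the official evaluation scripts.

# Illustrative only: ColBERT-style late-interaction (MaxSim) scoring.
# Random placeholders stand in for the [1, 32, 128] get_ret_features() outputs.
q_feats = torch.randn(1, 32, 128)  # query tokens
p_feats = torch.randn(1, 32, 128)  # passage tokens

# Similarity between every query token and every passage token.
sim = torch.einsum('qnd,pmd->qpnm', q_feats, p_feats)
# MaxSim: best-matching passage token per query token, summed over query tokens.
score = sim.max(dim=-1).values.sum(dim=-1)
print(score.shape)  # torch.Size([1, 1]) -> one score per (query, passage) pair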

Indexing and Searching

To evaluate ReT on the M2KR benchmark, we provide SLURM script examples here. These scripts handle both the indexing and the searching steps.

Make sure to set JSONL_ROOT_PATH and IMAGE_ROOT_PATH to the directories where the JSONL files and images have been downloaded.
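
For a rough picture of the index-then-search flow (the official scripts have their own indexing and scoring logic), the simplified sketch below pools the 32 retrieval tokens into a single vector per item and uses a flat FAISS inner-product index; this is only an approximation of ReT's fine-grained matching.

# Simplified illustration of indexing and searching with FAISS.
# Random placeholders stand in for ReT passage/query features ([N, 32, 128]).
import faiss
import numpy as np

passage_feats = np.random.rand(1000, 32, 128).astype('float32')
query_feats = np.random.rand(4, 32, 128).astype('float32')

# Pool the 32 tokens into one vector per item (approximation only).
p_vecs = passage_feats.mean(axis=1)
q_vecs = query_feats.mean(axis=1)
faiss.normalize_L2(p_vecs)
faiss.normalize_L2(q_vecs)

index = faiss.IndexFlatIP(128)  # inner product == cosine after L2 normalization
index.add(p_vecs)
scores, ids = index.search(q_vecs, 10)  # top-10 passage ids per query
print(ids.shape)  # (4, 10)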

Known issue

If the inference script gets stuck while indexing, try clearing the PyTorch extensions cache and re-running:

rm -rf ~/.cache/torch_extensions  

Acknowledgments

We thank the teams behind ColBERT, PreFLMR, and UniIR for open-sourcing their models, datasets, and code.
