README.md

Data Preparation

Preparing the datasets for VILA training and testing requires three steps:

Downloading all the datasets (Information to download each dataset is provided in the readme.md for the vqa, report and expert directories)
Generating the instruction data for all datasets (Information to generate the instruction data is provided in the readme.md for the vqa, report and expert directory)
Adding the prepared datasets to VILA in a data mixture (More information can be found in the quickstart guide)

PathVQA: Pathology-based VQA dataset with ~4,000 images and ~32,000 QA pairs, focusing on microscopic views of human tissue.
RadVQA: Radiology VQA dataset containing ~7,000 images and ~25,000 QA pairs, covering various imaging modalities like X-rays and CT scans.
SLAKE: Specialized medical VQA dataset with ~14,000 images and ~45,000 QA pairs, emphasizing anatomy, modality, and abnormality questions.
Medical-Diff-VQA: Medical-Diff-VQA dataset, a derivative of the MIMIC-CXR dataset, consists of questions categorized into seven categories: abnormality, location, type, level, view, presence, and difference. We currently exclude the difference category in our training preparation.

MIMIC-CXR-JPG: The MIMIC-CXR-JPG Database v2.0.0 is a publicly available dataset containing 377,110 chest X-ray images in JPG format, along with structured labels derived from 227,827 radiology reports. The dataset is a processed version of MIMIC-CXR, with removed protected health information (PHI) to comply with HIPAA regulations. Its purpose is to support medical research in image understanding, natural language processing, and decision support, providing a standard reference for data splits and image labels.

ChestXRay14 Diverse and high-quality labeled chest x-ray images with findings positive for pneumothorax, opacity, nodule or mass, or fracture from Majkowska and Mittal et al. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation, Radiology 2020 294:2, 421-431.
CheXpert test set A test set of the CheXpert dataset consisting of 500 studies from 500 patients randomly sampled from the 1000 studies in the report test set. Eight board-certified radiologists individually annotated each of the studies in the test set. Please see also the github readme and data page for more details.

Dataset	QA/Text Pairs	Images	Train/Eval
PathVQA	~32,000	~4,000	Train/Eval
RadVQA	~25,000	~7,000	Train/Eval
SLAKE	~45,000	~14,000	Train/Eval
Medical-Diff-VQA	~429,000	~129,000	Train/Eval
MIMIC-CXR-JPG	~271,000	~271,000	Train/Eval
nih-chest-xray	~2,000	~2,000	Eval
CheXpert (test-set-labels)	500	500	Eval
Totals	>800,000	>427,000

To create datasets for training expert model selection capabilities, please follow the instructions in the expert directory.

We use the following datasets and expert models for expert model selection.

Modality	Expert	Datasets
CT	VISTA3D	MSD (liver, spleen, pancreas), TotalSegmentatorV2
MRI	BRATS (SegResNet)	BRATS (2018)
Chest X-Ray	TorchXRayVision	MIMIC (Reports, VQA)