MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

MedVision is a dataset and benchmark for quantitative medical image analysis, including detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement tasks.

[Figure: MedVision overview]

🌟 Highlights

  • Research Gap: modern VLMs cannot reliably produce precise quantitative measurements from medical images.
  • Dataset: MedVision, a large-scale, multi-modality dataset for quantitative medical image analysis, covering 22 public datasets and 30.8M images with structured measurement annotations.
  • Benchmark: the first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion size estimation, and angle/distance measurement.
  • Supervised Finetuning (SFT): SFT can improve the performance of VLMs on quantitative medical image analysis tasks.
  • Code and Models: data, model checkpoints, and code (training and evaluation) are available.

📀 Dataset

MedVision includes 22 public datasets and 30.8M image-annotation pairs. The dataset is available here. Details are as follows:

Table 1: The MedVision dataset consists of public medical images and quantitative annotations from this study. MRI: Magnetic Resonance Imaging; CT: Computed Tomography; PET: Positron Emission Tomography; US: Ultrasound; b-box: bounding box; T/L: tumor/lesion size; A/D: angle/distance; HF: HuggingFace; GC: Grand-Challenge; redistributed.

| Dataset | Anatomy | Modality | Annotation | Availability | Source | b-box (train / test) | T/L (train / test) | A/D (train / test) |
|---|---|---|---|---|---|---|---|---|
| AbdomenAtlas | abdomen | CT | b-box | open | HF | 6.8 / 2.9M | 0 | 0 |
| AbdomenCT-1K | abdomen | CT | b-box | open | Zenodo | 0.7 / 0.3M | 0 | 0 |
| ACDC | heart | MRI | b-box | open | HF, others | 9.5 / 4.8K | 0 | 0 |
| AMOS22 | abdomen | CT, MRI | b-box | open | Zenodo | 0.8 / 0.3M | 0 | 0 |
| autoPET-III | whole body | CT, PET | b-box, T/L | open | HF, others | 22 / 9.7K | 0.5 / 0.2K | 0 |
| BCV15 | abdomen | CT | b-box | open | HF, Synapse | 71 / 30K | 0 | 0 |
| BraTS24 | brain | MRI | b-box, T/L | open | HF, Synapse | 0.8 / 0.3M | 7.9 / 3.1K | 0 |
| CAMUS | heart | US | b-box | open | HF, others | 0.7 / 0.3M | 0 | 0 |
| Ceph-Bio-400 | head and neck | X-ray | b-box, A/D | open | HF, others | 0 | 0 | 5.3 / 2.3K |
| CrossModDA | brain | MRI | b-box | open | HF, Zenodo | 3.0 / 1.0K | 0 | 0 |
| FeTA24 | fetal brain | MRI | b-box, A/D | registration | Synapse | 34 / 15K | 0 | 0.2 / 0.1K |
| FLARE22 | abdomen | CT | b-box | open | HF, others | 72 / 33K | 0 | 0 |
| HNTSMRG24 | head and neck | MRI | b-box, T/L | open | Zenodo | 18 / 6.6K | 1.0 / 0.4K | 0 |
| ISLES24 | brain | MRI | b-box | open | HF, GC | 7.3 / 2.5K | 0 | 0 |
| KiPA22 | kidney | CT | b-box, T/L | open | HF, GC | 26 / 11K | 2.1 / 1.0K | 0 |
| KiTS23 | kidney | CT | b-box, T/L | open | HF, GC | 80 / 35K | 5.9 / 2.6K | 0 |
| MSD | multiple | CT, MRI | b-box, T/L | open | others | 0.2 / 0.1M | 5.3 / 2.2K | 0 |
| OAIZIB-CM | knee | MRI | b-box | open | HF | 0.5 / 0.2M | 0 | 0 |
| SKM-TEA | knee | MRI | b-box | registration | others | 0.2 / 0.1M | 0 | 0 |
| ToothFairy2 | tooth | CT | b-box | registration | others | 1.0 / 0.4M | 0 | 0 |
| TopCoW24 | brain | CT, MRI | b-box | open | HF, Zenodo | 43 / 20K | 0 | 0 |
| TotalSegmentator | multiple | CT, MRI | b-box | open | HF, Zenodo | 9.6 / 4.0M | 0 | 0 |
| Total | | | | | | 22 / 9.2M | 23 / 9.6K | 5.6 / 2.4K |
[Figure: sample images]

Q: How to use the dataset?

  • Quick start
  • Prepare dataset for MedVision benchmarking and finetuning: here
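Below is a minimal sketch of loading the released data with the Hugging Face `datasets` library. The repository id, split name, and field names are placeholders (assumptions), so check the dataset card linked above for the actual values.

```python
# Minimal sketch: load a MedVision split with the Hugging Face `datasets` library.
# NOTE: the repository id and field names below are placeholders, not the
# official ones -- see the dataset card linked above for the actual values.
from datasets import load_dataset

ds = load_dataset("MedVision/MedVision", split="test")  # hypothetical repo id

sample = ds[0]
print(sample.keys())  # e.g. image, task, question, annotation (names may differ)
```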

1️⃣ Detection

Q: Can VLMs localize healthy anatomical structures and abnormalities from medical images?

  • Pretrained VLMs show limited ability in medical image detection tasks.
  • With SFT, VLMs achieve dramatic improvement in the detection of both healthy anatomical structures and tumors/lesions.
  • Detecting small objects and tumors/lesions remains challenging.
  • SFT improves models’ generalizability (details in the paper).

Group-level detection performance

Table 2: VLM performance on detection tasks. Targets are grouped into healthy anatomy and tumor/lesion detection tasks. Mean metrics weighted by sample sizes are reported in %. R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.

| Model | Anatomy: R | P | F1 | IoU | SR | IoU>0.5 | Tumor/Lesion: R | P | F1 | IoU | SR | IoU>0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 50.4 | 9.5 | 12.4 | 7.6 | 100 | 1.4 | 54.5 | 3.4 | 5.6 | 3.1 | 100 | 0.0 |
| Qwen2.5-VL (32B) | 35.2 | 13.6 | 16.1 | 11.0 | 99.7 | 5.7 | 38.0 | 6.2 | 8.4 | 5.1 | 99.9 | 0.7 |
| Lingshu (32B) | 26.4 | 19.1 | 17.3 | 11.6 | 100 | 4.0 | 23.1 | 5.9 | 7.2 | 4.4 | 100 | 0.6 |
| InternVL3 (38B) | 24.5 | 14.5 | 14.5 | 9.8 | 100 | 5.3 | 39.2 | 5.9 | 7.6 | 4.5 | 100 | 0.4 |
| Gemma3 (27B) | 33.3 | 17.4 | 19.4 | 13.7 | 100 | 9.9 | 33.4 | 5.1 | 6.6 | 4.0 | 100 | 0.5 |
| MedGemma (4B) | 66.2 | 16.3 | 19.0 | 12.8 | 100 | 7.2 | 72.1 | 5.2 | 8.6 | 4.9 | 99.6 | 0.2 |
| MedGemma (27B) | 65.6 | 17.3 | 19.3 | 12.9 | 100 | 6.5 | 65.8 | 5.7 | 8.8 | 5.1 | 100 | 0.2 |
| Llama3.2-Vision (11B) | 47.0 | 8.1 | 10.4 | 6.8 | 73.8 | 2.7 | 45.6 | 2.1 | 3.6 | 2.0 | 70.1 | 0.0 |
| LLava-OneVision (72B) | 36.4 | 17.8 | 18.5 | 12.3 | 100 | 4.4 | 38.1 | 6.0 | 8.5 | 5.1 | 100 | 0.4 |
| LLaVA-Med-v1.5 (7B) | 60.7 | 15.7 | 18.6 | 12.6 | 99.1 | 7.0 | 50.6 | 4.7 | 7.3 | 4.3 | 89.0 | 0.8 |
| MedDr (40B) | 64.4 | 11.9 | 17.3 | 11.3 | 99.8 | 4.1 | 74.9 | 4.4 | 7.5 | 4.3 | 97.8 | 0.1 |
| HuatuoGPT-Vision (34B) | 25.3 | 20.5 | 16.3 | 10.7 | 100 | 3.6 | 20.8 | 5.9 | 7.0 | 4.3 | 100 | 0.6 |
| HealthGPT-L14 | 21.0 | 15.2 | 13.5 | 8.3 | 100 | 0.5 | 22.7 | 6.2 | 6.8 | 4.2 | 100 | 0.5 |
| Gemini2.5-Flash (w/o tool) | 35.4 | 16.1 | 18.7 | 12.8 | 99.5 | 6.8 | 41.4 | 7.1 | 10.1 | 6.3 | 98.3 | 1.2 |
| Gemini2.5-Flash (w tool) | 29.9 | 13.0 | 15.2 | 10.6 | 82.0 | 5.8 | 38.4 | 9.1 | 10.2 | 6.9 | 77.8 | 3.6 |
| Qwen2.5-VL (7B, SFT1M) | 80.6 | 79.1 | 78.2 | 71.6 | 100 | 79.5 | 55.8 | 51.6 | 49.4 | 41.2 | 100 | 46.2 |
| Qwen2.5-VL (32B, SFT1M) | 82.1 | 83.4 | 81.2 | 74.6 | 100 | 82.8 | 58.7 | 54.8 | 52.6 | 44.3 | 100 | 49.5 |

Label-level detection performance

This figure shows detection metrics per label and per box-to-image ratio (which indicates relative target size) for each model.

[Figure: detection metrics per label and box size]
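For reference, the detection metrics above follow standard definitions. Below is a minimal sketch (not the benchmark's exact evaluation code) of the IoU between a predicted and a ground-truth box, and of the box-to-image ratio used to group targets by relative size.

```python
# Minimal sketch of IoU and box-to-image ratio for axis-aligned boxes
# given as (x1, y1, x2, y2) in pixel coordinates.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)      # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def box_to_image_ratio(box, image_w, image_h):
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (image_w * image_h)
```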

2️⃣ Tumor/Lesion Size Estimation

Q: Can VLMs estimate the size of tumors/lesions from medical images?

  • The mean relative error (MRE) of SFT models is ~30%, compared with 50–120% for off-the-shelf VLMs.
  • SFT improves models’ generalizability (details in paper)

VLMs are asked to estimate the longest diameter of a tumor/lesion and the diameter perpendicular to it. An example of the quantitative annotations from this work is shown below.

[Figure: tumor/lesion annotation samples]

Figure 2: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
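One possible way to reproduce such an annotation from a binary mask is sketched below with OpenCV's `fitEllipse`. This is an illustrative assumption rather than the exact pipeline used in this work; `spacing_mm` stands for the in-plane pixel spacing.

```python
# Minimal sketch: fit an ellipse to a binary tumor/lesion mask and derive the
# longest diameter and its perpendicular diameter (illustrative only).
import cv2
import numpy as np

def tl_diameters(mask: np.ndarray, spacing_mm: float = 1.0):
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)          # largest connected region
    (cx, cy), (d1, d2), angle = cv2.fitEllipse(contour)   # full axis lengths in pixels
    long_diam = max(d1, d2) * spacing_mm                  # longest diameter (mm)
    perp_diam = min(d1, d2) * spacing_mm                  # perpendicular diameter (mm)
    return long_diam, perp_diam
```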

Table 3: VLM performance on tumor/lesion size estimation tasks. Mean relative error (MRE), success rate (SR), and MRE<k are reported in %, while mean absolute error (MAE) is in millimeters.

| Model | MAE | MRE | SR | MRE<0.1 | MRE<0.2 | MRE<0.3 |
|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 32.3 | 71.5 | 100.0 | 0.3 | 1.7 | 3.2 |
| Qwen2.5-VL (32B) | 24.2 | 51.9 | 100.0 | 2.7 | 11.8 | 19.9 |
| Lingshu (32B) | 23.9 | 66.1 | 100.0 | 10.4 | 29.7 | 43.4 |
| InternVL3 (38B) | 22.6 | 50.1 | 100.0 | 4.5 | 15.0 | 24.2 |
| Gemma3 (27B) | 30.7 | 70.7 | 100.0 | 1.4 | 4.9 | 8.7 |
| MedGemma (4B) | 38.9 | 116.8 | 100.0 | 1.1 | 4.5 | 7.6 |
| MedGemma (27B) | 38.9 | 116.9 | 100.0 | 1.1 | 4.3 | 7.5 |
| Llama3.2-Vision (11B) | 25.7 | 61.9 | 99.1 | 4.6 | 14.0 | 21.1 |
| LLava-OneVision (72B) | 26.2 | 83.0 | 100.0 | 4.8 | 17.7 | 29.4 |
| LLaVA-Med-v1.5 (7B) | 48.8 | 74.7 | 22.6 | 0.3 | 0.6 | 1.1 |
| MedDr (40B) | 30.1 | 73.8 | 100.0 | 1.6 | 4.7 | 9.4 |
| HuatuoGPT-Vision (34B) | 28.9 | 89.1 | 100.0 | 8.5 | 26.2 | 42.3 |
| HealthGPT-L14 | 23.6 | 61.3 | 98.9 | 8.0 | 22.6 | 35.8 |
| Qwen2.5-VL (7B, SFT5K) | 13.2 | 30.6 | 100.0 | 20.8 | 41.5 | 62.1 |
| Qwen2.5-VL (32B, SFT5K) | 12.8 | 30.2 | 100.0 | 19.6 | 43.2 | 63.2 |
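The MAE, MRE, SR, and MRE<k columns above (and in Table 4 below) follow their usual definitions. The sketch below shows one way to compute them, assuming unparsable answers are passed as `None`; the paper's exact protocol for handling failures may differ.

```python
# Minimal sketch of the measurement metrics: MAE, MRE, SR, and MRE<k.
# `preds` and `targets` are paired measurements (mm or degrees); a failed
# (unparsable) prediction is represented as None.

def measurement_metrics(preds, targets, k=0.1):
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    abs_err = [abs(p - t) for p, t in pairs]
    rel_err = [abs(p - t) / t for p, t in pairs]
    return {
        "MAE": sum(abs_err) / len(abs_err),             # same unit as the input
        "MRE": 100 * sum(rel_err) / len(rel_err),       # in %
        "SR": 100 * len(pairs) / len(preds),            # answers that parsed, in %
        f"MRE<{k}": 100 * sum(e < k for e in rel_err) / len(rel_err),
    }
```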

3️⃣ Angle/Distance Measurement

Q: Can VLMs measure angles or distances from medical images?

  • Off-the-shelf VLMs fail to measure angles and distances accurately.
  • SFT can significantly improve performance.
  • Small-angle measurements remain challenging.

VLMs are prompted with the task description and the definition of the angle or distance to measure. Examples of landmarks in the Ceph-Bio-400 and FeTA24 datasets are shown below.

[Figure: Ceph-Bio-400 and FeTA24 landmarks]

Figure 3: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are calculated from these landmarks.
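For illustration, the sketch below shows how a distance and an angle can be computed from annotated landmarks. The specific landmark definitions follow each dataset's protocol, and `spacing_mm` (pixel spacing) is an assumed input.

```python
# Minimal sketch: distance between two landmarks and angle at a vertex landmark.
# Points are (x, y) in pixel coordinates; `spacing_mm` converts to millimeters.
import math

def distance_mm(p1, p2, spacing_mm=1.0):
    return math.dist(p1, p2) * spacing_mm

def angle_deg(vertex, p1, p2):
    # Angle at `vertex` formed by the rays vertex->p1 and vertex->p2, in degrees.
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
```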

Table 4: VLM performance on angle and distance measurement tasks. Mean relative errors (MRE), MRE<0.1, and success rate (SR) are reported in %. Mean absolute errors (MAE) are given in millimeters (distance) and degrees (angle).

| Model | Ceph-Bio-400 Distance: MAE | MRE | SR | MRE<0.1 | Ceph-Bio-400 Angle: MAE | MRE | SR | MRE<0.1 | FeTA24 Distance: MAE | MRE | SR | MRE<0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 56.3 | 80.1 | 100 | 0.9 | 55.1 | 3691 | 98.7 | 3.4 | 27.5 | 61.6 | 100 | 2.0 |
| Qwen2.5-VL (32B) | 59.7 | 86.0 | 100 | 0.2 | 41.0 | 3287 | 100 | 2.0 | 19.5 | 49.6 | 100 | 6.0 |
| Lingshu (32B) | 17.3 | 26.1 | 100 | 15.5 | 31.2 | 4648 | 100 | 10.4 | 38.5 | 120.9 | 100 | 0.0 |
| InternVL3 (38B) | 22.1 | 38.2 | 100 | 11.1 | 66.1 | 4405 | 100 | 4.3 | 19.7 | 54.7 | 100 | 11.0 |
| Gemma3 (27B) | 22.6 | 32.3 | 100 | 22.4 | 33.6 | 7862 | 100 | 13.8 | 18.1 | 40.2 | 100 | 15.0 |
| MedGemma (4B) | 43.4 | 67.9 | 100 | 6.5 | 30.5 | 2780 | 100 | 5.8 | 26.2 | 60.4 | 100 | 5.0 |
| MedGemma (27B) | 43.9 | 68.8 | 100 | 6.4 | 29.9 | 2168 | 100 | 6.4 | 26.2 | 60.4 | 100 | 5.0 |
| Llama3.2-Vision (11B) | 81.6 | 117.4 | 100 | 2.9 | 55.3 | 9318.6 | 100 | 8.0 | 29.9 | 73.4 | 100 | 2.0 |
| LLava-OneVision (72B) | 36.3 | 67.8 | 100 | 18.8 | 62.6 | 12269.1 | 100 | 0.0 | 46.0 | 130.7 | 100 | 4.0 |
| MedDr (40B) | 33.7 | 53.6 | 100 | 6.0 | 54.8 | 9149.2 | 100 | 0.2 | 37.1 | 78.5 | 100 | 5.0 |
| HuatuoGPT-Vision (34B) | 45.8 | 65.2 | 100 | 2.8 | 65.8 | 5741.1 | 100 | 9.0 | 40.5 | 83.3 | 100 | 15.0 |
| HealthGPT-L14 | 66.8 | 97.5 | 100 | 1.1 | 42.9 | 434.6 | 100 | 0.0 | 32.1 | 79.4 | 100 | 5.0 |
| Qwen2.5-VL (7B, SFT5K) | 3.5 | 5.4 | 100 | 86.4 | 3.6 | 126.8 | 100 | 50.1 | 4.1 | 13.1 | 100 | 49.0 |
| Qwen2.5-VL (32B, SFT5K) | 4.5 | 7.9 | 100 | 83.5 | 4.0 | 540.2 | 100 | 50.5 | 4.1 | 14.0 | 100 | 54.0 |

BibTeX

to be updated