MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Yongcheng Yao¹, Yongshuo Zong¹, Raman Dutt¹, Yongxin Yang²,
Sotirios A Tsaftaris¹, Timothy Hospedales¹
¹University of Edinburgh    ²Queen Mary University of London

MedVision is a dataset and benchmark for quantitative medical image analysis, including detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement tasks.

[Figure: MedVision overview]

🌟 Highlights

  • Research gap. Modern VLMs cannot reliably produce precise quantitative measurements from medical images.
  • Dataset. MedVision: a large-scale, multi-anatomy, multi-modality dataset for quantitative medical image analysis (22 public datasets, 30.8M image-annotation pairs).
  • Benchmark. The first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement in medical images.
  • Model. MedVision-V0, a 7B model trained on MedVision via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT); it significantly outperforms all evaluated VLMs across all three tasks, providing a strong, open baseline.
  • Open release. Data, model, and code (training and evaluation) are all publicly available.

🎯 Problem & Tasks

Clinical decisions rely on quantitative assessment: measuring a tumor to stage disease, a joint angle to plan surgery, or an anatomical distance to track development. We therefore target a concrete model ability: given a medical image, produce precise numeric measurements in real-world physical units (millimeters and degrees, not pixels).

MedVision evaluates this ability across three quantitative tasks:

1️⃣ Detection

Localize healthy anatomical structures and abnormalities with bounding boxes.

2️⃣ Tumor/Lesion Size

Estimate the longest diameter (major axis) and its perpendicular diameter (minor axis) of a tumor/lesion, reported in millimeters.

3️⃣ Angle/Distance

Measure angles (degrees) and distances (mm) from anatomical landmarks.

Figure 1: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
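
The size annotation described in Figure 1 can be approximated from a binary mask using image moments. The sketch below is not the authors' fitting procedure: it assumes a moment-based equivalent ellipse, a mask given as nested lists of 0/1 values, and a hypothetical `spacing_mm` argument holding the physical pixel size as (row, column). The function name `ellipse_axes_mm` is illustrative.

```python
import math


def ellipse_axes_mm(mask, spacing_mm=(1.0, 1.0)):
    """Return (major, minor) full axis lengths in mm for a binary mask.

    mask: 2D list of 0/1 values indexed as mask[row][col].
    spacing_mm: (row_spacing, col_spacing), assumed physical pixel size.
    """
    sy, sx = spacing_mm
    # Foreground pixel coordinates, scaled to millimeters.
    pts = [(r * sy, c * sx)
           for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    if len(pts) < 2:
        raise ValueError("mask needs at least two foreground pixels")
    n = len(pts)
    my = sum(p[0] for p in pts) / n
    mx = sum(p[1] for p in pts) / n
    # Second central moments (2x2 covariance of the coordinates).
    cyy = sum((p[0] - my) ** 2 for p in pts) / n
    cxx = sum((p[1] - mx) ** 2 for p in pts) / n
    cxy = sum((p[0] - my) * (p[1] - mx) for p in pts) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (cxx + cyy) / 2.0
    root = math.sqrt(((cxx - cyy) / 2.0) ** 2 + cxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    # A filled ellipse with semi-axis a has variance a^2 / 4 along that axis,
    # so the full axis length is 4 * sqrt(eigenvalue).
    return 4.0 * math.sqrt(lam_major), 4.0 * math.sqrt(lam_minor)
```

Scaling the coordinates before computing moments keeps the result correct when row and column spacing differ, which is common in CT and MR slices.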

Figure 2: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are computed from these landmarks.
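
Deriving ground-truth angles and distances from landmarks reduces to basic vector geometry once pixel coordinates are scaled to physical units. The sketch below is illustrative only: the function names, the (row, column) landmark convention, the `spacing_mm` argument, and the three-point angle definition are assumptions and are not taken from the MedVision code.

```python
import math


def distance_mm(p, q, spacing_mm=(1.0, 1.0)):
    """Euclidean distance in mm between landmarks p and q, given as (row, col)."""
    dy = (p[0] - q[0]) * spacing_mm[0]
    dx = (p[1] - q[1]) * spacing_mm[1]
    return math.hypot(dx, dy)


def angle_deg(a, vertex, b, spacing_mm=(1.0, 1.0)):
    """Angle at `vertex` between rays vertex->a and vertex->b, in [0, 180] degrees."""
    sy, sx = spacing_mm
    u = ((a[0] - vertex[0]) * sy, (a[1] - vertex[1]) * sx)
    v = ((b[0] - vertex[0]) * sy, (b[1] - vertex[1]) * sx)
    if u == (0.0, 0.0) or v == (0.0, 0.0):
        raise ValueError("landmarks a and b must differ from the vertex")
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    # atan2 on (|cross|, dot) is numerically stabler than acos of a ratio.
    return math.degrees(math.atan2(abs(cross), dot))
```

Applying the spacing before taking the angle matters: with anisotropic pixels, an angle measured in raw pixel coordinates differs from the physical angle.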

📈 Benchmark Results

Figure 3: Per-label performance of MedVision-V0 and off-the-shelf VLMs: (a) detection recall / precision / F1, (b) tumor/lesion size MRE, and (c) angle/distance MRE.

MedVision-V0 outperforms all 12 evaluated off-the-shelf VLMs on F1, IoU, MAE, and MRE across all three quantitative task families; the main exception is tumor/lesion detection recall, where several baselines score higher but at much lower precision. Each task below leads with the full leaderboard (🥇/🥈/🥉 mark the best three per column), followed by an interactive viewer of real predictions showing the complete prompt, the model's chain-of-thought response, and the error metrics beside the image with a ground-truth-vs-prediction overlay.

1️⃣ Detection

Table 2: Detection performance (%), grouped into anatomy and tumor/lesion targets. R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.

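Anatomy (18 regions, 13.4K)

| Model | R | P | F1 | IoU | SR | IoU>0.5 |
|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 81.3 🥇 | 80.4 🥇 | 79.1 🥇 | 72.0 🥇 | 100 | 80.1 🥇 |
| Lingshu (32B) | 37.4 | 20.2 🥈 | 20.2 🥈 | 13.7 🥈 | 100 | 6.7 🥈 |
| MedGemma (27B) | 56.4 | 15.5 | 18.8 🥉 | 12.7 🥉 | 97.1 | 6.6 🥉 |
| MedGemma (4B) | 68.6 🥉 | 14.6 | 18.5 | 12.4 | 98.2 | 6.4 |
| Qwen2.5-VL (32B) | 44.8 | 14.9 | 18.4 | 12.5 | 100 | 6.3 |
| LLaVA-OneVision (72B) | 34.9 | 19.0 | 18.1 | 11.8 | 100 | 2.4 |
| InternVL3 (38B) | 31.1 | 17.0 | 17.2 | 11.5 | 100 | 5.3 |
| Qwen2.5-VL (7B) | 69.6 🥈 | 12.2 | 16.7 | 11.3 | 99.3 | 5.6 |
| Gemma3 (27B) | 37.1 | 12.4 | 14.9 | 10.1 | 100 | 4.6 |
| HealthGPT-L14 (14B) | 27.3 | 19.4 🥉 | 14.9 | 9.5 | 92.0 | 1.7 |
| MedDr (40B) | 53.2 | 11.1 | 14.6 | 9.6 | 96.2 | 4.1 |
| HuatuoGPT-Vision (34B) | 21.2 | 14.1 | 12.3 | 8.0 | 80.0 | 2.2 |
| Llama3.2-Vision (11B) | 41.9 | 8.6 | 10.7 | 7.1 | 70.1 | 2.5 |

Tumor/Lesion (8 regions, 8.5K)

| Model | R | P | F1 | IoU | SR | IoU>0.5 |
|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 52.4 | 50.5 🥇 | 46.9 🥇 | 38.2 🥇 | 100 | 40.7 🥇 |
| Lingshu (32B) | 40.2 | 6.0 | 8.6 🥈 | 5.1 🥈 | 100 | 0.2 |
| MedGemma (27B) | 52.7 | 4.5 | 7.4 | 4.2 | 94.4 | 0.1 |
| MedGemma (4B) | 77.6 🥇 | 4.4 | 7.4 | 4.2 | 99.1 | 0.0 |
| Qwen2.5-VL (32B) | 38.5 | 5.7 | 7.7 | 4.7 | 100 | 0.6 🥉 |
| LLaVA-OneVision (72B) | 34.1 | 6.3 🥉 | 8.4 🥉 | 5.0 🥉 | 100 | 0.4 |
| InternVL3 (38B) | 29.5 | 6.6 🥈 | 7.9 | 4.9 | 100 | 0.8 🥈 |
| Qwen2.5-VL (7B) | 77.4 🥈 | 3.8 | 6.5 | 3.6 | 99.6 | 0.0 |
| Gemma3 (27B) | 34.3 | 4.3 | 6.1 | 3.6 | 100 | 0.3 |
| HealthGPT-L14 (14B) | 25.6 | 5.9 | 7.1 | 4.4 | 82.6 | 0.5 |
| MedDr (40B) | 63.2 🥉 | 3.7 | 6.2 | 3.5 | 98.5 | 0.1 |
| HuatuoGPT-Vision (34B) | 17.8 | 4.0 | 5.1 | 3.1 | 76.6 | 0.3 |
| Llama3.2-Vision (11B) | 43.4 | 2.3 | 3.8 | 2.1 | 68.6 | 0.0 |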
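
To make the detection columns concrete, the sketch below computes recall, precision, F1, and IoU for one predicted box against one ground-truth box. It assumes an area-based definition (recall is the intersection over the ground-truth area, precision is the intersection over the predicted area); the page does not spell out the exact formulas, so treat this as one plausible reading. The function name `box_metrics` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical.

```python
def box_metrics(pred, gt):
    """Area-based overlap metrics for two axis-aligned boxes.

    pred, gt: (x_min, y_min, x_max, y_max) in the same coordinate system.
    Returns a dict with recall, precision, f1, and iou, each in [0, 1].
    """
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter

    recall = inter / area_gt if area_gt > 0 else 0.0
    precision = inter / area_pred if area_pred > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    iou = inter / union if union > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}
```

Under this reading, a model that predicts a box covering most of the image gets high recall but low precision and IoU. That pattern matches the rows above in which recall far exceeds precision.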

2️⃣ Tumor/Lesion Size

Table 3: Tumor/lesion size estimation (2K samples). MAE in millimeters; MRE, SR, and MRE<0.1 in %.

| Model | MAE (mm) | MRE (%) | SR (%) | MRE<0.1 (%) |
|---|---|---|---|---|
| MedVision-V0 (7B) | 10.5 🥇 | 26.0 🥇 | 100.0 | 23.5 🥇 |
| Lingshu (32B) | 35.7 🥈 | 118.6 🥈 | 99.5 | 4.5 🥈 |
| HuatuoGPT-Vision (34B) | 44.4 🥉 | 142.4 🥉 | 14.6 | 0.7 |
| HealthGPT-L14 (14B) | 49.9 | 168.6 | 100.0 | 3.3 🥉 |
| Llama3.2-Vision (11B) | 77.1 | 248.2 | 25.3 | 0.4 |
| MedDr (40B) | 97.7 | 312.7 | 63.4 | 0.4 |
| Gemma3 (27B) | 226.0 | 611.8 | 98.9 | 0.5 |
| MedGemma (27B) | 547.6 | 1772.6 | 52.5 | 0.7 |
| LLaVA-OneVision (72B) | 1016.8 | 3271.6 | 100.0 | 1.4 |
| Qwen2.5-VL (7B) | 2933.9 | 7738.9 | 95.5 | 0.7 |
| Qwen2.5-VL (32B) | 2721.5 | 10471.5 | 16.5 | 0.2 |
| InternVL3 (38B) | 7703.6 | 23307.5 | 100.0 | 0.2 |
| MedGemma (4B) | 728794.1 | 2293400.0 | 86.0 | 0.1 |
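
The measurement columns can be read with the following sketch, which computes MAE, MRE, SR, and MRE<0.1 from a list of predictions. Several details are assumptions rather than facts from the page: a failed response is represented as `None`, SR is the share of responses that yield a numeric value, MAE and MRE are averaged over successful responses only, and MRE<0.1 is counted over all samples. The function name `summarize` and the output keys are hypothetical.

```python
def summarize(preds, targets, threshold=0.1):
    """Summarize measurement errors.

    preds: predicted values, with None where no number could be parsed.
    targets: non-zero ground-truth values in the same unit as preds.
    """
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal in length")
    abs_errs, rel_errs = [], []
    for p, t in zip(preds, targets):
        if p is None:
            continue  # failed response: lowers SR, skipped for MAE/MRE
        abs_errs.append(abs(p - t))
        rel_errs.append(abs(p - t) / abs(t))
    n, ok = len(targets), len(abs_errs)
    nan = float("nan")
    return {
        "MAE": sum(abs_errs) / ok if ok else nan,
        "MRE_pct": 100.0 * sum(rel_errs) / ok if ok else nan,
        "SR_pct": 100.0 * ok / n,
        # Failed responses count as misses under this assumption.
        "MRE<0.1_pct": 100.0 * sum(e < threshold for e in rel_errs) / n,
    }
```

Because MRE is an unbounded mean, a handful of predictions that are off by orders of magnitude (for example, answers given in pixels instead of millimeters) can push it into the thousands of percent, while MRE<0.1 stays bounded between 0 and 100.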

3️⃣ Angle/Distance

Table 4: Angle/distance measurement across all 12 off-the-shelf VLMs and MedVision-V0, for each sub-task. MAE in millimeters (distance) and degrees (angle); MRE, SR, and MRE<0.1 in %.

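Ceph-Bio-400 · Distance (1000)

| Model | MAE (mm) | MRE (%) | SR (%) | MRE<0.1 (%) |
|---|---|---|---|---|
| MedVision-V0 (7B) | 3.4 🥇 | 5.4 🥇 | 100 | 85.3 🥇 |
| HealthGPT-L14 (14B) | 19.5 🥈 | 29.7 🥈 | 95.6 | 24.1 🥈 |
| Lingshu (32B) | 214.4 | 257.6 | 100 | 23.5 🥉 |
| MedDr (40B) | 110.1 | 175.4 | 60.4 | 5.0 |
| MedGemma (27B) | 28.8 🥉 | 48.0 🥉 | 33.5 | 4.7 |
| Qwen2.5-VL (32B) | 594.7 | 1022.1 | 8.7 | 0.5 |
| Llama3.2-Vision (11B) | 1726.6 | 2948.5 | 17.1 | 0.3 |
| LLaVA-OneVision (72B) | 660.4 | 1084.9 | 99.9 | 6.4 |
| Gemma3 (27B) | 5563.4 | 7261.7 | 98.4 | 13.5 |
| HuatuoGPT-Vision (34B) | 9607.1 | 18392.7 | 75.3 | 4.1 |
| InternVL3 (38B) | 14900.1 | 20754.9 | 99.7 | 6.7 |
| MedGemma (4B) | 16767.3 | 27429.4 | 95.4 | 0.1 |
| Qwen2.5-VL (7B) | 68610.5 | 101639.4 | 100 | 0.5 |

Ceph-Bio-400 · Angle (957)

| Model | MAE (°) | MRE (%) | SR (%) | MRE<0.1 (%) |
|---|---|---|---|---|
| MedVision-V0 (7B) | 4.7 🥇 | 52.1 🥇 | 99.9 | 52.0 🥇 |
| HealthGPT-L14 (14B) | 32.8 🥉 | 727.3 | 74.9 | 8.9 🥉 |
| Lingshu (32B) | 35.0 | 512.5 | 100 | 6.3 |
| MedDr (40B) | 47.3 | 615.8 | 71.8 | 5.0 |
| MedGemma (27B) | 42.7 | 971.4 | 54.8 | 2.8 |
| Qwen2.5-VL (32B) | 33.4 | 130.5 🥈 | 7.5 | 0.1 |
| Llama3.2-Vision (11B) | 38.9 | 363.2 | 93.0 | 2.9 |
| LLaVA-OneVision (72B) | 39.5 | 530.8 | 97.3 | 4.8 |
| Gemma3 (27B) | 36.3 | 702.2 | 99.9 | 6.7 |
| HuatuoGPT-Vision (34B) | 55.4 | 1045.9 | 2.2 | 0.1 |
| InternVL3 (38B) | 31.0 🥈 | 553.0 | 100 | 13.7 🥈 |
| MedGemma (4B) | 35.7 | 301.1 🥉 | 91.4 | 6.0 |
| Qwen2.5-VL (7B) | 48.0 | 724.9 | 97.6 | 2.0 |

FeTA24 · Distance (100)

| Model | MAE (mm) | MRE (%) | SR (%) | MRE<0.1 (%) |
|---|---|---|---|---|
| MedVision-V0 (7B) | 5.6 🥇 | 15.8 🥇 | 100 | 42.0 🥇 |
| HealthGPT-L14 (14B) | 28.6 🥈 | 160.3 | 70.0 | 7.0 |
| Lingshu (32B) | 43.5 | 148.4 🥉 | 100 | 0.0 |
| MedDr (40B) | 136.0 | 599.2 | 70.0 | 0.0 |
| MedGemma (27B) | 41.5 | 194.4 | 43.0 | 2.0 |
| Qwen2.5-VL (32B) | 1255.1 | 2515.8 | 31.0 | 0.0 |
| Llama3.2-Vision (11B) | 1198.5 | 3375.9 | 28.0 | 0.0 |
| LLaVA-OneVision (72B) | 9167.5 | 39550.6 | 100 | 12.0 🥈 |
| Gemma3 (27B) | 35.1 🥉 | 173.3 | 100 | 9.0 |
| HuatuoGPT-Vision (34B) | 111.7 | 397.8 | 59.0 | 1.0 |
| InternVL3 (38B) | 8559.1 | 42057.3 | 100 | 11.0 🥉 |
| MedGemma (4B) | 51.3 | 135.8 🥈 | 87.0 | 0.0 |
| Qwen2.5-VL (7B) | 13536.3 | 45568.5 | 81.0 | 0.0 |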

BibTeX

@misc{yao2025medvisiondatasetbenchmarkquantitative,
    title={MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis}, 
    author={Yongcheng Yao and Yongshuo Zong and Raman Dutt and Yongxin Yang and Sotirios A Tsaftaris and Timothy Hospedales},
    year={2025},
    eprint={2511.18676},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.18676}, 
}