MedVision: Benchmarking Quantitative Medical Image Analysis

Yongcheng Yao¹, Yongshuo Zong¹, Raman Dutt¹, Yongxin Yang², Sotirios A Tsaftaris¹, Timothy Hospedales¹
¹University of Edinburgh    ²Queen Mary University of London

MedVision is a dataset and benchmark for quantitative medical image analysis, including detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement tasks.

🌟 Highlights

  • Research gap. Modern vision-language models (VLMs) cannot reliably produce precise quantitative measurements from medical images.
  • Dataset. MedVision is a large-scale, multi-anatomy, multi-modality dataset for quantitative medical image analysis (22 public datasets, 30.8M image-annotation pairs).
  • Benchmark. The first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement in medical images.
  • Model. MedVision-V0, a 7B model trained on MedVision via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT); it significantly outperforms all evaluated VLMs across all three tasks, making it a strong, open baseline.
  • Open release. Data, model, and code (training and evaluation) are all publicly available.

🎯 Problem & Tasks

Clinical decisions rely on quantitative assessment: measuring a tumor to stage disease, a joint angle to plan surgery, or an anatomical distance to track development. We therefore target a concrete model ability: given a medical image, produce precise numeric measurements in real-world physical units (millimeters and degrees, not pixels).

MedVision evaluates this ability across three quantitative tasks:

1️⃣ Detection

Localize healthy anatomical structures and abnormalities with bounding boxes.

2️⃣ Tumor/Lesion Size

Estimate the longest diameter (major axis) and its perpendicular diameter (minor axis) of a tumor/lesion, reported in millimeters.

3️⃣ Angle/Distance

Measure angles (degrees) and distances (mm) from anatomical landmarks.

Figure 1: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
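The idea of turning a mask into two perpendicular diameters in millimeters can be illustrated with a minimal sketch. It is not the authors' annotation pipeline: it swaps in a moment-matched ellipse (the ellipse with the same second-order moments as the mask) in place of whatever fitting procedure was actually used, and it does not record the 4 landmarks. The function name `ellipse_axes_mm`, the helper `make_ellipse_mask`, and the `(row, col)` spacing convention are assumptions made for this example.

```python
import math


def ellipse_axes_mm(mask, spacing_mm):
    """Return (major_mm, minor_mm) of the moment-matched ellipse of a binary mask.

    mask: list of rows holding 0/1 values (hypothetical input format).
    spacing_mm: (row_spacing, col_spacing) in millimeters per pixel (assumed order).
    """
    row_mm, col_mm = spacing_mm
    # Convert foreground pixel centers to physical coordinates first,
    # so anisotropic pixel spacing is handled correctly.
    pts = [(r * row_mm, c * col_mm)
           for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    if n == 0:
        raise ValueError("mask has no foreground pixels")

    mean_y = sum(p[0] for p in pts) / n
    mean_x = sum(p[1] for p in pts) / n
    var_y = sum((p[0] - mean_y) ** 2 for p in pts) / n
    var_x = sum((p[1] - mean_x) ** 2 for p in pts) / n
    cov_xy = sum((p[0] - mean_y) * (p[1] - mean_x) for p in pts) / n

    # Closed-form eigenvalues of the 2x2 covariance matrix.
    half_trace = (var_x + var_y) / 2.0
    root = math.sqrt(((var_x - var_y) / 2.0) ** 2 + cov_xy ** 2)
    lam_max = half_trace + root
    lam_min = max(half_trace - root, 0.0)

    # For a filled ellipse the variance along a principal axis is
    # (semi-axis)^2 / 4, so the full axis length is 4 * sqrt(variance).
    return 4.0 * math.sqrt(lam_max), 4.0 * math.sqrt(lam_min)


def make_ellipse_mask(height, width, center, semi_axes_px):
    """Rasterize an axis-aligned filled ellipse (synthetic test data only)."""
    cy, cx = center
    b, a = semi_axes_px  # semi-axes along rows and columns, in pixels
    return [[1 if ((c - cx) / a) ** 2 + ((r - cy) / b) ** 2 <= 1.0 else 0
             for c in range(width)]
            for r in range(height)]


# Synthetic lesion: 60 x 30 pixel ellipse at 0.5 mm/pixel,
# so the expected diameters are roughly 30 mm and 15 mm.
demo_mask = make_ellipse_mask(61, 101, (30, 50), (15, 30))
major_mm, minor_mm = ellipse_axes_mm(demo_mask, (0.5, 0.5))
```

Converting pixel coordinates to millimeters before computing the moments, rather than scaling the final lengths, keeps the result correct when row and column spacing differ.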

Figure 2: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are computed from these landmarks.
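As a rough illustration of how landmark coordinates become physical measurements, the sketch below computes a distance in millimeters and a three-point angle in degrees. The function names, the `(row, col)` landmark format, and the `(row_spacing, col_spacing)` convention are assumptions for this example; the benchmark's own angle definitions (for instance, angles between two lines defined by four landmarks) may differ.

```python
import math


def distance_mm(p, q, spacing_mm):
    """Euclidean distance between two (row, col) pixel landmarks, in mm."""
    dy = (p[0] - q[0]) * spacing_mm[0]
    dx = (p[1] - q[1]) * spacing_mm[1]
    return math.hypot(dy, dx)


def angle_deg(a, vertex, b, spacing_mm):
    """Angle at `vertex` between the rays vertex->a and vertex->b, in degrees.

    The result lies in [0, 180]. Landmarks are (row, col) pixel coordinates.
    """
    v1 = ((a[0] - vertex[0]) * spacing_mm[0], (a[1] - vertex[1]) * spacing_mm[1])
    v2 = ((b[0] - vertex[0]) * spacing_mm[0], (b[1] - vertex[1]) * spacing_mm[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("a landmark coincides with the vertex")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Illustrative values only: a 3-4-5 triangle at 0.1 mm/pixel and a right angle.
d = distance_mm((0, 0), (3, 4), (0.1, 0.1))
theta = angle_deg((0, 10), (0, 0), (10, 0), (1.0, 1.0))
```

Working in physical coordinates matters for angles as well as distances: with anisotropic pixel spacing, an angle computed directly from pixel coordinates is distorted.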

📈 Benchmark Results

Figure 3: Per-label performance of MedVision-V0 and off-the-shelf VLMs: (a) detection recall / precision / F1, (b) tumor/lesion size MRE, and (c) angle/distance MRE.

MedVision-V0 outperforms all 12 evaluated off-the-shelf VLMs across all three quantitative task families. Each task below leads with the full leaderboard (🥇/🥈/🥉 mark the best three per column), followed by an interactive viewer of real predictions: the complete prompt, the model’s chain-of-thought response, and the error metrics, beside the image with a ground-truth-vs-prediction overlay.

1️⃣ Detection

Table 2: Detection performance (%), grouped into anatomy and tumor/lesion targets. R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.

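Anatomy (18 regions, 13.4K)

| Model | R | P | F1 | IoU | SR | IoU>0.5 |
|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 81.3 🥇 | 80.4 🥇 | 79.1 🥇 | 72.0 🥇 | 100 | 80.1 🥇 |
| Lingshu (32B) | 37.4 | 20.2 🥈 | 20.2 🥈 | 13.7 🥈 | 100 | 6.7 🥈 |
| MedGemma (27B) | 56.4 | 15.5 | 18.8 🥉 | 12.7 🥉 | 97.1 | 6.6 🥉 |
| MedGemma (4B) | 68.6 🥉 | 14.6 | 18.5 | 12.4 | 98.2 | 6.4 |
| Qwen2.5-VL (32B) | 44.8 | 14.9 | 18.4 | 12.5 | 100 | 6.3 |
| LLaVA-OneVision (72B) | 34.9 | 19.0 | 18.1 | 11.8 | 100 | 2.4 |
| InternVL3 (38B) | 31.1 | 17.0 | 17.2 | 11.5 | 100 | 5.3 |
| Qwen2.5-VL (7B) | 69.6 🥈 | 12.2 | 16.7 | 11.3 | 99.3 | 5.6 |
| Gemma3 (27B) | 37.1 | 12.4 | 14.9 | 10.1 | 100 | 4.6 |
| HealthGPT-L14 (14B) | 27.3 | 19.4 🥉 | 14.9 | 9.5 | 92.0 | 1.7 |
| MedDr (40B) | 53.2 | 11.1 | 14.6 | 9.6 | 96.2 | 4.1 |
| HuatuoGPT-Vision (34B) | 21.2 | 14.1 | 12.3 | 8.0 | 80.0 | 2.2 |
| Llama3.2-Vision (11B) | 41.9 | 8.6 | 10.7 | 7.1 | 70.1 | 2.5 |

Tumor/Lesion (8 regions, 8.5K)

| Model | R | P | F1 | IoU | SR | IoU>0.5 |
|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 52.4 | 50.5 🥇 | 46.9 🥇 | 38.2 🥇 | 100 | 40.7 🥇 |
| Lingshu (32B) | 40.2 | 6.0 | 8.6 🥈 | 5.1 🥈 | 100 | 0.2 |
| MedGemma (27B) | 52.7 | 4.5 | 7.4 | 4.2 | 94.4 | 0.1 |
| MedGemma (4B) | 77.6 🥇 | 4.4 | 7.4 | 4.2 | 99.1 | 0.0 |
| Qwen2.5-VL (32B) | 38.5 | 5.7 | 7.7 | 4.7 | 100 | 0.6 🥉 |
| LLaVA-OneVision (72B) | 34.1 | 6.3 🥉 | 8.4 🥉 | 5.0 🥉 | 100 | 0.4 |
| InternVL3 (38B) | 29.5 | 6.6 🥈 | 7.9 | 4.9 | 100 | 0.8 🥈 |
| Qwen2.5-VL (7B) | 77.4 🥈 | 3.8 | 6.5 | 3.6 | 99.6 | 0.0 |
| Gemma3 (27B) | 34.3 | 4.3 | 6.1 | 3.6 | 100 | 0.3 |
| HealthGPT-L14 (14B) | 25.6 | 5.9 | 7.1 | 4.4 | 82.6 | 0.5 |
| MedDr (40B) | 63.2 🥉 | 3.7 | 6.2 | 3.5 | 98.5 | 0.1 |
| HuatuoGPT-Vision (34B) | 17.8 | 4.0 | 5.1 | 3.1 | 76.6 | 0.3 |
| Llama3.2-Vision (11B) | 43.4 | 2.3 | 3.8 | 2.1 | 68.6 | 0.0 |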
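The page does not spell out how recall, precision, and F1 are computed for a single predicted box, so the sketch below is one plausible reading and should be treated as an assumption: recall is the fraction of the ground-truth box covered by the prediction, precision is the fraction of the predicted box that overlaps the ground truth, F1 is their harmonic mean, and IoU is the standard intersection over union. The function name `box_metrics` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical.

```python
def box_metrics(pred, gt):
    """Area-based overlap metrics for two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max). Returns fractions in [0, 1].
    The recall/precision definitions here are an assumed interpretation.
    """
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter

    recall = inter / area_gt if area_gt > 0 else 0.0
    precision = inter / area_pred if area_pred > 0 else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    iou = inter / union if union > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}


# Two 10 x 10 boxes offset by half a width: half of each box overlaps.
m = box_metrics((0, 0, 10, 10), (5, 0, 15, 10))
```

Under this reading, a model that predicts very large boxes scores high recall but low precision, which matches the pattern in several leaderboard rows where recall far exceeds precision.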

2️⃣ Tumor/Lesion Size

Table 3: Tumor/lesion size estimation (2K samples). MAE in millimeters; MRE, SR, and MRE<0.1 in %.

| Model | MAE | MRE | SR | MRE<0.1 |
|---|---|---|---|---|
| MedVision-V0 (7B) | 10.5 🥇 | 26.0 🥇 | 100.0 | 23.5 🥇 |
| Lingshu (32B) | 35.7 🥈 | 118.6 🥈 | 99.5 | 4.5 🥈 |
| HealthGPT-L14 (14B) | 49.9 | 168.6 | 100.0 | 3.3 🥉 |
| HuatuoGPT-Vision (34B) | 44.4 🥉 | 142.4 🥉 | 14.6 | 0.7 |
| Llama3.2-Vision (11B) | 77.1 | 248.2 | 25.3 | 0.4 |
| MedDr (40B) | 97.7 | 312.7 | 63.4 | 0.4 |
| Gemma3 (27B) | 226.0 | 611.8 | 98.9 | 0.5 |
| MedGemma (27B) | 547.6 | 1772.6 | 52.5 | 0.7 |
| LLaVA-OneVision (72B) | 1016.8 | 3271.6 | 100.0 | 1.4 |
| Qwen2.5-VL (7B) | 2933.9 | 7738.9 | 95.5 | 0.7 |
| Qwen2.5-VL (32B) | 2721.5 | 10471.5 | 16.5 | 0.2 |
| InternVL3 (38B) | 7703.6 | 23307.5 | 100.0 | 0.2 |
| MedGemma (4B) | 728794.1 | 2293400.0 | 86.0 | 0.1 |
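The four columns above can be sketched as follows, under stated assumptions. MAE is the mean absolute error and MRE the mean relative error over predictions that could be parsed; SR is taken to be the share of samples with a parseable numeric answer; MRE<0.1 is taken to be the share of all samples whose relative error is below 0.1. How the benchmark treats unparseable responses and which denominator it uses for MRE<0.1 are not stated on this page, so both choices, along with the name `measurement_metrics`, are assumptions of this example.

```python
def measurement_metrics(preds, targets, threshold=0.1):
    """Summarize numeric predictions against ground-truth measurements.

    preds: list of floats, or None where no number could be parsed (assumed
           convention for a failed response).
    targets: list of non-zero ground-truth values in the same unit as preds.
    Returns MAE in the input unit; MRE, SR, and the threshold rate in percent.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    n_total = len(targets)
    if n_total == 0:
        raise ValueError("no samples")

    # Keep only the samples with a parseable prediction.
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    n_ok = len(pairs)
    abs_err = [abs(p - t) for p, t in pairs]
    rel_err = [abs(p - t) / abs(t) for p, t in pairs]

    return {
        "mae": sum(abs_err) / n_ok if n_ok else float("nan"),
        "mre_pct": 100.0 * sum(rel_err) / n_ok if n_ok else float("nan"),
        "sr_pct": 100.0 * n_ok / n_total,
        # Assumption: failed responses count as misses for the threshold rate.
        "within_pct": 100.0 * sum(1 for e in rel_err if e < threshold) / n_total,
    }


# Illustrative values only (millimeters); the second response failed to parse.
metrics = measurement_metrics([10.0, None, 21.0, 30.0], [10.5, 20.0, 20.0, 40.0])
```

Reporting SR alongside MAE and MRE matters when errors are averaged only over parsed responses: a model with a low SR can show a deceptively small error on the few samples it answers.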

3️⃣ Angle/Distance

Table 4: Angle/distance measurement across all 12 off-the-shelf VLMs and MedVision-V0, for each sub-task. MAE in millimeters (distance) and degrees (angle); MRE, SR, and MRE<0.1 in %.

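Ceph-Bio-400 · Distance (1000)

| Model | MAE | MRE | SR | MRE<0.1 |
|---|---|---|---|---|
| MedVision-V0 (7B) | 3.4 🥇 | 5.4 🥇 | 100 | 85.3 🥇 |
| HealthGPT-L14 (14B) | 19.5 🥈 | 29.7 🥈 | 95.6 | 24.1 🥈 |
| Lingshu (32B) | 214.4 | 257.6 | 100 | 23.5 🥉 |
| MedDr (40B) | 110.1 | 175.4 | 60.4 | 5.0 |
| MedGemma (27B) | 28.8 🥉 | 48.0 🥉 | 33.5 | 4.7 |
| Qwen2.5-VL (32B) | 594.7 | 1022.1 | 8.7 | 0.5 |
| Llama3.2-Vision (11B) | 1726.6 | 2948.5 | 17.1 | 0.3 |
| LLaVA-OneVision (72B) | 660.4 | 1084.9 | 99.9 | 6.4 |
| Gemma3 (27B) | 5563.4 | 7261.7 | 98.4 | 13.5 |
| HuatuoGPT-Vision (34B) | 9607.1 | 18392.7 | 75.3 | 4.1 |
| InternVL3 (38B) | 14900.1 | 20754.9 | 99.7 | 6.7 |
| MedGemma (4B) | 16767.3 | 27429.4 | 95.4 | 0.1 |
| Qwen2.5-VL (7B) | 68610.5 | 101639.4 | 100 | 0.5 |

Ceph-Bio-400 · Angle (957)

| Model | MAE | MRE | SR | MRE<0.1 |
|---|---|---|---|---|
| MedVision-V0 (7B) | 4.7 🥇 | 52.1 🥇 | 99.9 | 52.0 🥇 |
| HealthGPT-L14 (14B) | 32.8 🥉 | 727.3 | 74.9 | 8.9 🥉 |
| Lingshu (32B) | 35.0 | 512.5 | 100 | 6.3 |
| MedDr (40B) | 47.3 | 615.8 | 71.8 | 5.0 |
| MedGemma (27B) | 42.7 | 971.4 | 54.8 | 2.8 |
| Qwen2.5-VL (32B) | 33.4 | 130.5 🥈 | 7.5 | 0.1 |
| Llama3.2-Vision (11B) | 38.9 | 363.2 | 93.0 | 2.9 |
| LLaVA-OneVision (72B) | 39.5 | 530.8 | 97.3 | 4.8 |
| Gemma3 (27B) | 36.3 | 702.2 | 99.9 | 6.7 |
| HuatuoGPT-Vision (34B) | 55.4 | 1045.9 | 2.2 | 0.1 |
| InternVL3 (38B) | 31.0 🥈 | 553.0 | 100 | 13.7 🥈 |
| MedGemma (4B) | 35.7 | 301.1 🥉 | 91.4 | 6.0 |
| Qwen2.5-VL (7B) | 48.0 | 724.9 | 97.6 | 2.0 |

FeTA24 · Distance (100)

| Model | MAE | MRE | SR | MRE<0.1 |
|---|---|---|---|---|
| MedVision-V0 (7B) | 5.6 🥇 | 15.8 🥇 | 100 | 42.0 🥇 |
| HealthGPT-L14 (14B) | 28.6 🥈 | 160.3 | 70.0 | 7.0 |
| Lingshu (32B) | 43.5 | 148.4 🥉 | 100 | 0.0 |
| MedDr (40B) | 136.0 | 599.2 | 70.0 | 0.0 |
| MedGemma (27B) | 41.5 | 194.4 | 43.0 | 2.0 |
| Qwen2.5-VL (32B) | 1255.1 | 2515.8 | 31.0 | 0.0 |
| Llama3.2-Vision (11B) | 1198.5 | 3375.9 | 28.0 | 0.0 |
| LLaVA-OneVision (72B) | 9167.5 | 39550.6 | 100 | 12.0 🥈 |
| Gemma3 (27B) | 35.1 🥉 | 173.3 | 100 | 9.0 |
| HuatuoGPT-Vision (34B) | 111.7 | 397.8 | 59.0 | 1.0 |
| InternVL3 (38B) | 8559.1 | 42057.3 | 100 | 11.0 🥉 |
| MedGemma (4B) | 51.3 | 135.8 🥈 | 87.0 | 0.0 |
| Qwen2.5-VL (7B) | 13536.3 | 45568.5 | 81.0 | 0.0 |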

🔬 Pilot Study: Frontier API Models

Running API-served frontier VLMs across the entire benchmark is prohibitively costly: the test set spans multiple tasks, each with a large number of samples. We therefore conduct a pilot study that evaluates frontier API models on a small test subset per task (currently Tumor/Lesion Size only), reusing the exact prompts and samples from the full benchmark. The pilot study gauges how capable today’s frontier models are at quantitative medical image measurement, which can inform the design of agentic AI systems for biomedical applications.

Table 5: Pilot study on tumor/lesion size estimation using a small testing subset (750 samples). MAE in millimeters; MRE, SR, and MRE<0.1 in %. Cost is the total API evaluation spend in USD.

| Model | MAE | MRE | SR | MRE<0.1 | Cost |
|---|---|---|---|---|---|
| MedVision-V0 (7B) | 9.6 🥇 | 26.9 🥇 | 100.0 | 24.1 🥇 | $0 |
| Claude-Fable-5 | 12.5 | 46.5 | 100.0 | 23.7 | $63.9 |

BibTeX

@misc{yao2026medvisionbenchmarkingquantitativemedical,
    title={MedVision: Benchmarking Quantitative Medical Image Analysis}, 
    author={Yongcheng Yao and Yongshuo Zong and Raman Dutt and Yongxin Yang and Sotirios A Tsaftaris and Timothy Hospedales},
    year={2026},
    eprint={2511.18676},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.18676}, 
}