Highlights
- Research gap. Modern vision-language models (VLMs) cannot reliably produce precise quantitative measurements from medical images.
- Dataset. MedVision, a large-scale, multi-anatomy, multi-modality dataset for quantitative medical image analysis (22 public datasets, 30.8M image-annotation pairs).
- Benchmark. The first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement in medical images.
- Model. MedVision-V0, a 7B model trained on MedVision via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT); it significantly outperforms all evaluated VLMs across all three tasks, providing a strong, open baseline.
- Open release. Data, model, and code (training and evaluation) are all publicly available.
🎯 Problem & Tasks
Clinical decisions rely on quantitative assessment: measuring a tumor to stage disease, a joint angle to plan surgery, or an anatomical distance to track development. We therefore target a concrete model ability: given a medical image, produce precise numeric measurements in real-world physical units (millimeters and degrees, not pixels).
MedVision evaluates this ability across three quantitative tasks:
1️⃣ Detection
Localize healthy anatomical structures and abnormalities with bounding boxes.
2️⃣ Tumor/Lesion Size
Estimate the longest diameter (major axis) and its perpendicular diameter (minor axis) of a tumor/lesion, reported in millimeters.
3️⃣ Angle/Distance
Measure angles (degrees) and distances (mm) from anatomical landmarks.
Figure 1: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
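The ellipse-based size annotation can be illustrated with a minimal sketch. The page does not say how the ellipse is fitted, so the moment-matching fit below is an assumption, as are the function name `ellipse_axes_mm` and the `spacing_mm` argument (physical pixel size per axis). It shows how a binary mask plus pixel spacing yields major and minor diameters in millimeters rather than pixels.

```python
import numpy as np


def ellipse_axes_mm(mask, spacing_mm):
    """Return (major_mm, minor_mm) of a moment-matched ellipse for a 2D binary mask.

    Assumption: the ellipse is chosen so that its second moments match those
    of the mask. The original annotation pipeline may fit it differently.
    """
    rows, cols = np.nonzero(mask)
    if rows.size < 2:
        raise ValueError("mask needs at least two foreground pixels")

    # Convert pixel indices to physical coordinates in millimeters.
    points = np.stack([rows * spacing_mm[0], cols * spacing_mm[1]], axis=1)

    # Eigenvalues of the coordinate covariance are the variances along the
    # principal axes, returned in ascending order.
    variances = np.linalg.eigvalsh(np.cov(points, rowvar=False))

    # A filled ellipse with semi-axis a has variance a^2 / 4 along that axis,
    # so the full axis length is 4 * sqrt(variance).
    minor, major = 4.0 * np.sqrt(np.clip(variances, 0.0, None))
    return float(major), float(minor)
```

For a filled ellipse of 80 by 40 pixels at 0.5 mm spacing, this sketch returns roughly 40 mm and 20 mm, up to discretization error.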
Figure 2: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are computed from these landmarks.
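As a rough illustration of how ground-truth measurements can be derived from landmarks, the sketch below computes a distance in millimeters and a three-point angle in degrees. The helper names `distance_mm` and `angle_deg` and the `spacing_mm` argument are hypothetical, and the three-point angle is only one possible definition; the datasets may define some angles between two lines through four landmarks.

```python
import numpy as np


def distance_mm(p, q, spacing_mm):
    """Euclidean distance between two pixel landmarks, in millimeters."""
    p, q, s = (np.asarray(v, dtype=float) for v in (p, q, spacing_mm))
    # Scale the pixel offset by the physical pixel size before taking the norm.
    return float(np.linalg.norm((p - q) * s))


def angle_deg(a, vertex, b, spacing_mm):
    """Angle at `vertex` between rays vertex->a and vertex->b, in degrees."""
    a, vertex, b, s = (np.asarray(v, dtype=float) for v in (a, vertex, b, spacing_mm))
    u = (a - vertex) * s
    v = (b - vertex) * s
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))
```

Scaling by the spacing before computing the angle matters when pixels are anisotropic, since an angle measured in raw pixel coordinates would then be distorted.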
Benchmark Results
Figure 3: Per-label performance of MedVision-V0 and off-the-shelf VLMs: (a) detection recall / precision / F1, (b) tumor/lesion size MRE, and (c) angle/distance MRE.
MedVision-V0 outperforms all 12 evaluated off-the-shelf VLMs across all three quantitative task families. Each task below leads with the full leaderboard (🥇/🥈/🥉 mark the best three per column), followed by an interactive viewer of real predictions: the complete prompt, the model's chain-of-thought response, and the error metrics, shown beside the image with a ground-truth-vs-prediction overlay.
1️⃣ Detection
Table 2: Detection performance (%), grouped into anatomy and tumor/lesion targets. R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.
| Model | Anatomy (18 regions, 13.4K) | | | | | | Tumor/Lesion (8 regions, 8.5K) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | R ↑ | P ↑ | F1 ↑ | IoU ↑ | SR ↑ | IoU>0.5 ↑ | R ↑ | P ↑ | F1 ↑ | IoU ↑ | SR ↑ | IoU>0.5 ↑ |
| MedVision-V0 (7B) | 81.3 🥇 | 80.4 🥇 | 79.1 🥇 | 72.0 🥇 | 100 | 80.1 🥇 | 52.4 | 50.5 🥇 | 46.9 🥇 | 38.2 🥇 | 100 | 40.7 🥇 |
| Lingshu (32B) | 37.4 | 20.2 🥈 | 20.2 🥈 | 13.7 🥈 | 100 | 6.7 🥈 | 40.2 | 6.0 | 8.6 🥈 | 5.1 🥈 | 100 | 0.2 |
| MedGemma (27B) | 56.4 | 15.5 | 18.8 🥉 | 12.7 🥉 | 97.1 | 6.6 🥉 | 52.7 | 4.5 | 7.4 | 4.2 | 94.4 | 0.1 |
| MedGemma (4B) | 68.6 🥉 | 14.6 | 18.5 | 12.4 | 98.2 | 6.4 | 77.6 🥇 | 4.4 | 7.4 | 4.2 | 99.1 | 0.0 |
| Qwen2.5-VL (32B) | 44.8 | 14.9 | 18.4 | 12.5 | 100 | 6.3 | 38.5 | 5.7 | 7.7 | 4.7 | 100 | 0.6 🥉 |
| LLaVA-OneVision (72B) | 34.9 | 19.0 | 18.1 | 11.8 | 100 | 2.4 | 34.1 | 6.3 🥉 | 8.4 🥉 | 5.0 🥉 | 100 | 0.4 |
| InternVL3 (38B) | 31.1 | 17.0 | 17.2 | 11.5 | 100 | 5.3 | 29.5 | 6.6 🥈 | 7.9 | 4.9 | 100 | 0.8 🥈 |
| Qwen2.5-VL (7B) | 69.6 🥈 | 12.2 | 16.7 | 11.3 | 99.3 | 5.6 | 77.4 🥈 | 3.8 | 6.5 | 3.6 | 99.6 | 0.0 |
| Gemma3 (27B) | 37.1 | 12.4 | 14.9 | 10.1 | 100 | 4.6 | 34.3 | 4.3 | 6.1 | 3.6 | 100 | 0.3 |
| HealthGPT-L14 (14B) | 27.3 | 19.4 🥉 | 14.9 | 9.5 | 92.0 | 1.7 | 25.6 | 5.9 | 7.1 | 4.4 | 82.6 | 0.5 |
| MedDr (40B) | 53.2 | 11.1 | 14.6 | 9.6 | 96.2 | 4.1 | 63.2 🥉 | 3.7 | 6.2 | 3.5 | 98.5 | 0.1 |
| HuatuoGPT-Vision (34B) | 21.2 | 14.1 | 12.3 | 8.0 | 80.0 | 2.2 | 17.8 | 4.0 | 5.1 | 3.1 | 76.6 | 0.3 |
| Llama3.2-Vision (11B) | 41.9 | 8.6 | 10.7 | 7.1 | 70.1 | 2.5 | 43.4 | 2.3 | 3.8 | 2.1 | 68.6 | 0.0 |
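The detection metrics can be made concrete with a small sketch for a single predicted box and a single ground-truth box. IoU is standard, but the page does not define recall and precision for boxes; the area-based definitions below (intersection over ground-truth area, and intersection over predicted area) are an assumption, chosen because they are consistent with the high-recall, low-precision pattern in the table that oversized boxes would produce. The function name `box_metrics` is hypothetical.

```python
def box_metrics(pred, gt):
    """Area-based metrics for two boxes given as (x_min, y_min, x_max, y_max).

    Assumption: recall = intersection / ground-truth area and
    precision = intersection / predicted area. The benchmark may define
    these differently.
    """
    # Overlap extent along each axis, zero when the boxes are disjoint.
    overlap_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    overlap_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = overlap_w * overlap_h

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter

    recall = inter / area_gt if area_gt > 0 else 0.0
    precision = inter / area_pred if area_pred > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    iou = inter / union if union > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}
```

Under these definitions, a prediction that covers the whole image reaches a recall of 1.0 while precision and IoU stay low, which is why the IoU>0.5 column is the stricter indicator of localization quality.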
2️⃣ Tumor/Lesion Size
Table 3: Tumor/lesion size estimation (2K samples). MAE in millimeters; MRE, SR, and MRE<0.1 in %.
| Model | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ |
|---|---|---|---|---|
| MedVision-V0 (7B) | 10.5 🥇 | 26.0 🥇 | 100.0 | 23.5 🥇 |
| Lingshu (32B) | 35.7 🥈 | 118.6 🥈 | 99.5 | 4.5 🥈 |
| HealthGPT-L14 (14B) | 49.9 | 168.6 | 100.0 | 3.3 🥉 |
| HuatuoGPT-Vision (34B) | 44.4 🥉 | 142.4 🥉 | 14.6 | 0.7 |
| Llama3.2-Vision (11B) | 77.1 | 248.2 | 25.3 | 0.4 |
| MedDr (40B) | 97.7 | 312.7 | 63.4 | 0.4 |
| Gemma3 (27B) | 226.0 | 611.8 | 98.9 | 0.5 |
| MedGemma (27B) | 547.6 | 1772.6 | 52.5 | 0.7 |
| LLaVA-OneVision (72B) | 1016.8 | 3271.6 | 100.0 | 1.4 |
| Qwen2.5-VL (7B) | 2933.9 | 7738.9 | 95.5 | 0.7 |
| Qwen2.5-VL (32B) | 2721.5 | 10471.5 | 16.5 | 0.2 |
| InternVL3 (38B) | 7703.6 | 23307.5 | 100.0 | 0.2 |
| MedGemma (4B) | 728794.1 | 2293400.0 | 86.0 | 0.1 |
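The measurement metrics in the table can be sketched as follows. The page does not spell out the definitions, so the ones used here are assumptions: SR is the share of samples with a parseable numeric answer, MAE and MRE are averaged over those successful answers, and MRE<0.1 is the share of all samples whose relative error is below 0.1. The function name `measurement_metrics` and the use of `None` for a failed response are hypothetical.

```python
import math


def measurement_metrics(preds, targets, tol=0.1):
    """Summarize numeric predictions against positive ground-truth values.

    `preds` holds floats, or None when no number could be parsed from the
    model response. All definitions here are assumptions, not the
    benchmark's confirmed formulas.
    """
    if not targets or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")

    abs_errors, rel_errors = [], []
    for pred, target in zip(preds, targets):
        if pred is None or not math.isfinite(pred):
            continue  # counted as a failure in the success rate
        abs_errors.append(abs(pred - target))
        rel_errors.append(abs(pred - target) / abs(target))

    n_total, n_ok = len(targets), len(abs_errors)
    return {
        "MAE": sum(abs_errors) / n_ok if n_ok else float("nan"),
        "MRE_pct": 100.0 * sum(rel_errors) / n_ok if n_ok else float("nan"),
        "SR_pct": 100.0 * n_ok / n_total,
        "MRE_below_tol_pct": 100.0 * sum(e < tol for e in rel_errors) / n_total,
    }
```

Because MRE is an unbounded mean, a handful of answers given in the wrong unit or scale can push it into the thousands of percent, whereas the thresholded MRE<0.1 rate is bounded and less sensitive to such outliers.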
3️⃣ Angle/Distance
Table 4: Angle/distance measurement across all 12 off-the-shelf VLMs and MedVision-V0, for each sub-task. MAE in millimeters (distance) and degrees (angle); MRE, SR, and MRE<0.1 in %.
| Model | Ceph-Bio-400 · Distance (1000) | | | | Ceph-Bio-400 · Angle (957) | | | | FeTA24 · Distance (100) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ |
| MedVision-V0 (7B) | 3.4 🥇 | 5.4 🥇 | 100 | 85.3 🥇 | 4.7 🥇 | 52.1 🥇 | 99.9 | 52.0 🥇 | 5.6 🥇 | 15.8 🥇 | 100 | 42.0 🥇 |
| HealthGPT-L14 (14B) | 19.5 🥈 | 29.7 🥈 | 95.6 | 24.1 🥈 | 32.8 🥉 | 727.3 | 74.9 | 8.9 🥉 | 28.6 🥈 | 160.3 | 70.0 | 7.0 |
| Lingshu (32B) | 214.4 | 257.6 | 100 | 23.5 🥉 | 35.0 | 512.5 | 100 | 6.3 | 43.5 | 148.4 🥉 | 100 | 0.0 |
| MedDr (40B) | 110.1 | 175.4 | 60.4 | 5.0 | 47.3 | 615.8 | 71.8 | 5.0 | 136.0 | 599.2 | 70.0 | 0.0 |
| MedGemma (27B) | 28.8 🥉 | 48.0 🥉 | 33.5 | 4.7 | 42.7 | 971.4 | 54.8 | 2.8 | 41.5 | 194.4 | 43.0 | 2.0 |
| Qwen2.5-VL (32B) | 594.7 | 1022.1 | 8.7 | 0.5 | 33.4 | 130.5 🥈 | 7.5 | 0.1 | 1255.1 | 2515.8 | 31.0 | 0.0 |
| Llama3.2-Vision (11B) | 1726.6 | 2948.5 | 17.1 | 0.3 | 38.9 | 363.2 | 93.0 | 2.9 | 1198.5 | 3375.9 | 28.0 | 0.0 |
| LLaVA-OneVision (72B) | 660.4 | 1084.9 | 99.9 | 6.4 | 39.5 | 530.8 | 97.3 | 4.8 | 9167.5 | 39550.6 | 100 | 12.0 🥈 |
| Gemma3 (27B) | 5563.4 | 7261.7 | 98.4 | 13.5 | 36.3 | 702.2 | 99.9 | 6.7 | 35.1 🥉 | 173.3 | 100 | 9.0 |
| HuatuoGPT-Vision (34B) | 9607.1 | 18392.7 | 75.3 | 4.1 | 55.4 | 1045.9 | 2.2 | 0.1 | 111.7 | 397.8 | 59.0 | 1.0 |
| InternVL3 (38B) | 14900.1 | 20754.9 | 99.7 | 6.7 | 31.0 🥈 | 553.0 | 100 | 13.7 🥈 | 8559.1 | 42057.3 | 100 | 11.0 🥉 |
| MedGemma (4B) | 16767.3 | 27429.4 | 95.4 | 0.1 | 35.7 | 301.1 🥉 | 91.4 | 6.0 | 51.3 | 135.8 🥈 | 87.0 | 0.0 |
| Qwen2.5-VL (7B) | 68610.5 | 101639.4 | 100 | 0.5 | 48.0 | 724.9 | 97.6 | 2.0 | 13536.3 | 45568.5 | 81.0 | 0.0 |
🔬 Pilot Study: Frontier API Models
Running API-served frontier VLMs across the entire benchmark is prohibitively costly, because the test set spans multiple tasks, each with a large number of samples. We therefore conduct a pilot study that evaluates frontier API models on a small test subset per task (Tumor/Lesion Size for now), reusing the exact prompts and samples from the full benchmark. The pilot study gauges how capable today's frontier models are at quantitative medical image measurement, which can inform the design of agentic AI systems for biomedical applications.
Table 5: Pilot study on tumor/lesion size estimation using a small testing subset (750 samples). MAE in millimeters; MRE, SR, and MRE<0.1 in %. Cost is the total API evaluation spend in USD.
| Model | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ | Cost |
|---|---|---|---|---|---|
| MedVision-V0 (7B) | 9.6 🥇 | 26.9 🥇 | 100.0 | 24.1 🥇 | $0 |
| Claude-Fable-5 | 12.5 | 46.5 | 100.0 | 23.7 | $63.9 |
MedVision: Benchmarking Quantitative Medical Image Analysis