Highlights
- Research gap. Modern vision-language models (VLMs) cannot reliably produce precise quantitative measurements from medical images.
- Dataset. MedVision: a large-scale, multi-anatomy, multi-modality dataset for quantitative medical image analysis (22 public datasets, 30.8M image-annotation pairs).
- Benchmark. The first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion (T/L) size estimation, and angle/distance (A/D) measurement in medical images.
- Model. MedVision-V0, a 7B model trained on MedVision via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT); it significantly outperforms all evaluated VLMs across all three tasks, making it a strong, open baseline.
- Open release. Data, model, and code (training and evaluation) are all publicly available.
Problem & Tasks
Clinical decisions rely on quantitative assessment: measuring a tumor to stage disease, a joint angle to plan surgery, or an anatomical distance to track development. We therefore target a concrete model ability: given a medical image, produce precise numeric measurements in real-world physical units (millimeters and degrees, not pixels).
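Reporting in physical units means that any measurement taken in pixel space has to be scaled by the image's pixel spacing. The sketch below illustrates that conversion for the distance between two points. The function name `pixel_distance_to_mm` and the `spacing_mm` argument are hypothetical; how MedVision stores or applies spacing is not described here, so treat this as an assumption-laden illustration only.

```python
import math

def pixel_distance_to_mm(p, q, spacing_mm):
    """Distance in millimeters between two pixel coordinates.

    p, q: (row, col) pixel coordinates.
    spacing_mm: (row_spacing, col_spacing), the physical size of one pixel
        in mm. Hypothetical argument; in practice this would come from the
        image metadata.
    """
    # Scale each axis separately so anisotropic pixels are handled correctly.
    d_row = (p[0] - q[0]) * spacing_mm[0]
    d_col = (p[1] - q[1]) * spacing_mm[1]
    return math.hypot(d_row, d_col)

# A 30 x 40 pixel offset with 0.5 mm isotropic pixels.
print(pixel_distance_to_mm((10, 20), (40, 60), (0.5, 0.5)))
```

Scaling each axis before taking the norm matters when rows and columns have different spacing; scaling the pixel distance by a single factor would be wrong in that case.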
MedVision evaluates this ability across three quantitative tasks:
1. Detection
Localize healthy anatomical structures and abnormalities with bounding boxes.
2. Tumor/Lesion Size
Estimate the longest diameter (major axis) and its perpendicular diameter (minor axis) of a tumor/lesion, reported in millimeters.
3. Angle/Distance
Measure angles (degrees) and distances (mm) from anatomical landmarks.
Figure 1: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
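The caption above does not say which ellipse-fitting method is used. As one possible illustration, the sketch below derives major and minor axis lengths from the second-order moments of a binary mask (a moment-matched ellipse), which is a swapped-in technique and not necessarily the authors' procedure. The function name `ellipse_axes_mm` and the `spacing_mm` argument are hypothetical.

```python
import math

def ellipse_axes_mm(mask, spacing_mm=(1.0, 1.0)):
    """Major and minor axis lengths (mm) of the moment-matched ellipse.

    mask: 2D list of 0/1 values (rows x cols).
    spacing_mm: (row, col) pixel size in mm; hypothetical argument.
    """
    # Foreground pixel coordinates, converted to physical units.
    pts = [(r * spacing_mm[0], c * spacing_mm[1])
           for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    if n == 0:
        raise ValueError("mask has no foreground pixels")
    mean_r = sum(p[0] for p in pts) / n
    mean_c = sum(p[1] for p in pts) / n
    # Second central moments (covariance of the foreground coordinates).
    s_rr = sum((p[0] - mean_r) ** 2 for p in pts) / n
    s_cc = sum((p[1] - mean_c) ** 2 for p in pts) / n
    s_rc = sum((p[0] - mean_r) * (p[1] - mean_c) for p in pts) / n
    # Closed-form eigenvalues of the 2x2 covariance matrix.
    half_trace = (s_rr + s_cc) / 2
    root = math.sqrt(((s_rr - s_cc) / 2) ** 2 + s_rc ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    # A filled ellipse with semi-axis a has variance a^2 / 4 along that axis,
    # so the full axis length is 4 * sqrt(eigenvalue).
    return 4 * math.sqrt(lam_major), 4 * math.sqrt(lam_minor)

# Synthetic mask: a filled ellipse with semi-axes 20 (columns) and 10 (rows).
mask = [[1 if ((c - 30) / 20) ** 2 + ((r - 15) / 10) ** 2 <= 1 else 0
         for c in range(61)] for r in range(31)]
major, minor = ellipse_axes_mm(mask)
print(f"major ~ {major:.1f} mm, minor ~ {minor:.1f} mm")
```

On the synthetic mask the two lengths should come out close to 40 and 20, up to discretization error. The four recorded landmarks are presumably the endpoints of the two axes, but that is an assumption and the sketch does not compute them.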
Figure 2: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are computed from these landmarks.
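To make "computed from these landmarks" concrete, here is a minimal sketch of an angle measurement defined by three landmarks, with the angle taken at the middle one. The function name `angle_deg` and the three-point definition are assumptions; some clinical angles are defined between two lines through four landmarks, and the exact definitions used for Ceph-Bio-400 are not given here. Coordinates are assumed to be in physical units already (pixel coordinates scaled by spacing), since an angle measured in anisotropic pixel space would be distorted.

```python
import math

def angle_deg(vertex, a, b):
    """Angle in degrees at `vertex` between the rays vertex->a and vertex->b.

    All points are (x, y) tuples in physical units (e.g. mm).
    """
    ux, uy = a[0] - vertex[0], a[1] - vertex[1]
    vx, vy = b[0] - vertex[0], b[1] - vertex[1]
    if (ux == 0 and uy == 0) or (vx == 0 and vy == 0):
        raise ValueError("landmarks must be distinct from the vertex")
    # atan2(|cross|, dot) stays numerically stable near 0 and 180 degrees,
    # unlike acos of the normalized dot product.
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.degrees(math.atan2(abs(cross), dot))

# Hypothetical landmark coordinates in mm.
print(angle_deg((0.0, 0.0), (10.0, 0.0), (7.0, 7.0)))
```

The result is an unsigned angle in the range 0 to 180 degrees, which matches how such measurements are usually reported.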
Benchmark Results
Figure 3: Per-label performance of MedVision-V0 and off-the-shelf VLMs: (a) detection recall / precision / F1, (b) tumor/lesion size MRE, and (c) angle/distance MRE.
MedVision-V0 outperforms all 12 evaluated off-the-shelf VLMs across all three quantitative task families. It leads every medal-ranked column except tumor/lesion detection recall (52.4%, behind four models). Each task below leads with the full leaderboard (🥇/🥈/🥉 mark the best three per column), followed by an interactive viewer of real predictions: the complete prompt, the model's chain-of-thought response, and the error metrics, shown beside the image with a ground-truth-vs-prediction overlay.
1. Detection
Table 2: Detection performance (%), grouped into anatomy targets (18 regions, 13.4K) and tumor/lesion (T/L) targets (8 regions, 8.5K). R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.
| Model | Anatomy R ↑ | Anatomy P ↑ | Anatomy F1 ↑ | Anatomy IoU ↑ | Anatomy SR ↑ | Anatomy IoU>0.5 ↑ | T/L R ↑ | T/L P ↑ | T/L F1 ↑ | T/L IoU ↑ | T/L SR ↑ | T/L IoU>0.5 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 81.3 🥇 | 80.4 🥇 | 79.1 🥇 | 72.0 🥇 | 100 | 80.1 🥇 | 52.4 | 50.5 🥇 | 46.9 🥇 | 38.2 🥇 | 100 | 40.7 🥇 |
| Lingshu (32B) | 37.4 | 20.2 🥈 | 20.2 🥈 | 13.7 🥈 | 100 | 6.7 🥈 | 40.2 | 6.0 | 8.6 🥈 | 5.1 🥈 | 100 | 0.2 |
| MedGemma (27B) | 56.4 | 15.5 | 18.8 🥉 | 12.7 🥉 | 97.1 | 6.6 🥉 | 52.7 | 4.5 | 7.4 | 4.2 | 94.4 | 0.1 |
| MedGemma (4B) | 68.6 🥉 | 14.6 | 18.5 | 12.4 | 98.2 | 6.4 | 77.6 🥇 | 4.4 | 7.4 | 4.2 | 99.1 | 0.0 |
| Qwen2.5-VL (32B) | 44.8 | 14.9 | 18.4 | 12.5 | 100 | 6.3 | 38.5 | 5.7 | 7.7 | 4.7 | 100 | 0.6 🥉 |
| LLaVA-OneVision (72B) | 34.9 | 19.0 | 18.1 | 11.8 | 100 | 2.4 | 34.1 | 6.3 🥉 | 8.4 🥉 | 5.0 🥉 | 100 | 0.4 |
| InternVL3 (38B) | 31.1 | 17.0 | 17.2 | 11.5 | 100 | 5.3 | 29.5 | 6.6 🥈 | 7.9 | 4.9 | 100 | 0.8 🥈 |
| Qwen2.5-VL (7B) | 69.6 🥈 | 12.2 | 16.7 | 11.3 | 99.3 | 5.6 | 77.4 🥈 | 3.8 | 6.5 | 3.6 | 99.6 | 0.0 |
| Gemma3 (27B) | 37.1 | 12.4 | 14.9 | 10.1 | 100 | 4.6 | 34.3 | 4.3 | 6.1 | 3.6 | 100 | 0.3 |
| HealthGPT-L14 (14B) | 27.3 | 19.4 🥉 | 14.9 | 9.5 | 92.0 | 1.7 | 25.6 | 5.9 | 7.1 | 4.4 | 82.6 | 0.5 |
| MedDr (40B) | 53.2 | 11.1 | 14.6 | 9.6 | 96.2 | 4.1 | 63.2 🥉 | 3.7 | 6.2 | 3.5 | 98.5 | 0.1 |
| HuatuoGPT-Vision (34B) | 21.2 | 14.1 | 12.3 | 8.0 | 80.0 | 2.2 | 17.8 | 4.0 | 5.1 | 3.1 | 76.6 | 0.3 |
| Llama3.2-Vision (11B) | 41.9 | 8.6 | 10.7 | 7.1 | 70.1 | 2.5 | 43.4 | 2.3 | 3.8 | 2.1 | 68.6 | 0.0 |
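The table does not spell out how recall and precision are defined for a single predicted box. One plausible reading, used in the sketch below purely as an assumption, is area overlap: recall is the share of the ground-truth box covered by the prediction, and precision is the share of the predicted box that falls on the ground truth. The function name `box_metrics` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical.

```python
def box_metrics(pred, gt):
    """Area-overlap metrics for two axis-aligned boxes.

    pred, gt: (x_min, y_min, x_max, y_max). The area-based definitions of
    recall and precision are an assumption, not the benchmark's stated ones.
    """
    # Width and height of the intersection rectangle (0 if boxes are disjoint).
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    recall = inter / area_gt if area_gt > 0 else 0.0
    precision = inter / area_pred if area_pred > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    iou = inter / union if union > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "iou": iou}

# An oversized prediction that fully contains a smaller ground-truth box.
print(box_metrics(pred=(0, 0, 20, 20), gt=(0, 0, 10, 10)))
```

Under this reading, a model that predicts overly large boxes gets high recall but low precision and IoU, a pattern that would be consistent with several rows in the table, although the table alone cannot confirm the cause.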
2. Tumor/Lesion Size
Table 3: Tumor/lesion size estimation (2K samples). MAE (mean absolute error) in millimeters; MRE (mean relative error), SR (success rate), and MRE<0.1 in %.
| Model | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ |
|---|---|---|---|---|
| MedVision-V0 (7B) | 10.5 🥇 | 26.0 🥇 | 100.0 | 23.5 🥇 |
| Lingshu (32B) | 35.7 🥈 | 118.6 🥈 | 99.5 | 4.5 🥈 |
| HuatuoGPT-Vision (34B) | 44.4 🥉 | 142.4 🥉 | 14.6 | 0.7 |
| HealthGPT-L14 (14B) | 49.9 | 168.6 | 100.0 | 3.3 🥉 |
| Llama3.2-Vision (11B) | 77.1 | 248.2 | 25.3 | 0.4 |
| MedDr (40B) | 97.7 | 312.7 | 63.4 | 0.4 |
| Gemma3 (27B) | 226.0 | 611.8 | 98.9 | 0.5 |
| MedGemma (27B) | 547.6 | 1772.6 | 52.5 | 0.7 |
| LLaVA-OneVision (72B) | 1016.8 | 3271.6 | 100.0 | 1.4 |
| Qwen2.5-VL (7B) | 2933.9 | 7738.9 | 95.5 | 0.7 |
| Qwen2.5-VL (32B) | 2721.5 | 10471.5 | 16.5 | 0.2 |
| InternVL3 (38B) | 7703.6 | 23307.5 | 100.0 | 0.2 |
| MedGemma (4B) | 728794.1 | 2293400.0 | 86.0 | 0.1 |
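For readers who want to reproduce the column definitions, the sketch below computes MAE, MRE, SR, and the MRE<0.1 share from a list of predictions. Several choices here are assumptions and not taken from the benchmark: a missing answer is represented as `None`, errors are averaged over parsed answers only, and the MRE<0.1 share is taken over all samples. The function name `measurement_metrics` is hypothetical.

```python
def measurement_metrics(predictions, targets, threshold=0.1):
    """MAE, MRE (%), success rate (%), and share of samples with
    relative error below `threshold` (%).

    predictions: list of floats, or None where no answer could be parsed.
    targets: list of nonzero ground-truth values in the same units.
    Averaging conventions are assumptions (see the surrounding text).
    """
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")
    pairs = [(p, t) for p, t in zip(predictions, targets) if p is not None]
    n_total, n_valid = len(targets), len(pairs)
    abs_err = [abs(p - t) for p, t in pairs]
    rel_err = [abs(p - t) / abs(t) for p, t in pairs]
    nan = float("nan")
    return {
        "mae": sum(abs_err) / n_valid if n_valid else nan,
        "mre_pct": 100 * sum(rel_err) / n_valid if n_valid else nan,
        "sr_pct": 100 * n_valid / n_total,
        "within_threshold_pct": 100 * sum(e < threshold for e in rel_err) / n_total,
    }

# Hypothetical diameters in mm; the third response could not be parsed.
preds = [10.0, 36.0, None, 21.0]
truth = [10.5, 30.0, 12.0, 20.0]
print(measurement_metrics(preds, truth))
```

Because relative error divides by the ground truth, a few wildly wrong answers (for example, a value reported in the wrong unit) can push MRE into the thousands of percent, which is one possible explanation for the very large values in the lower rows.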
3. Angle/Distance
Table 4: Angle/distance measurement for MedVision-V0 and all 12 off-the-shelf VLMs on three sub-tasks: Ceph-Bio-400 distance (1000), Ceph-Bio-400 angle (957), and FeTA24 distance (100). MAE in millimeters (distance) and degrees (angle); MRE, SR, and MRE<0.1 in %.
| Model | Ceph-Bio-400 Dist. MAE ↓ | Ceph-Bio-400 Dist. MRE ↓ | Ceph-Bio-400 Dist. SR ↑ | Ceph-Bio-400 Dist. MRE<0.1 ↑ | Ceph-Bio-400 Angle MAE ↓ | Ceph-Bio-400 Angle MRE ↓ | Ceph-Bio-400 Angle SR ↑ | Ceph-Bio-400 Angle MRE<0.1 ↑ | FeTA24 Dist. MAE ↓ | FeTA24 Dist. MRE ↓ | FeTA24 Dist. SR ↑ | FeTA24 Dist. MRE<0.1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MedVision-V0 (7B) | 3.4 🥇 | 5.4 🥇 | 100 | 85.3 🥇 | 4.7 🥇 | 52.1 🥇 | 99.9 | 52.0 🥇 | 5.6 🥇 | 15.8 🥇 | 100 | 42.0 🥇 |
| HealthGPT-L14 (14B) | 19.5 🥈 | 29.7 🥈 | 95.6 | 24.1 🥈 | 32.8 🥉 | 727.3 | 74.9 | 8.9 🥉 | 28.6 🥈 | 160.3 | 70.0 | 7.0 |
| Lingshu (32B) | 214.4 | 257.6 | 100 | 23.5 🥉 | 35.0 | 512.5 | 100 | 6.3 | 43.5 | 148.4 🥉 | 100 | 0.0 |
| MedDr (40B) | 110.1 | 175.4 | 60.4 | 5.0 | 47.3 | 615.8 | 71.8 | 5.0 | 136.0 | 599.2 | 70.0 | 0.0 |
| MedGemma (27B) | 28.8 🥉 | 48.0 🥉 | 33.5 | 4.7 | 42.7 | 971.4 | 54.8 | 2.8 | 41.5 | 194.4 | 43.0 | 2.0 |
| Qwen2.5-VL (32B) | 594.7 | 1022.1 | 8.7 | 0.5 | 33.4 | 130.5 🥈 | 7.5 | 0.1 | 1255.1 | 2515.8 | 31.0 | 0.0 |
| Llama3.2-Vision (11B) | 1726.6 | 2948.5 | 17.1 | 0.3 | 38.9 | 363.2 | 93.0 | 2.9 | 1198.5 | 3375.9 | 28.0 | 0.0 |
| LLaVA-OneVision (72B) | 660.4 | 1084.9 | 99.9 | 6.4 | 39.5 | 530.8 | 97.3 | 4.8 | 9167.5 | 39550.6 | 100 | 12.0 🥈 |
| Gemma3 (27B) | 5563.4 | 7261.7 | 98.4 | 13.5 | 36.3 | 702.2 | 99.9 | 6.7 | 35.1 🥉 | 173.3 | 100 | 9.0 |
| HuatuoGPT-Vision (34B) | 9607.1 | 18392.7 | 75.3 | 4.1 | 55.4 | 1045.9 | 2.2 | 0.1 | 111.7 | 397.8 | 59.0 | 1.0 |
| InternVL3 (38B) | 14900.1 | 20754.9 | 99.7 | 6.7 | 31.0 🥈 | 553.0 | 100 | 13.7 🥈 | 8559.1 | 42057.3 | 100 | 11.0 🥉 |
| MedGemma (4B) | 16767.3 | 27429.4 | 95.4 | 0.1 | 35.7 | 301.1 🥉 | 91.4 | 6.0 | 51.3 | 135.8 🥈 | 87.0 | 0.0 |
| Qwen2.5-VL (7B) | 68610.5 | 101639.4 | 100 | 0.5 | 48.0 | 724.9 | 97.6 | 2.0 | 13536.3 | 45568.5 | 81.0 | 0.0 |
MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis