🌟 Highlights
- Research Gap: modern VLMs cannot reliably produce precise quantitative measurements from medical images.
- Dataset: MedVision, a large-scale, multi-modality dataset for quantitative medical image analysis, covering 22 public datasets and 30.8M images with structured measurement annotations.
- Benchmark: the first comprehensive evaluation of contemporary VLMs on detection, tumor/lesion size estimation, and angle/distance measurement.
- Supervised Finetuning (SFT): SFT can improve the performance of VLMs on quantitative medical image analysis tasks.
- Code and Models: data, model checkpoints, and code (training and evaluation) are available.
📀 Dataset
MedVision includes 22 public datasets and 30.8M image-annotation pairs. The dataset is available here. Details are as follows:
Table 1: The MedVision dataset consists of public medical images and quantitative annotations from this study. MRI: Magnetic Resonance Imaging; CT: Computed Tomography; PET: Positron Emission Tomography; US: Ultrasound; b-box: bounding box; T/L: tumor/lesion size; A/D: angle/distance; HF: HuggingFace; GC: Grand-Challenge; † redistributed.
| Dataset | Anatomy | Modality | Annotation | Availability | Source | # b-box (Train / Test) | # T/L (Train / Test) | # A/D (Train / Test) |
|---|---|---|---|---|---|---|---|---|
| AbdomenAtlas | abdomen | CT | b-box | open | HF | 6.8 / 2.9M | 0 | 0 |
| AbdomenCT-1K | abdomen | CT | b-box | open | Zenodo | 0.7 / 0.3M | 0 | 0 |
| ACDC | heart | MRI | b-box | open | HF†, others | 9.5 / 4.8K | 0 | 0 |
| AMOS22 | abdomen | CT, MRI | b-box | open | Zenodo | 0.8 / 0.3M | 0 | 0 |
| autoPET-III | whole body | CT, PET | b-box, T/L | open | HF†, others | 22 / 9.7K | 0.5 / 0.2K | 0 |
| BCV15 | abdomen | CT | b-box | open | HF†, Synapse | 71 / 30K | 0 | 0 |
| BraTS24 | brain | MRI | b-box, T/L | open | HF†, Synapse | 0.8 / 0.3M | 7.9 / 3.1K | 0 |
| CAMUS | heart | US | b-box | open | HF†, others | 0.7 / 0.3M | 0 | 0 |
| Ceph-Bio-400 | head and neck | X-ray | b-box, A/D | open | HF†, others | 0 | 0 | 5.3 / 2.3K |
| CrossModDA | brain | MRI | b-box | open | HF†, Zenodo | 3.0 / 1.0K | 0 | 0 |
| FeTA24 | fetal brain | MRI | b-box, A/D | registration | Synapse | 34 / 15K | 0 | 0.2 / 0.1K |
| FLARE22 | abdomen | CT | b-box | open | HF†, others | 72 / 33K | 0 | 0 |
| HNTSMRG24 | head and neck | MRI | b-box, T/L | open | Zenodo | 18 / 6.6K | 1.0 / 0.4K | 0 |
| ISLES24 | brain | MRI | b-box | open | HF†, GC | 7.3 / 2.5K | 0 | 0 |
| KiPA22 | kidney | CT | b-box, T/L | open | HF†, GC | 26 / 11K | 2.1 / 1.0K | 0 |
| KiTS23 | kidney | CT | b-box, T/L | open | HF†, GC | 80 / 35K | 5.9 / 2.6K | 0 |
| MSD | multiple | CT, MRI | b-box, T/L | open | others | 0.2 / 0.1M | 5.3 / 2.2K | 0 |
| OAIZIB-CM | knee | MRI | b-box | open | HF | 0.5 / 0.2M | 0 | 0 |
| SKM-TEA | knee | MRI | b-box | registration | others | 0.2 / 0.1M | 0 | 0 |
| ToothFairy2 | tooth | CT | b-box | registration | others | 1.0 / 0.4M | 0 | 0 |
| TopCoW24 | brain | CT, MRI | b-box | open | HF†, Zenodo | 43 / 20K | 0 | 0 |
| TotalSegmentator | multiple | CT, MRI | b-box | open | HF†, Zenodo | 9.6 / 4.0M | 0 | 0 |
| Total | | | | | | 22 / 9.2M | 23 / 9.6K | 5.6 / 2.4K |
Q: How to use the dataset?
- Quick start
- Prepare the dataset for MedVision benchmarking and finetuning: here (a minimal loading sketch is shown below)
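A minimal loading sketch, assuming the data is hosted on the Hugging Face Hub as a `datasets`-compatible repository; the repo id and field names below are placeholders, so check the dataset card linked above for the actual values:

```python
from datasets import load_dataset

# Placeholder repo id -- replace with the actual id from the dataset card.
ds = load_dataset("ORG/MedVision", split="test")

sample = ds[0]
print(sample.keys())  # expect an image plus structured measurement annotations
```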
1️⃣ Detection
Q: Can VLMs localize healthy anatomical structures and abnormalities from medical images?
- Pretrained VLMs show limited ability in medical image detection tasks.
- With SFT, VLMs achieve dramatic improvement in the detection of both healthy anatomical structures and tumors/lesions.
- Detection of small objects and tumors/lesions remains challenging.
- SFT improves models’ generalizability (details in the paper).
Group-level detection performance
Table 2: VLM performance on detection tasks. Targets are grouped into healthy anatomy and tumor/lesion detection tasks. Mean metrics weighted by sample sizes are reported in %. R: recall; P: precision; F1: F1 score; IoU: intersection over union; SR: success rate.
| Model | Anatomy R ↑ | Anatomy P ↑ | Anatomy F1 ↑ | Anatomy IoU ↑ | Anatomy SR ↑ | Anatomy IoU>0.5 ↑ | Tumor/Lesion R ↑ | Tumor/Lesion P ↑ | Tumor/Lesion F1 ↑ | Tumor/Lesion IoU ↑ | Tumor/Lesion SR ↑ | Tumor/Lesion IoU>0.5 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 50.4 | 9.5 | 12.4 | 7.6 | 100 | 1.4 | 54.5 | 3.4 | 5.6 | 3.1 | 100 | 0.0 |
| Qwen2.5-VL (32B) | 35.2 | 13.6 | 16.1 | 11.0 | 99.7 | 5.7 | 38.0 | 6.2 | 8.4 | 5.1 | 99.9 | 0.7 |
| Lingshu (32B) | 26.4 | 19.1 | 17.3 | 11.6 | 100 | 4.0 | 23.1 | 5.9 | 7.2 | 4.4 | 100 | 0.6 |
| InternVL3 (38B) | 24.5 | 14.5 | 14.5 | 9.8 | 100 | 5.3 | 39.2 | 5.9 | 7.6 | 4.5 | 100 | 0.4 |
| Gemma3 (27B) | 33.3 | 17.4 | 19.4 | 13.7 | 100 | 9.9 | 33.4 | 5.1 | 6.6 | 4.0 | 100 | 0.5 |
| MedGemma (4B) | 66.2 | 16.3 | 19.0 | 12.8 | 100 | 7.2 | 72.1 | 5.2 | 8.6 | 4.9 | 99.6 | 0.2 |
| MedGemma (27B) | 65.6 | 17.3 | 19.3 | 12.9 | 100 | 6.5 | 65.8 | 5.7 | 8.8 | 5.1 | 100 | 0.2 |
| Llama3.2-Vision (11B) | 47.0 | 8.1 | 10.4 | 6.8 | 73.8 | 2.7 | 45.6 | 2.1 | 3.6 | 2.0 | 70.1 | 0.0 |
| LLava-OneVision (72B) | 36.4 | 17.8 | 18.5 | 12.3 | 100 | 4.4 | 38.1 | 6.0 | 8.5 | 5.1 | 100 | 0.4 |
| LLaVA-Med-v1.5 (7B) | 60.7 | 15.7 | 18.6 | 12.6 | 99.1 | 7.0 | 50.6 | 4.7 | 7.3 | 4.3 | 89.0 | 0.8 |
| MedDr (40B) | 64.4 | 11.9 | 17.3 | 11.3 | 99.8 | 4.1 | 74.9 | 4.4 | 7.5 | 4.3 | 97.8 | 0.1 |
| HuatuoGPT-Vision (34B) | 25.3 | 20.5 | 16.3 | 10.7 | 100 | 3.6 | 20.8 | 5.9 | 7.0 | 4.3 | 100 | 0.6 |
| HealthGPT-L14 | 21.0 | 15.2 | 13.5 | 8.3 | 100 | 0.5 | 22.7 | 6.2 | 6.8 | 4.2 | 100 | 0.5 |
| Gemini2.5-Flash (w/o tool) | 35.4 | 16.1 | 18.7 | 12.8 | 99.5 | 6.8 | 41.4 | 7.1 | 10.1 | 6.3 | 98.3 | 1.2 |
| Gemini2.5-Flash (w/ tool) | 29.9 | 13.0 | 15.2 | 10.6 | 82.0 | 5.8 | 38.4 | 9.1 | 10.2 | 6.9 | 77.8 | 3.6 |
| Qwen2.5-VL (7B, SFT1M) | 80.6 | 79.1 | 78.2 | 71.6 | 100 | 79.5 | 55.8 | 51.6 | 49.4 | 41.2 | 100 | 46.2 |
| Qwen2.5-VL (32B, SFT1M) | 82.1 | 83.4 | 81.2 | 74.6 | 100 | 82.8 | 58.7 | 54.8 | 52.6 | 44.3 | 100 | 49.5 |
Label-level detection performance
This figure shows the detection metrics per label and per box-to-image ratio (which indicates relative target size) for each model.
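For reference, a minimal sketch of the box IoU underlying these metrics, assuming boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates; the released evaluation code is the authoritative implementation:

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # Intersection rectangle (clamped to zero when the boxes do not overlap)
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union if union > 0 else 0.0

# A prediction contributes to the IoU>0.5 column when box_iou(pred, gt) exceeds 0.5.
print(box_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```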
2️⃣ Tumor/Lesion Size Estimation
Q: Can VLMs estimate the size of tumors/lesions from medical images?
- Mean relative error (MRE) for SFT models is ~30%, while it is 50–120% for off-the-shelf VLMs.
- SFT improves models’ generalizability (details in the paper).
VLMs are asked to estimate a tumor/lesion's longest diameter and the diameter perpendicular to it. An example of the quantitative annotations from this work is shown below.
Figure 2: Tumor/lesion size annotation. An ellipse is fitted to the tumor/lesion mask and 4 landmarks are recorded.
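A minimal sketch of this annotation idea, assuming a binary 2D mask as a NumPy array and using OpenCV's ellipse fitting; the released annotation code defines the exact procedure, and the four landmarks in Figure 2 would correspond to the endpoints of the two fitted axes:

```python
import cv2
import numpy as np

def diameters_from_mask(mask: np.ndarray):
    """Return (longest diameter, perpendicular diameter) in pixels for a binary mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)      # largest connected region
    (cx, cy), axes, angle = cv2.fitEllipse(contour)   # needs >= 5 contour points
    major, minor = max(axes), min(axes)               # full axis lengths of the fitted ellipse
    return major, minor                               # multiply by pixel spacing to get mm
```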
Table 3: VLM performance on tumor/lesion size estimation tasks. Mean relative error (MRE), success rate (SR), and MRE<k are reported in %, while mean absolute error (MAE) is in millimeters.
| Model | MAE ↓ | MRE ↓ | SR ↑ | MRE<0.1 ↑ | MRE<0.2 ↑ | MRE<0.3 ↑ |
|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 32.3 | 71.5 | 100.0 | 0.3 | 1.7 | 3.2 |
| Qwen2.5-VL (32B) | 24.2 | 51.9 | 100.0 | 2.7 | 11.8 | 19.9 |
| Lingshu (32B) | 23.9 | 66.1 | 100.0 | 10.4 | 29.7 | 43.4 |
| InternVL3 (38B) | 22.6 | 50.1 | 100.0 | 4.5 | 15.0 | 24.2 |
| Gemma3 (27B) | 30.7 | 70.7 | 100.0 | 1.4 | 4.9 | 8.7 |
| MedGemma (4B) | 38.9 | 116.8 | 100.0 | 1.1 | 4.5 | 7.6 |
| MedGemma (27B) | 38.9 | 116.9 | 100.0 | 1.1 | 4.3 | 7.5 |
| Llama3.2-Vision (11B) | 25.7 | 61.9 | 99.1 | 4.6 | 14.0 | 21.1 |
| LLava-OneVision (72B) | 26.2 | 83.0 | 100.0 | 4.8 | 17.7 | 29.4 |
| LLaVA-Med-v1.5 (7B) | 48.8 | 74.7 | 22.6 | 0.3 | 0.6 | 1.1 |
| MedDr (40B) | 30.1 | 73.8 | 100.0 | 1.6 | 4.7 | 9.4 |
| HuatuoGPT-Vision (34B) | 28.9 | 89.1 | 100.0 | 8.5 | 26.2 | 42.3 |
| HealthGPT-L14 | 23.6 | 61.3 | 98.9 | 8.0 | 22.6 | 35.8 |
| Qwen2.5-VL (7B, SFT5K) | 13.2 | 30.6 | 100.0 | 20.8 | 41.5 | 62.1 |
| Qwen2.5-VL (32B, SFT5K) | 12.8 | 30.2 | 100.0 | 19.6 | 43.2 | 63.2 |
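For clarity, a minimal sketch of the metrics in Table 3, assuming paired predicted and ground-truth diameters in millimeters; per-dataset weighting and other details follow the released evaluation code:

```python
import numpy as np

def size_metrics(pred_mm: np.ndarray, gt_mm: np.ndarray, thresholds=(0.1, 0.2, 0.3)):
    """MAE in mm; MRE and MRE<k as percentages over successfully parsed predictions."""
    abs_err = np.abs(pred_mm - gt_mm)
    rel_err = abs_err / gt_mm
    metrics = {"MAE": abs_err.mean(), "MRE": 100 * rel_err.mean()}
    for k in thresholds:
        metrics[f"MRE<{k}"] = 100 * (rel_err < k).mean()  # fraction of cases under threshold
    return metrics
```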
3️⃣ Angle/Distance Measurement
Q: Can VLMs measure angles or distances from medical images?
- Off-the-shelf VLMs fail to measure angles and distances accurately.
- SFT can significantly improve performance.
- Small-angle measurements remain challenging.
VLMs are prompted with the task description and the definition of each angle or distance. Examples of landmarks in the Ceph-Bio-400 and FeTA24 datasets are shown below.
Figure 3: Landmarks in the Ceph-Bio-400 (top-left) and FeTA24 datasets. Ground truth angle and distance measurements are calculated from these landmarks.
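As an illustration, a minimal sketch of deriving a distance and an angle from landmark coordinates, assuming `(x, y)` pixel positions and isotropic pixel spacing; the exact measurement definitions come from the respective datasets:

```python
import numpy as np

def distance_mm(p1, p2, spacing_mm: float) -> float:
    """Euclidean distance between two landmarks, converted to millimeters."""
    return float(np.linalg.norm(np.subtract(p1, p2)) * spacing_mm)

def angle_deg(vertex, p1, p2) -> float:
    """Angle at `vertex` formed by the rays vertex->p1 and vertex->p2, in degrees."""
    v1 = np.subtract(p1, vertex)
    v2 = np.subtract(p2, vertex)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```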
Table 4: VLM performance on angle and distance measurement tasks. Mean relative errors (MRE), MRE<0.1, and success rate (SR) are reported in %. Mean absolute errors (MAE) are given in millimeters (distance) and degrees (angle).
| Model | Ceph-Bio-400 Distance MAE ↓ | Ceph-Bio-400 Distance MRE ↓ | Ceph-Bio-400 Distance SR ↑ | Ceph-Bio-400 Distance MRE<0.1 ↑ | Ceph-Bio-400 Angle MAE ↓ | Ceph-Bio-400 Angle MRE ↓ | Ceph-Bio-400 Angle SR ↑ | Ceph-Bio-400 Angle MRE<0.1 ↑ | FeTA24 Distance MAE ↓ | FeTA24 Distance MRE ↓ | FeTA24 Distance SR ↑ | FeTA24 Distance MRE<0.1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (7B) | 56.3 | 80.1 | 100 | 0.9 | 55.1 | 3691 | 98.7 | 3.4 | 27.5 | 61.6 | 100 | 2.0 |
| Qwen2.5-VL (32B) | 59.7 | 86.0 | 100 | 0.2 | 41.0 | 3287 | 100 | 2.0 | 19.5 | 49.6 | 100 | 6.0 |
| Lingshu (32B) | 17.3 | 26.1 | 100 | 15.5 | 31.2 | 4648 | 100 | 10.4 | 38.5 | 120.9 | 100 | 0.0 |
| InternVL3 (38B) | 22.1 | 38.2 | 100 | 11.1 | 66.1 | 4405 | 100 | 4.3 | 19.7 | 54.7 | 100 | 11.0 |
| Gemma3 (27B) | 22.6 | 32.3 | 100 | 22.4 | 33.6 | 7862 | 100 | 13.8 | 18.1 | 40.2 | 100 | 15.0 |
| MedGemma (4B) | 43.4 | 67.9 | 100 | 6.5 | 30.5 | 2780 | 100 | 5.8 | 26.2 | 60.4 | 100 | 5.0 |
| MedGemma (27B) | 43.9 | 68.8 | 100 | 6.4 | 29.9 | 2168 | 100 | 6.4 | 26.2 | 60.4 | 100 | 5.0 |
| Llama3.2-Vision (11B) | 81.6 | 117.4 | 100 | 2.9 | 55.3 | 9318.6 | 100 | 8.0 | 29.9 | 73.4 | 100 | 2.0 |
| LLava-OneVision (72B) | 36.3 | 67.8 | 100 | 18.8 | 62.6 | 12269.1 | 100 | 0.0 | 46.0 | 130.7 | 100 | 4.0 |
| MedDr (40B) | 33.7 | 53.6 | 100 | 6.0 | 54.8 | 9149.2 | 100 | 0.2 | 37.1 | 78.5 | 100 | 5.0 |
| HuatuoGPT-Vision (34B) | 45.8 | 65.2 | 100 | 2.8 | 65.8 | 5741.1 | 100 | 9.0 | 40.5 | 83.3 | 100 | 15.0 |
| HealthGPT-L14 | 66.8 | 97.5 | 100 | 1.1 | 42.9 | 434.6 | 100 | 0.0 | 32.1 | 79.4 | 100 | 5.0 |
| Qwen2.5-VL (7B, SFT5K) | 3.5 | 5.4 | 100 | 86.4 | 3.6 | 126.8 | 100 | 50.1 | 4.1 | 13.1 | 100 | 49.0 |
| Qwen2.5-VL (32B, SFT5K) | 4.5 | 7.9 | 100 | 83.5 | 4.0 | 540.2 | 100 | 50.5 | 4.1 | 14.0 | 100 | 54.0 |
MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis