Scalable Training of Spatially Grounded
2D Vision–Language Models for Radiology

1Computer Vision Group, University of Freiburg, Germany
2Adaptive & Agentic AI (A3) Lab, Aarhus University, Denmark
3Department of Radiology, Medical Center – University of Freiburg, Germany
4CRIION-AI Lab, Freiburg, Germany
*Equal contribution   Equal supervision

News

  • 18 June 2026: Preprint released
  • 12 June 2026: Accepted to MICCAI 2026

Overview

RefRad2D grounding-annotation pipeline overview

Generating spatial grounding labels without manual annotation. We process full 3D CT and MR volumes with TotalSegmentator to produce 3D segmentation masks, and extract anatomical keywords from the matching report with an LLM (gpt-oss-120B). A strict set-intersection filter (C_T ∩ C_I ≠ ∅) keeps only slices where the mentioned entity is actually present, yielding dense, high-quality 2D pixel-level annotations that link textual findings to image regions.
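The set-intersection filter can be illustrated with a minimal sketch. It assumes that C_T is the set of anatomical classes extracted from the report text and C_I is the set of classes present in a given slice of the segmentation mask. The function name `grounded_slices`, the `(slices, H, W)` volume layout, and the label dictionary are hypothetical and chosen only for illustration; this is not the authors' pipeline code.

```python
import numpy as np

def grounded_slices(mask_volume, label_names, report_keywords, axis=0):
    """Return {slice_index: shared class names} for slices with C_T ∩ C_I ≠ ∅.

    mask_volume     : integer label volume, 0 = background (assumed layout: slices, H, W)
    label_names     : dict mapping label id -> class name (hypothetical schema)
    report_keywords : class names extracted from the report (C_T)
    """
    c_t = set(report_keywords)
    kept = {}
    for idx in range(mask_volume.shape[axis]):
        slice_mask = np.take(mask_volume, idx, axis=axis)
        # C_I: classes that actually appear in this 2D slice
        c_i = {label_names[i] for i in np.unique(slice_mask).tolist()
               if i != 0 and i in label_names}
        shared = c_t & c_i
        if shared:  # keep the slice only if the intersection is non-empty
            kept[idx] = sorted(shared)
    return kept

# Toy volume: liver on slice 0, spleen on slice 1, nothing on slice 2.
volume = np.zeros((3, 4, 4), dtype=np.int32)
volume[0, 1:3, 1:3] = 1
volume[1, 0:2, 0:2] = 2
names = {1: "liver", 2: "spleen"}

result = grounded_slices(volume, names, {"liver", "kidney"})
print(result)  # {0: ['liver']}
```

In the toy example the report mentions the liver and the kidney, but only the liver is segmented in any slice, so a single slice survives the filter.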

Abstract

We study how to train visually grounded vision–language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image–text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation.

Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder is competitive with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, which indicates that the dataset transfers to external benchmarks. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

Method

RefRad2D dataset. RefRad2D contains 760,409 CT and 256,197 MR image–report pairs derived from clinical routine. From these we automatically build two task-specific subsets: RefRad2D-Grounded, with 236,157 grounded slice–text pairs (217,692 CT, 18,465 MR), and RefRad2D-VQA (~9.6M QA pairs), generated with Gemma 3 (27B) as five QA pairs per image from clinical findings plus three from slice metadata. Grounding labels are obtained by running TotalSegmentator on the 3D volumes (harmonized to a unified schema of C = 121 anatomical classes) and matching the resulting masks to anatomical mentions extracted from each report.

Architecture. RadGrounder builds on PaliGemma 2 (3B), which combines a SigLIP-So400m vision encoder with a Gemma 2 (2B) language decoder. The model encodes an image into visual tokens that are concatenated with the text tokens and decoded autoregressively. We investigate two grounding strategies on top of this foundation.

Bounding-box detection. We treat spatial grounding as a text-generation task, extending the vocabulary with coordinate tokens (512 bins) and class-identifying tokens. The model generates a structured sequence <p bbox> [LOC] id=<segID> KEYWORD </p>, where [LOC] denotes the box coordinates and id maps to the unified anatomical schema. A class-wise merging strategy resolves multiple instances of the same organ.
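A minimal sketch of how a box could be serialized into such a sequence is given here. Several details are assumptions rather than facts from the paper: the `<locNNN>` token spelling, the (y0, x0, y1, x1) coordinate order, the rendering of the class id as `<id>`, and the use of a box union as the class-wise merging rule. All function names are hypothetical.

```python
def quantize(value, size, bins=512):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value / size * bins)))

def merge_boxes_by_class(boxes):
    """Union all boxes that share a class id (assumed merging rule).

    boxes: list of (seg_id, keyword, (x0, y0, x1, y1)) in pixel coordinates.
    """
    merged = {}
    for seg_id, keyword, (x0, y0, x1, y1) in boxes:
        if seg_id in merged:
            kw, (mx0, my0, mx1, my1) = merged[seg_id]
            merged[seg_id] = (kw, (min(mx0, x0), min(my0, y0),
                                   max(mx1, x1), max(my1, y1)))
        else:
            merged[seg_id] = (keyword, (x0, y0, x1, y1))
    return [(sid, kw, box) for sid, (kw, box) in sorted(merged.items())]

def format_box(seg_id, keyword, box, width, height, bins=512):
    """Render one box as '<p bbox> [LOC] id=<segID> KEYWORD </p>'."""
    x0, y0, x1, y1 = box
    # Hypothetical token spelling and coordinate order (y0, x0, y1, x1).
    bins_yxyx = (quantize(y0, height, bins), quantize(x0, width, bins),
                 quantize(y1, height, bins), quantize(x1, width, bins))
    loc = "".join(f"<loc{q:03d}>" for q in bins_yxyx)
    return f"<p bbox> {loc} id=<{seg_id}> {keyword} </p>"

# Two fragments of the same organ are merged into one box before serialization.
fragments = [(5, "liver", (64, 128, 150, 200)), (5, "liver", (100, 160, 192, 256))]
merged = merge_boxes_by_class(fragments)
sequence = format_box(*merged[0], width=512, height=512)
print(sequence)  # <p bbox> <loc128><loc064><loc256><loc192> id=<5> liver </p>
```

With a 512-pixel image and 512 bins, each bin corresponds to one pixel, so the printed tokens match the merged pixel coordinates directly.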

Segmentation head. We also explore pixel-level grounding via a lightweight mask decoder following VividMed and SAM. When the model emits a </seg> token, its hidden state is projected into a 256-dim prompt that drives a Two-Way Transformer over the image embeddings to produce a binary mask. The model is trained end-to-end with an autoregressive cross-entropy loss L_txt; the segmentation variant adds an auxiliary mask loss, giving L = L_txt + λ_seg (L_focal + L_dice).
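The combined objective can be sketched in NumPy as follows. The focal-loss hyperparameters (`alpha`, `gamma`), the Dice smoothing term `eps`, and the default `lambda_seg = 1.0` are assumed values for illustration; the source does not state them, and the function names are hypothetical.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over pixels (alpha and gamma are assumed values)."""
    p = _sigmoid(logits)
    p_t = np.where(target == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma
                         * np.log(np.clip(p_t, 1e-7, 1.0))))

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss on the predicted foreground probabilities."""
    p = _sigmoid(logits)
    intersection = np.sum(p * target)
    return float(1.0 - (2.0 * intersection + eps)
                 / (np.sum(p) + np.sum(target) + eps))

def total_loss(l_txt, logits, target, lambda_seg=1.0):
    """L = L_txt + lambda_seg * (L_focal + L_dice)."""
    return l_txt + lambda_seg * (focal_loss(logits, target)
                                 + dice_loss(logits, target))

# Toy 4x4 mask with a 2x2 foreground square.
target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0
good_logits = np.where(target == 1, 20.0, -20.0)   # confident and correct
bad_logits = -good_logits                          # confident and wrong

loss_good = total_loss(0.5, good_logits, target)
loss_bad = total_loss(0.5, bad_logits, target)
print(loss_good < loss_bad)  # True
```

With a near-perfect mask the auxiliary terms vanish and the total reduces to the text loss, which is the behavior one would expect from an additive auxiliary objective.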

Evaluation: LLMScore & G-IoU. Because n-gram metrics (e.g., CIDEr) struggle with the semantic equivalence of medical text, we introduce LLMScore, an LLM-as-a-judge metric (Gemma 3 27B) on a 5-point scale. Validated against three radiologists, it reaches an inter-annotator agreement of Krippendorff's α = 0.958 and a Pearson correlation of r = 0.977 with mean human scores. We additionally propose Grounding-IoU (G-IoU), which jointly measures spatial and semantic fidelity by weighting per-entity spatial IoU with the cosine similarity of the entity text embeddings.
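One way to read the G-IoU definition is sketched here. The sketch assumes that predicted and reference entities are already matched one-to-one, that negative cosine similarities are clipped to zero, and that per-entity scores are averaged; none of these choices is confirmed by the source. The toy embedding vectors stand in for the output of a real text encoder, and all names are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def g_iou(pairs):
    """Mean of IoU x cosine similarity over matched entity pairs.

    pairs: list of (pred_box, gt_box, pred_embedding, gt_embedding),
           assumed to be matched one-to-one beforehand.
    """
    if not pairs:
        return 0.0
    scores = [box_iou(pb, gb) * max(0.0, cosine(pe, ge))
              for pb, gb, pe, ge in pairs]
    return float(np.mean(scores))

# Same box and same text: full credit.
perfect = g_iou([((0, 0, 2, 2), (0, 0, 2, 2), [1.0, 0.0], [1.0, 0.0])])
# Right location but unrelated text (orthogonal embeddings): no credit.
wrong_text = g_iou([((0, 0, 2, 2), (0, 0, 2, 2), [1.0, 0.0], [0.0, 1.0])])
print(perfect, wrong_text)  # 1.0 0.0
```

The product form means a prediction only scores well when both the location and the named entity agree with the reference.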

Qualitative Example

Detection grounding example: RadGrounder localizes a nodular lesion in the left lower lobe on a chest CT

Detection grounding on a chest CT (lung window). RadGrounder localizes a nodular lesion in the left lower lobe, generating a bounding box (prediction in red) that closely matches the ground truth (green) both spatially and semantically — G-IoU 0.95. The box is produced jointly with the generated report, so the model points to exactly where in the image its finding comes from.

Results

On the external VQA benchmarks Slake and VQA-RAD, RadGrounder is competitive with specialized medical VLMs. With our standard configuration (detection grounding, frozen fine-tuned SigLIP), it reaches Slake open-question F1 87.7 and closed-question accuracy 90.3 — comparable to BiomedGPT-B (85.2 / 89.9) and LLaVA-Med (86.8 closed acc), and surpassing Med-Gemini (75.8 / 84.8) and MedGemma (72.3 / 87.6). On VQA-RAD it achieves the highest open-question F1 (50.7) among compared methods. All numbers are 95% bootstrap means (B = 10,000); best per column in bold, “–” = not reported for that split.
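For readers who want to reproduce this style of reporting, the following sketch shows a token-level F1 for open-ended answers and a bootstrap mean with a 95% percentile interval over B = 10,000 resamples. Whitespace tokenization, lowercasing, and the percentile interval are assumptions; the source does not specify how F1 is tokenized or how the interval is formed. The function names are hypothetical.

```python
from collections import Counter
import numpy as np

def token_f1(prediction, reference):
    """Token-overlap F1 between two answers (assumed whitespace tokenization)."""
    p, r = prediction.lower().split(), reference.lower().split()
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def bootstrap_mean(scores, n_boot=10_000, seed=0):
    """Return (bootstrap mean, 2.5th percentile, 97.5th percentile)."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample example indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    low, high = np.percentile(means, [2.5, 97.5])
    return float(means.mean()), float(low), float(high)

answers = [("left lower lobe", "left lower lobe"),
           ("left lobe", "left lower lobe"),
           ("liver", "spleen")]
per_example = [token_f1(pred, ref) for pred, ref in answers]
mean, low, high = bootstrap_mean(per_example)
print(per_example)
```

Per-example scores are computed once and only the indices are resampled, so the cost of the bootstrap is independent of how expensive the underlying metric is.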

Method                Slake                      VQA-RAD
                      F1 (open)   Acc (closed)   F1 (open)   Acc (closed)
PaliGemma 2 (3B)      24.5        58.1           37.9        55.6
Gemma 3 (4B)          40.2        53.0           33.6        33.6
MedGemma (4B)         72.3        87.6           49.9        69.1
BiomedGPT-B           85.2        89.9           –           –
LLaVA-Med             –           86.8           –           –
RadFM (14B)84.4
Med-Gemini            75.8        84.8           –           78.8
RadGrounder (ours)    87.7        90.3           50.7        64.7

Both released checkpoints are PaliGemma 2 (3B) models with a frozen, fine-tuned SigLIP encoder, trained on report generation, VQA, Slake, and VQA-RAD data plus one grounding task (detection or segmentation). All models were trained on a single NVIDIA H100 GPU for 6 epochs (~2.5 days). Adding grounding supervision leaves language quality intact, yielding a single model that can answer questions, write reports, and point to where in the image its answer comes from.

Acknowledgements

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation). We thank the Department of Radiology at the Medical Center – University of Freiburg for providing the clinical data underlying RefRad2D. The clinical RefRad2D dataset is private and not distributed; the public SLAKE and VQA-RAD benchmarks work out of the box.

BibTeX

@inproceedings{salcan2026radgrounder,
  author    = {Yusuf Salcan and Simon Ging and Robin Schirrmeister and Philipp Arnold and Elmar Kotter and Behzad Bozorgtabar and Thomas Brox},
  title     = {Scalable Training of Spatially Grounded 2{D} Vision--Language Models for Radiology},
  booktitle = {Medical Image Computing and Computer Assisted Intervention -- {MICCAI} 2026},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2026},
  note      = {To appear}
}