ECML-PKDD 2025 (CCF B) · arXiv · Code (GitHub)


Figure 1: VCT example (OCT2017)

Input image, concept output without perturbation, and concept output after perturbation — illustrating why SVCT is needed for stable explanations.

VCT example: concept with/without perturbation


Abstract

Transparency is a paramount concern in medical AI. Concept Bottleneck Models (CBMs) restrict the latent space to human-understandable concepts, but existing methods often rely solely on concept features for prediction, overlooking intrinsic feature embeddings in medical images, and can be unstable under input perturbations. We propose Vision Concept Transformer (VCT), which uses ViT as the backbone and a label-free conceptual layer, and fuses concept features with image features for decision-making to preserve accuracy while remaining interpretable. We then propose Stable Vision Concept Transformer (SVCT) by integrating Denoised Diffusion Smoothing (DDS) so that the top-k concept indices and predictions remain stable under perturbations, providing faithful explanations. Experiments on four medical datasets (HAM10000, Covid19-CT, BloodMNIST, OCT2017) show that VCT and SVCT maintain accuracy and interpretability, and SVCT provides stable explanations under perturbations.


Figure 2: SVCT framework overview

SVCT framework

Overview of the Stable Vision Concept Transformer (SVCT) model.


Method: VCT and SVCT

VCT (Vision Concept Transformer)

  • ViT backbone plus a Concept Bottleneck Layer learned in a label-free way (e.g. CLIP-Dissect).
  • Concept features f_c(X) are fused with backbone features: F(X) = concat(f(X), W_c f(X)) for the final classifier, avoiding the accuracy drop of concept-only CBMs.

SVCT (Stable VCT)

  • Apply Denoised Diffusion Smoothing: add Gaussian noise to token embeddings, then denoise with a diffusion model.
  • This yields a stable concept module: (i) explanation stability — top-k concept overlap ≥ β under perturbation; (ii) prediction robustness — Rényi divergence between predictions bounded.
  • Default noise level σ = 8/255; evaluation uses perturbation radii ρ_u ∈ [6/255, 10/255].

Experiments (from paper)

Datasets: HAM10000, Covid19-CT, BloodMNIST, OCT2017. Perturbation: PGD radius ρ_u ∈ [6/255, 10/255]; DDS noise σ = 8/255.

Accuracy (with/without perturbation)

Method HAM10000 Covid19-CT BloodMNIST OCT2017
Standard (No interpretability) 99.13% 81.62% 97.05% 99.70%
Label-Free CBM 93.61% 79.75% 94.97% 97.50%
Post-hoc CBM 97.60% 76.26% 94.83% 98.60%
VCT 99.00% 80.62% 96.21% 99.10%
SVCT 99.05% 81.37% 96.96% 99.50%
ρ_u=8/255 − LF-CBM 90.08% 67.98% 80.53% 91.88%
ρ_u=8/255 − SVCT 97.97% 74.45% 94.07% 98.70%
ρ_u=10/255 − LF-CBM 88.70% 65.12% 75.63% 90.58%
ρ_u=10/255 − SVCT 97.24% 71.65% 92.65% 98.48%

Stability: CFS and CPCS

CFS ↓ (lower is better), CPCS ↑ (higher is better). SVCT consistently achieves best stability.

Method HAM10000 (CFS/CPCS) Covid19-CT (CFS/CPCS) BloodMNIST (CFS/CPCS) OCT2017 (CFS/CPCS)
ρ_u=6/255 − LF-CBM 0.3335 / 0.9405 0.6022 / 0.8117 0.5328 / 0.8511 0.3798 / 0.9254
ρ_u=6/255 − SVCT 0.1354 / 0.9900 0.5555 / 0.8359 0.3589 / 0.9320 0.3257 / 0.9468
ρ_u=8/255 − LF-CBM 0.3719 / 0.9256 0.6707 / 0.7710 0.6280 / 0.7947 0.3941 / 0.9196
ρ_u=8/255 − SVCT 0.1555 / 0.9867 0.6446 / 0.7818 0.4383 / 0.8977 0.3459 / 0.9387
ρ_u=10/255 − LF-CBM 0.4027 / 0.9123 0.7224 / 0.7336 0.6906 / 0.7545 0.4055 / 0.9145
ρ_u=10/255 − SVCT 0.1725 / 0.9836 0.7096 / 0.7389 0.5058 / 0.8625 0.3620 / 0.9321

Sensitivity and specificity

Method HAM10000 (Sen/Spec) Covid19-CT (Sen/Spec) BloodMNIST (Sen/Spec) OCT2017 (Sen/Spec)
Label-free CBM 0.8878 / 0.9827 0.7984 / 0.8608 0.9407 / 0.9956 0.9750 / 0.9960
SVCT 0.9899 / 0.9999 0.8191 / 0.8037 0.9667 / 0.9958 0.9950 / 0.9994
ρ_u=10/255 − LF CBM 0.6779 / 0.9615 0.5794 / 0.9810 0.5880 / 0.9998 0.8380 / 0.9880
ρ_u=10/255 − SVCT 0.9180 / 0.9932 0.7136 / 0.9303 0.8681 / 0.9948 0.9790 / 0.9923

Ablation (DDS)

Both denoising and smoothing are essential. Best accuracy and stability (lowest CFS, highest CPCS) when both are enabled. See paper for full ablation tables.

Model configuration & computational cost

Config (paper Table 4): batch_size 512, saga_batch_size 256, proj_batch_size 5000, clip_cutoff 0.25, proj_steps 1000, interpretability_cutoff 0.45, lam 0.0007, n_iters 1000, ρ_u ∈ [6/255, 10/255], S = 8/255.

Cost: ViT 85.8M params / 17.56 GFLOPS; Label-free CBM 85.76M (+40K); SVCT 85.85M (+43K), same GFLOPS.


Visualizations

Figure 3: Concept visualization — one sample per dataset, before vs after perturbation.

Concept visualization before/after perturbation

Figure 4: Concept-intervention examples — experts can correct concept predictions for co-diagnosis.

Concept-intervention examples

Per-dataset concept weight visualizations (one sample each: Covid19-CT, BloodMNIST, HAM10000, OCT2017).

Concept weights Covid19-CT Concept weights BloodMNIST Concept weights HAM10000 Concept weights OCT2017


Code & Repo

Citation

@inproceedings{hu2025stable,
  title     = {Stable Vision Concept Transformers for Medical Diagnosis},
  author    = {Hu, Lijie and Lai, Songning and Hua, Yuan and Yang, Shu and Zhang, Jingfeng and Wang, Di},
  booktitle = {Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)},
  year      = {2025}
}

← Back to Home