Modern vision and vision–language models (VLMs) achieve remarkable perceptual performance, yet their internal representations often misalign with human-understandable concepts, clinical reasoning, or the causal structure of data. Such misalignment limits trust, generalization, and safety – particularly in high-stakes domains such as medical imaging. This thesis proposes a comprehensive framework for model alignment, developing methods that bring model representations closer to clinically meaningful, causally grounded, and task-adaptive concepts across diverse visual tasks.
First, I introduce biomarker-grounded alignment for lung ultrasound (LUS), where domain-informed interpretable biomarkers serve as anchors to structure deep model representations. I develop methods that disentangle anatomical, morphological, and artifact-level biomarkers and demonstrate that these aligned representations improve interpretability while matching or exceeding fully supervised baselines across diagnostic and severity scoring tasks.
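As a concrete, purely illustrative sketch of biomarker-grounded alignment (not the thesis implementation), the code below attaches per-biomarker linear probes to disjoint subspaces of an image encoder's latent representation, so that each subspace is pushed toward one interpretable LUS biomarker while a shared head handles the downstream task. The biomarker set, slot dimensions, and loss weighting are assumptions made for the example.

```python
# Illustrative sketch only: latent "slots" supervised by interpretable LUS biomarkers.
# Biomarker names, dimensions, and the loss weight `lam` are assumed, not from the thesis.
import torch
import torch.nn as nn

BIOMARKERS = ["a_lines", "b_lines", "pleural_irregularity", "consolidation"]  # assumed set

class BiomarkerAlignedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, slot_dim: int = 32):
        super().__init__()
        self.backbone = backbone                       # any image encoder -> (B, feat_dim)
        # Project each biomarker onto its own latent slot to encourage disentanglement.
        self.slots = nn.ModuleDict({name: nn.Linear(feat_dim, slot_dim) for name in BIOMARKERS})
        self.probes = nn.ModuleDict({name: nn.Linear(slot_dim, 1) for name in BIOMARKERS})
        self.task_head = nn.Linear(feat_dim, 4)        # e.g., a 4-level severity score

    def forward(self, x):
        z = self.backbone(x)
        biomarker_logits = {n: self.probes[n](self.slots[n](z)) for n in BIOMARKERS}
        return self.task_head(z), biomarker_logits

def alignment_loss(task_logits, biomarker_logits, severity, biomarker_labels, lam=0.5):
    """Supervised task loss plus a biomarker-alignment penalty (lam is a tunable weight)."""
    loss = nn.functional.cross_entropy(task_logits, severity)
    for name, logit in biomarker_logits.items():
        loss = loss + lam * nn.functional.binary_cross_entropy_with_logits(
            logit.squeeze(-1), biomarker_labels[name].float()
        )
    return loss
```

Restricting each probe to its own latent slot is one simple way to encode the disentanglement described above; the methods developed in the thesis need not take this exact form.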
Second, I propose a causal feature selection framework based on Markov blanket discovery to identify minimal yet causally relevant feature subsets across medical and non-medical datasets. By uncovering natural experiments embedded in observational data, the method reveals features that are inherently robust, reduces spurious correlations, and provides theoretical and empirical evidence for improved generalization and interpretability.
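Markov blanket discovery itself can be sketched with a standard grow-then-shrink procedure in the spirit of IAMB, shown below; the partial-correlation independence test and the significance threshold are simplifying assumptions for the example, not the procedure developed in the thesis.

```python
# Sketch of Markov-blanket feature selection (IAMB-style grow/shrink).
# The Fisher-z partial-correlation test and alpha=0.05 are illustrative choices.
import numpy as np
from scipy import stats

def partial_corr_pvalue(X, i, j, cond, n):
    """p-value for X_i independent of X_j given X_cond, via Fisher's z on partial correlation."""
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(X[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(max(n - len(cond) - 3, 1))
    return 2 * (1 - stats.norm.cdf(abs(z)))

def iamb(X, target_col, alpha=0.05):
    """Return column indices forming an (approximate) Markov blanket of X[:, target_col]."""
    n, d = X.shape
    candidates = [j for j in range(d) if j != target_col]
    mb = []
    # Growing phase: repeatedly add the most dependent remaining candidate.
    changed = True
    while changed:
        changed = False
        pvals = {j: partial_corr_pvalue(X, target_col, j, mb, n)
                 for j in candidates if j not in mb}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha:
                mb.append(best)
                changed = True
    # Shrinking phase: drop features that are independent of the target given the rest.
    for j in list(mb):
        rest = [k for k in mb if k != j]
        if partial_corr_pvalue(X, target_col, j, rest, n) >= alpha:
            mb.remove(j)
    return mb
```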
Finally, I outline two future research directions to further strengthen alignment and its evaluation. The first explores gradient-based soft prompt tuning, which learns continuous prompt embeddings through backpropagation to enable more stable and scalable prompt optimization. The second develops saliency mapping for VLMs by scoring region-proposal masks (e.g., from SAM) according to their impact on downstream performance under VQA-style querying, enabling a quantitative assessment of the quality and faithfulness of visual explanations.
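For the first direction, a minimal sketch of gradient-based soft prompt tuning is given below, assuming a frozen CLIP-style backbone; interface names such as `encode_image` and `encode_text_from_embeddings` are placeholders rather than a specific library API.

```python
# Sketch: learn n_ctx continuous context embeddings prepended to each class-name
# embedding, trained by backpropagation while the vision-language backbone stays frozen.
import torch
import torch.nn as nn

class SoftPromptClassifier(nn.Module):
    def __init__(self, clip_model, class_token_embeds, n_ctx=8, embed_dim=512):
        super().__init__()
        self.clip = clip_model.eval()
        for p in self.clip.parameters():               # backbone is frozen
            p.requires_grad_(False)
        # Learnable context vectors shared across classes (the "soft prompt").
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Pre-computed token embeddings of each class name: list of (L_c, embed_dim) tensors.
        self.class_token_embeds = class_token_embeds

    def forward(self, images):
        img_feat = self.clip.encode_image(images)                          # (B, D), frozen
        text_feats = []
        for tok in self.class_token_embeds:
            prompt = torch.cat([self.ctx, tok.to(self.ctx.device)], dim=0)
            text_feats.append(self.clip.encode_text_from_embeddings(prompt.unsqueeze(0)))
        txt_feat = torch.cat(text_feats, dim=0)                            # (num_classes, D)
        img_feat = nn.functional.normalize(img_feat, dim=-1)
        txt_feat = nn.functional.normalize(txt_feat, dim=-1)
        return 100.0 * img_feat @ txt_feat.t()                             # class logits

# Training-loop sketch: only the soft prompt receives gradients.
# optimizer = torch.optim.AdamW([model.ctx], lr=2e-3)
# loss = nn.functional.cross_entropy(model(images), labels); loss.backward(); optimizer.step()
```

For the second direction, the sketch below scores each candidate region mask by how much revealing versus hiding that region changes a VLM's answer confidence on a VQA-style query; `answer_confidence` and the normalization are hypothetical placeholders standing in for whatever scoring interface is ultimately used.

```python
# Sketch of mask-based saliency scoring: a mask is more faithful if showing only that
# region preserves the VQA answer while hiding it degrades the answer.
import torch

def mask_faithfulness_scores(vlm, image, question, answer, masks):
    """Return one faithfulness score per (H, W) binary mask for a (C, H, W) image."""
    base = float(vlm.answer_confidence(image, question, answer))   # placeholder API
    scores = []
    for m in masks:
        keep_only = image * m                                      # show the region only
        drop_only = image * (1 - m)                                # hide the region only
        s_keep = float(vlm.answer_confidence(keep_only, question, answer))
        s_drop = float(vlm.answer_confidence(drop_only, question, answer))
        scores.append((s_keep - s_drop) / (abs(base) + 1e-8))      # illustrative normalization
    return torch.tensor(scores)
```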
