
PhD Thesis Proposal

Gautam Rajendrakumar Gare
PhD Student
Robotics Institute, Carnegie Mellon University
Wednesday, February 18
10:00 am to 11:30 am
3305 Newell-Simon Hall
Toward Aligned Vision Models
Abstract:

Modern vision and vision–language models (VLMs) achieve remarkable perceptual performance, yet their internal representations often misalign with human-understandable concepts, clinical reasoning, or the causal structure of data. Such misalignment limits trust, generalization, and safety, particularly in high-stakes domains such as medical imaging. This thesis proposes a comprehensive framework for model alignment, developing methods that bring model representations closer to clinically meaningful, causally grounded, and task-adaptive concepts across diverse visual tasks.

First, I introduce biomarker-grounded alignment for lung ultrasound (LUS), where domain-informed interpretable biomarkers serve as anchors to structure deep model representations. I develop methods that disentangle anatomical, morphological, and artifact-level biomarkers and demonstrate that these aligned representations improve interpretability while matching or exceeding fully supervised baselines across diagnostic and severity scoring tasks.
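As an illustrative sketch of this idea (not the exact architecture from the proposal), the snippet below routes a shared encoder's features through a biomarker bottleneck before the diagnostic head, in the style of concept-bottleneck models; the class name, dimensions, and loss weighting are all hypothetical.

```python
import torch
import torch.nn as nn

class BiomarkerAlignedNet(nn.Module):
    """Hypothetical concept-bottleneck-style model: the task head sees
    only predicted biomarker scores, so the representation is pushed to
    align with interpretable, clinician-defined biomarkers."""

    def __init__(self, encoder: nn.Module, feat_dim: int,
                 n_biomarkers: int, n_classes: int):
        super().__init__()
        self.encoder = encoder                        # e.g., a CNN backbone
        self.biomarker_head = nn.Linear(feat_dim, n_biomarkers)
        self.task_head = nn.Linear(n_biomarkers, n_classes)

    def forward(self, x):
        feats = self.encoder(x)
        biomarkers = torch.sigmoid(self.biomarker_head(feats))  # in [0, 1]
        logits = self.task_head(biomarkers)
        return logits, biomarkers

def alignment_loss(logits, biomarkers, y_task, y_bio, lam=1.0):
    """Task loss plus a biomarker supervision term that anchors the
    latent space to labeled anatomical/morphological/artifact biomarkers."""
    task = nn.functional.cross_entropy(logits, y_task)
    align = nn.functional.binary_cross_entropy(biomarkers, y_bio.float())
    return task + lam * align
```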

Second, I propose a causal feature selection framework based on Markov blanket discovery to identify minimal yet causally relevant feature subsets across medical and non-medical datasets. By uncovering natural experiments embedded in observational data, the method reveals features that are inherently robust, reduces spurious correlations, and provides theoretical and empirical evidence for improved generalization and interpretability.
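For concreteness, the sketch below implements a standard IAMB-style grow/shrink search for a Markov blanket using a linear-Gaussian conditional-independence test; the proposal's method additionally exploits natural experiments in observational data, which this generic sketch does not capture.

```python
import numpy as np
from scipy import stats

def partial_corr_pvalue(X, i, j, cond):
    """Gaussian CI test: p-value for the partial correlation of
    columns i and j of data matrix X, given the columns in `cond`."""
    n = X.shape[0]
    if cond:
        Z = np.column_stack([np.ones(n), X[:, cond]])
        # Residualize both variables on the conditioning set.
        ri = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
        rj = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    else:
        ri, rj = X[:, i], X[:, j]
    r = np.clip(np.corrcoef(ri, rj)[0, 1], -0.999999, 0.999999)
    # Fisher z-transform with a df correction for the conditioning set.
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

def iamb_markov_blanket(X, target, alpha=0.05):
    """IAMB-style search for the Markov blanket of column `target`:
    greedily grow the blanket, then shrink away false positives."""
    d = X.shape[1]
    mb, changed = [], True
    while changed:  # grow: add the most dependent remaining variable
        changed = False
        pvals = [(partial_corr_pvalue(X, target, c, mb), c)
                 for c in range(d) if c != target and c not in mb]
        if pvals:
            p, best = min(pvals)
            if p < alpha:
                mb.append(best)
                changed = True
    for v in list(mb):  # shrink: drop now-independent variables
        rest = [u for u in mb if u != v]
        if partial_corr_pvalue(X, target, v, rest) >= alpha:
            mb.remove(v)
    return sorted(mb)
```

On linear-Gaussian data, the returned set approximates the target's parents, children, and spouses, which is exactly the minimal feature subset that renders all other variables redundant for prediction.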

Third, I explore prompt-tuning–based alignment for object detection, showing that positive and negative few-shot exemplars can be leveraged for iterative prompt optimization. This strategy steers models toward task-relevant visual concepts, improves detector robustness under domain shifts, and reveals interpretable activation patterns associated with object-level reasoning.
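A minimal sketch of such iterative prompt optimization is shown below, assuming a hypothetical `score_fn` (detector confidence for a prompt on an image) and `propose_fn` (candidate prompt edits); the hill-climbing objective simply maximizes the score margin between positive and negative exemplars.

```python
from typing import Callable, List, Tuple

def optimize_prompt(
    score_fn: Callable[[str, object], float],   # hypothetical detector API
    propose_fn: Callable[[str], List[str]],     # hypothetical prompt edits
    positives: List[object],
    negatives: List[object],
    init_prompt: str,
    n_rounds: int = 5,
) -> Tuple[str, float]:
    """Hill-climbing prompt search: keep the candidate prompt that best
    separates positive from negative few-shot exemplars."""

    def margin(prompt: str) -> float:
        pos = sum(score_fn(prompt, im) for im in positives) / len(positives)
        neg = sum(score_fn(prompt, im) for im in negatives) / len(negatives)
        return pos - neg  # high when the prompt fires only on positives

    best_prompt, best_margin = init_prompt, margin(init_prompt)
    for _ in range(n_rounds):
        improved = False
        for cand in propose_fn(best_prompt):
            m = margin(cand)
            if m > best_margin:
                best_prompt, best_margin = cand, m
                improved = True
        if not improved:
            break  # local optimum under the proposal distribution
    return best_prompt, best_margin
```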

Collectively, these contributions establish a cohesive strategy for aligning vision models with human-understandable, causally grounded, and task-relevant representations, advancing the development of reliable, interpretable, and generalizable systems suitable for real-world deployment in clinical and broader perceptual settings.

Finally, I outline two future research directions to further strengthen alignment and its evaluation. The first explores gradient-based soft prompt tuning, which learns continuous prompt embeddings through backpropagation to enable more stable and scalable prompt optimization. The second develops saliency mapping for VLMs by evaluating region-proposal masks (e.g., from SAM) based on their impact on downstream performance, measured through VQA-style scoring, enabling quantitative assessment of visual explanation quality and faithfulness.
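As a sketch of the first direction, the snippet below shows generic soft prompt tuning in PyTorch: learnable prompt embeddings are prepended to a frozen model's input embeddings and trained by backpropagation. The `frozen_model(inputs_embeds=...)` call, `embed_layer`, and `task_loss` are placeholder assumptions, not the proposal's actual interface.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable continuous prompt: n_tokens embeddings prepended to the
    (frozen) model's input embeddings and trained by backpropagation."""

    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim)
        batch = token_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, token_embeds], dim=1)

def tune(frozen_model, embed_layer, soft_prompt, loader, task_loss, steps=100):
    """Only the soft prompt receives gradients; the base model stays frozen."""
    for p in frozen_model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
    for step, (tokens, labels) in enumerate(loader):
        if step >= steps:
            break
        embeds = soft_prompt(embed_layer(tokens))   # prepend soft prompt
        logits = frozen_model(inputs_embeds=embeds) # placeholder model API
        loss = task_loss(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```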

Thesis Committee Members:
John Galeotti, Co-chair
Deva Ramanan, Co-chair
Zachary Lipton
Trevor Darrell, University of California, Berkeley
A draft of the thesis proposal document is available here.