Type: Ph.D. Thesis Proposal
Evaluation. Existing agent evaluations often focus on well-structured tasks and final outcomes, failing to fully capture the complexity of real-world workflows. We propose evaluation frameworks grounded in realistic machine learning engineering workflows, providing skill-based, multi-artifact, and holistic assessments that systematically evaluate the practical utility of AI agents.
Learning. Improving LLMs for agentic use typically relies on reinforcement learning with large amounts of high-quality labeled data, which is costly and difficult to obtain in expert domains such as healthcare. To address this limitation, we aim to develop learning frameworks that require minimal external supervision, improving the scalability and efficiency of agent learning.
Specialization. AI agents typically follow a one-size-fits-all paradigm at deployment time, lacking mechanisms to account for task-specific or user-specific requirements. We propose methods that enable agent specialization for downstream tasks and users, expanding the applicability of agents across heterogeneous deployment settings.
This thesis aims to make AI agents more broadly accessible and impactful in important real-world applications by enhancing their practical utility: making them more measurable, more capable, and better tailored to the needs of their users and tasks.
Andrea Bajcsy
Barnabás Póczos
Daniel McDuff (Google)
