To address this issue, recent research has proposed vehicle-to-vehicle (V2V) and vehicle-to-everything (V2X) cooperative perception for autonomous driving. In such systems, connected autonomous vehicles (CAVs) share their individual perception features with one another to improve overall detection accuracy. However, most existing work focuses solely on the cooperative detection task, without leveraging temporal information about the dynamic environment or considering other critical components of autonomous driving, such as prediction and planning.
To broaden the scope of cooperative driving research, my proposed doctoral research aims to explore multimodal large language model (LLM)–based cooperative autonomous driving, motivated by several potential advantages of LLMs. First, a single LLM-based model offers the flexibility to perform multiple tasks, including perception, prediction, and planning, within a unified framework. Second, LLMs exhibit strong generalizability due to large-scale pretraining on diverse data. Third, LLM-based driving models possess reasoning capabilities that enable them to handle long-tail driving scenarios that may not appear in the training data. Fourth, natural language can serve as an effective and efficient communication interface for V2V, V2X, and human–vehicle interactions.
We have developed multimodal LLM-based architectures for end-to-end cooperative autonomous driving that generate suggested future trajectories for all CAVs through V2V communication. In addition, we have designed a graph-of-thoughts reasoning framework that further improves the reliability and interpretability of this architecture. Finally, we propose to develop a decentralized V2V framework based on multimodal LLMs, improving scalability and making future large-scale deployment more practical.
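For concreteness, the sketch below outlines one possible shape of such a V2V exchange: each CAV summarizes its local perception into a compact, language-based message, and any receiver can plan for the whole fleet from the pooled messages, as in the decentralized variant. This is a minimal, hypothetical illustration only; all names (CavMessage, encode_perception, llm_plan_trajectories) are placeholders, and the multimodal LLM planner is stubbed out rather than reflecting the actual architecture.

```python
"""Hypothetical sketch of an LLM-based V2V cooperative planning loop.

All identifiers are illustrative placeholders, not the proposed system.
"""
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CavMessage:
    """Compact per-vehicle message broadcast over V2V."""
    cav_id: str
    pose: Tuple[float, float, float]  # x, y, heading in a shared frame
    scene_summary: str                # natural-language summary of local perception
    detections: List[Tuple[float, float]] = field(default_factory=list)


def encode_perception(cav_id: str, pose, detections) -> CavMessage:
    # A real system would derive this from a perception backbone; here we
    # only turn coarse detections into a short natural-language summary,
    # since language is the assumed communication interface.
    summary = f"{cav_id} at {pose} sees {len(detections)} nearby objects."
    return CavMessage(cav_id, tuple(pose), summary, list(detections))


def llm_plan_trajectories(messages: List[CavMessage]) -> Dict[str, List[Tuple[float, float]]]:
    """Placeholder for the multimodal LLM planner.

    A real implementation would prompt an LLM with the pooled messages and
    parse suggested waypoints; here we return a trivial straight-ahead
    trajectory per CAV so the sketch runs end to end.
    """
    plans: Dict[str, List[Tuple[float, float]]] = {}
    for msg in messages:
        x, y, _ = msg.pose
        plans[msg.cav_id] = [(x + step, y) for step in (5.0, 10.0, 15.0)]
    return plans


if __name__ == "__main__":
    # Each CAV builds and broadcasts its message; any receiver plans for
    # the whole fleet from the pooled messages (decentralized setting).
    pool = [
        encode_perception("cav_1", (0.0, 0.0, 0.0), [(12.0, 1.5)]),
        encode_perception("cav_2", (20.0, 3.5, 0.0), [(30.0, -2.0), (25.0, 4.0)]),
    ]
    for cav_id, waypoints in llm_plan_trajectories(pool).items():
        print(cav_id, waypoints)
```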
