The evolution from text-based Large Language Models (LLMs) to Multimodal Large Foundation Models (MLFMs) marks a significant shift in AI capabilities. This topic covers the transition from unified multimodal understanding to generation, highlighting the challenges of cross-modal alignment and the need for rigorous quality assessment.
Multimodal Alignment: Unlike traditional multimodal models, which often rely on an “anchor” modality (usually text) to align the other modalities, we explore Anchorless Multimodal Alignment, which aims to align all modalities into a unified concept space without privileging any one of them. This also lets us study the knowledge gaps that arise when modalities are missing or only partially available during MLFM training. Through this, we hope to teach models knowledge beyond what can be expressed in text.
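As a concrete illustration (not our actual training objective), the sketch below aligns three modalities with a symmetric pairwise contrastive loss, so no modality serves as the anchor and samples with missing modalities simply contribute fewer pairs; the encoder outputs, modality names, and embedding dimension are placeholder assumptions.

```python
# Minimal sketch of anchorless alignment: every modality pair contributes a
# symmetric contrastive (InfoNCE) term, so no single modality acts as the anchor.
import itertools
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two row-paired batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def anchorless_alignment_loss(embeddings: dict[str, torch.Tensor]) -> torch.Tensor:
    """Average the contrastive loss over all modality pairs present in the batch,
    so no modality is the anchor and missing modalities simply drop their pairs."""
    pairs = itertools.combinations(embeddings.keys(), 2)
    losses = [info_nce(embeddings[m1], embeddings[m2]) for m1, m2 in pairs]
    return torch.stack(losses).mean()

# Toy usage: three modalities projected into a shared 256-d concept space.
batch, dim = 8, 256
emb = {name: torch.randn(batch, dim) for name in ("text", "image", "audio")}
print(f"alignment loss over {len(emb)} modalities: {anchorless_alignment_loss(emb).item():.3f}")
```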
Quality Assessment: We aim to scale evaluation methodologies by extending reinforcement learning to general domains through two approaches. The first uses chain-of-thought prompting to break a complex visual reasoning task into AI-verifiable logical steps; it applies to text as well as other modalities such as images, videos, and 3D media. The second develops robust probability-based reward signals that eliminate the need for handcrafted verifiers. A longer-term goal is to learn verifiable reward signals automatically from vertical-domain data.
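To make the probability-based reward concrete, here is a minimal sketch (with an assumed prompt template and a placeholder scorer model, not our actual reward model) that scores a rollout by the length-normalized probability a causal language model assigns to the reference answer after conditioning on the question and the generated chain of thought.

```python
# Sketch of a probability-based reward: no handcrafted verifier, just the
# probability the scorer assigns to the reference answer given question + reasoning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder scorer; any causal LM could be substituted
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def probability_reward(question: str, chain_of_thought: str, reference_answer: str) -> float:
    """Mean per-token probability of the reference answer, given question + reasoning."""
    prefix = f"Question: {question}\nReasoning: {chain_of_thought}\nAnswer:"
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(" " + reference_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    logits = lm(input_ids).logits
    # Keep only the logits at positions that predict each answer token.
    answer_logits = logits[0, prefix_ids.size(1) - 1 : -1]
    log_probs = torch.log_softmax(answer_logits, dim=-1)
    token_lp = log_probs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().exp().item()  # in (0, 1], usable directly as an RL reward

print(probability_reward("What is 2 + 3?", "Adding 2 and 3 gives 5.", "5"))
```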
Controllable and Personalized Generation: Relying on text-only prompts for content generation is limiting. We explore a fully specified, personalized, and domain-aware generation paradigm in which the conditioning signal is a complete specification: text and multimodal prompts, user context, interaction history, and domain constraints. Generated outputs can then be both aligned across modalities (e.g., spatiotemporal consistency in video or preference-guided 3D synthesis) and tailored to individual users’ preferences, styles, and long-term goals. This treats generation as intelligent conditional synthesis under evolving constraints, requiring new control mechanisms, personalization-aware representations, and evaluation protocols that assess specification compliance, temporal consistency, and user-aligned optimality rather than single-prompt fidelity.
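One way to picture the “complete specification” is as a structured object rather than a prompt string. The sketch below uses hypothetical field names to bundle text and multimodal prompts, user context, interaction history, and domain constraints, together with a toy specification-compliance score; it is illustrative only, not a defined interface.

```python
# Sketch of generation as conditional synthesis under a full specification.
from dataclasses import dataclass, field

@dataclass
class GenerationSpec:
    text_prompt: str
    multimodal_refs: list[str] = field(default_factory=list)     # e.g. reference images or audio clips
    user_context: dict = field(default_factory=dict)             # preferences, styles, long-term goals
    interaction_history: list[str] = field(default_factory=list)
    domain_constraints: dict = field(default_factory=dict)       # e.g. {"resolution": "1080p"}

def compliance_score(spec: GenerationSpec, output_metadata: dict) -> float:
    """Fraction of domain constraints the generated output satisfies;
    a stand-in for richer specification-compliance evaluation."""
    if not spec.domain_constraints:
        return 1.0
    satisfied = sum(
        1 for key, required in spec.domain_constraints.items()
        if output_metadata.get(key) == required
    )
    return satisfied / len(spec.domain_constraints)

spec = GenerationSpec(
    text_prompt="a short clip of a paper boat on a stream",
    domain_constraints={"resolution": "1080p", "duration_s": 8},
)
print(compliance_score(spec, {"resolution": "1080p", "duration_s": 6}))  # -> 0.5
```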
RAG for Fine-Grained Multimodal Retrieval: We explore a retrieval-augmented generation (RAG) framework that tackles fine-grained recognition as an iterative retrieve-and-reason process with a dual-tool interface. By alternating between candidate shortlisting via category-list retrieval and within-label verification via exemplar-image retrieval, the system accumulates additional retrieval evidence for ambiguous cases and scales test-time computation toward higher accuracy. The framework has been applied to biodiversity research, which requires fine-grained image classification over extremely large label spaces where many species are visually similar and labeled data is scarce.
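The control flow below is a minimal sketch of that dual-tool loop with placeholder tool and verifier implementations: shortlist labels via category-list retrieval, verify each candidate against retrieved exemplars, and widen the search when confidence stays low; function names and thresholds are assumptions for illustration.

```python
# Sketch of the iterative retrieve-and-reason loop with a dual-tool interface.
from typing import Callable

def classify_fine_grained(
    query_image,
    retrieve_candidate_labels: Callable,  # tool 1: category-list retrieval -> shortlist of labels
    retrieve_exemplars: Callable,         # tool 2: exemplar-image retrieval -> labeled exemplars
    verify: Callable,                     # reasoning step: score the query against the exemplars
    max_rounds: int = 3,
    confidence_threshold: float = 0.8,
) -> str:
    shortlist_size = 5
    best_label, best_score = None, float("-inf")
    for _ in range(max_rounds):
        # Tool 1: shortlist visually plausible categories for the query image.
        for label in retrieve_candidate_labels(query_image, shortlist_size):
            # Tool 2: fetch exemplars of this label and verify the candidate against them.
            score = verify(query_image, retrieve_exemplars(label, 4))
            if score > best_score:
                best_label, best_score = label, score
        if best_score >= confidence_threshold:
            break                 # confident enough; stop spending test-time compute
        shortlist_size *= 2       # ambiguous case: widen retrieval and reason again
    return best_label

# Toy usage with stub tools; strings stand in for images.
exemplar_db = {"sparrow_a": ["a_01", "a_02"], "sparrow_b": ["b_01"], "sparrow_c": ["c_01"]}
prediction = classify_fine_grained(
    "query_a_99",
    retrieve_candidate_labels=lambda img, k: list(exemplar_db)[:k],
    retrieve_exemplars=lambda label, k: exemplar_db[label][:k],
    verify=lambda img, exemplars: 0.9 if exemplars[0].startswith("a") else 0.2,
)
print(prediction)  # -> "sparrow_a"
```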