Colloquium - Manling Li, "Mechanistic Science of Multimodal Models"
From cs-speakerseries
Abstract:
Multimodal alignment is often treated as a black-box training recipe. We argue that the key question is mechanistic: how does cross-modal alignment form inside the model, and when does it preserve geometry versus collapse into language-driven shortcuts? We study alignment as a representation learning problem: understanding how it emerges in internal states, shaping it with geometry-structured priors, and controlling it so that it remains reliable. First, we open up multimodal models to probe the internal mechanisms of alignment and to intervene when geometry-relevant structure is preserved or lost. Second, we introduce VAGEN, which frames multimodal representation learning as multi-stage prior injection: injecting vision and map/world-model priors so that latent states become structured abstractions that support “think as a map”, not just token-level matching. Finally, we present ODE-Steer, which controls alignment by steering internal activations into a safe subregion defined by control barrier functions, where representations and reasoning remain controllable. True multimodal intelligence requires more than aligning tokens; it requires aligning the internal physics of the model with the geometry of the world.
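
To make the last idea concrete, here is a minimal Python sketch of control-barrier-function (CBF) steering applied to a hidden-activation update. It is an illustration under stated assumptions, not the ODE-Steer implementation from the talk: the linear barrier h(x) = c - w.x, the "unsafe" direction w, and the decay rate alpha are all hypothetical choices made for the example.

    # Minimal sketch of CBF-style activation steering (illustrative only).
    # Safe set: h(x) = c - w @ x >= 0, a half-space of activation space.
    import numpy as np

    def cbf_steer(x, dx_nom, w, c, alpha=0.5):
        """Minimally correct a nominal update dx_nom so the discrete CBF
        condition h(x + dx) >= (1 - alpha) * h(x) holds."""
        h = c - w @ x                    # barrier value at the current state
        margin = alpha * h               # allowed decrease of h in one step
        violation = w @ dx_nom - margin  # > 0 means the nominal step is unsafe
        if violation <= 0:
            return dx_nom                # nominal update is already safe
        # Closed-form solution of the CBF quadratic program for a linear
        # barrier: project dx_nom onto the constraint w @ dx <= alpha * h(x).
        return dx_nom - (violation / (w @ w)) * w

    # Toy usage: a 4-d "activation" whose nominal updates drift toward the
    # unsafe side; steering keeps it inside the safe half-space.
    rng = np.random.default_rng(0)
    w = np.array([1.0, 0.0, 0.0, 0.0])   # assumed unsafe direction
    c = 1.0                              # safe set: x[0] <= 1
    x = np.zeros(4)
    for _ in range(10):
        dx = cbf_steer(x, rng.normal(0.3, 0.1, size=4), w, c)
        x = x + dx
    assert c - w @ x >= 0.0              # barrier is never crossed

Because the barrier here is linear, the usual CBF quadratic program has the closed-form projection shown above; a learned, nonlinear barrier would need a QP solver at each step.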