The next AI revolution could start with world models
This Scientific American piece argues that many modern AI failures come from a simple gap: many systems are excellent at predicting the next token or the next frame, but they maintain no persistent, internally consistent representation of the world. That’s why generated videos can “morph” objects (a dog’s collar vanishes, a chair becomes a different chair) and why today’s vision-language models often struggle with basic physical reasoning.
“World models” are presented as a direction that could close that gap, especially via 4D representations (3D plus time) that update as inputs stream in. The article connects recent work, such as NeRF-style approaches and new preprints that reconstruct dynamic 4D scene models from monocular video, which are then used to stabilize generation or synthesize novel viewpoints. Beyond video, the same idea shows up as a practical requirement for augmented reality (stable virtual objects and occlusion), robotics and self-driving (better navigation and prediction), and even longer-term agentic systems that need memory and planning grounded in an evolving scene.
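To make the core idea concrete, here is a minimal toy sketch (not from the article; all names are hypothetical) of what “persistent, internally consistent” means in practice: a scene memory that assigns each detected object a stable identity and updates it as new frames stream in, rather than re-generating the scene from scratch every frame. Real 4D scene models are far richer, but the contrast with framewise prediction is the same.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    # Persistent identity: a stable id, a 3D position, and the time last observed.
    obj_id: int
    position: tuple
    last_seen: float


class WorldState:
    """Toy persistent scene memory: objects keep their identity across
    frames instead of being re-created each time. Purely illustrative."""

    def __init__(self, match_radius: float = 1.0):
        self.match_radius = match_radius
        self.objects: dict[int, TrackedObject] = {}
        self._next_id = 0

    def update(self, t: float, detections: list[tuple]) -> list[int]:
        """Associate each 3D detection with a nearby existing object,
        or mint a new identity. Returns the ids seen at time t."""
        seen = []
        for pos in detections:
            match = None
            for obj in self.objects.values():
                dist = sum((a - b) ** 2 for a, b in zip(pos, obj.position)) ** 0.5
                if dist <= self.match_radius:
                    match = obj
                    break
            if match is None:
                match = TrackedObject(self._next_id, pos, t)
                self.objects[self._next_id] = match
                self._next_id += 1
            match.position = pos
            match.last_seen = t
            seen.append(match.obj_id)
        return seen


world = WorldState()
ids_t0 = world.update(0.0, [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
ids_t1 = world.update(1.0, [(0.2, 0.0, 0.0), (5.1, 0.0, 0.0)])
# Both objects keep the same identity across frames: no vanishing collar,
# no chair silently becoming a different chair.
assert ids_t0 == ids_t1
```

The design choice that matters here is that identity lives in the world state, not in the per-frame output: a generator conditioned on such a state has something stable to be consistent with.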