World Models and The Future of Film
AI will give birth to new forms of media. The craft of film will shift as we transition to production based on world models.
Every week a new advance, a new model, a new technique for prompting.
AI-generated videos are racking up hundreds of millions of views and creating a new category of ‘viral’, whether clever or Biblical.
But it isn’t all memes or baby Michael Jackson. AI “film” is approaching its moment.
We’re not at full-length movies yet, but generative AI is increasingly handling the challenges of character consistency, audio, lip sync, camera angles and shots, even lenses and special effects.
But tucked away behind the latest advances is the true ‘affordance’ of video AI: consistent world model construction.
It’s from world models that a new film grammar will emerge if not a new type of media entirely.
AI Consumes And Then It Creates
A few years ago I wrote that AI is unique: it’s the first type of media that both consumes the media that came before it and is a distinct medium in its own right.
Television followed radio. But the tools for TV production didn’t produce radio broadcasts.
AI, on the other hand, can produce books or podcasts, video or memes.
But AI itself is a type of media. The act of prompting is a media experience, even if you’re generating an image that you post on Reddit. The interfaces and wrappers built on top of the ‘generative’ part of AI have both the capacity to create media and are content distribution channels. Midjourney and Runway are the new Netflix.
But yes. We’re still early.
TV looked a lot like plays until it started to look like TV.
TV was the broadcast and not the camera, until we could buy a video camera and bring it home with us.
AI looks a lot like memes or videos. We replicate what came before.
The gap between the tools themselves (the ‘cameras’) being invented and being broadly distributed has been erased.
And so we’re in the home video camera era. Everyone has access to the tools of production. And yet we’re still mostly replicating what came before.
Spatial Reasoning As Film Grammar
Is it a video game? An interactive video? Instead of prompting an image and waiting for it to appear, what if AI generation happened in real-time while you engage with it?
Or what happens when you can edit any video by adding elements while preserving lighting and camera consistency?
Or swap out moods, lighting, backgrounds?
Well, first - it changes the burden of production. No more ‘reshoots’ or ‘fix it in post’. Editing a video is as trivial as typing in a prompt.
But behind the scenes, something more profound is going on: these capabilities are grounded in world models.
Which means that the next evolution of filmmaking isn’t planning your shot list, it’s planning your world model.
World Models As The New Back Lot
In the physical world, making a movie is an act of collaboration. Directors and cinematographers, actors and makeup artists.
In generative AI, the ‘tool-makers’ are experts in algorithms and process. And so today, a lot of the focus is on controlling the randomness inherent to generative AI:
How do I produce video with the right camera lens, panning shot or detail when I do a close-up?
How do I maintain character consistency across shots?
How can I get the characters’ voices to match their lips? How do I give them emotional depth?
But world models suggest that there will be a new way to think about AI-assisted filmmaking.
It might look a lot more like filming in Unreal Engine (on its own a form of technical magic - ask me about my experience filming on an LED set!).
“Directors” won’t just prompt ‘shots’, they’ll prompt the characters, locations and items, the narrative world models, the backstories.
“Filming” won’t just be about stitching together a bunch of video clips, it will be an act of creating world structures and then exploring those world models to find the right shot, the right tempo.
The craft of AI filmmaking will return to being a collaborative enterprise. Today, the joy might be that any single person can generate a decent video in their home office. But tomorrow, the “stack” will be deeper and interoperable.
Today, you generate a character (creating still images, say) and you write some dialogue.
Tomorrow, you’ll grab an interoperable character asset and they’ll ‘act’ in your film. Three different characters might give three different outcomes, much like casting different actors can result in radically different movies.
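To make the idea concrete, here’s a purely speculative sketch of what an “interoperable character asset” might look like as a data structure. Every name here (`CharacterAsset`, `castCharacter`, the fields, the sample detective) is hypothetical, invented for illustration — no such standard exists today:

```typescript
// Hypothetical shape of an interoperable character asset —
// a portable bundle of appearance, voice, and persona that
// any compatible generation tool could "cast" into a scene.
interface CharacterAsset {
  id: string;
  appearance: { description: string; referenceImages: string[] };
  voice: { timbre: string; accent: string };
  persona: { backstory: string; mannerisms: string[] };
}

// Stand-in for the generative step: the same scene direction,
// given a different asset, would produce a different performance.
function castCharacter(asset: CharacterAsset, sceneDirection: string): string {
  return `${asset.id} performs "${sceneDirection}" with ${asset.persona.mannerisms[0]}`;
}

const detective: CharacterAsset = {
  id: "detective-01",
  appearance: { description: "weathered trench coat, tired eyes", referenceImages: [] },
  voice: { timbre: "gravelly", accent: "neutral" },
  persona: {
    backstory: "ex-cop turned private investigator",
    mannerisms: ["slow, deliberate speech"],
  },
};

console.log(castCharacter(detective, "enter the room"));
```

Swapping `detective` for a different asset is the structural equivalent of recasting an actor: the same direction, a radically different film.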
World Models Are The Frontier of AI. “Film” Is Already There
Behind all of this is the emergence of world models as the next frontier for AI - one that will power a new era of world simulations, robotics and, yes, the chats you have with your AI companion.
Today, AI hallucinates and has no real frame of reference other than the ‘next token’. Most of the work ‘around’ AI (especially LLMs) is on stuffing the context window, building memory, trying to brute-force a frame of reference.
AI film and images are already at the leading edge of where AI is headed: AI that has a ‘frame of reference’ because it’s grounded in world models.
These world models will be the source of the next shift in media. It’s similar to the shift from “TV-as-play” to “TV-as-TV”.
AI will find its new ‘form’ when we give creators the tools to bring narrative, spatial, and production affordances into world models.
What they create might look a lot like film.
But just like Netflix shows don’t look much like Leave It To Beaver, tomorrow’s hits will clearly have their own ‘film grammar’, distinct connections to their audiences and fans, and new channels for their consumption.
We have a new art form emerging. Look to world models to figure out where it’s headed.
I’ve been deep into building with AI, including a world-building narrative stack (SOJEN).
I love getting email and starting a conversation, and it’s more interesting when it’s with a real person. Feel free to comment on the Substack app, email me at doug@sojen.io or message me on X.
Let's chat.