A few thoughts on Sora

Yesterday, OpenAI announced Sora, a new product that generates realistic video from text prompts.¹ The examples are remarkable.

A TV writer friend texted me to ask “is it time to be petrified?”

I wrote back:

I don’t think you need to be petrified. It’s very impressive at creating video in a way that’s like how Dall-E does images. A huge achievement. For pre-viz? Mood reels? Incredible. We’ll see stuff coming out of it used in commercials first.

For longer, narrative stuff, there’s a real challenge moving from text generation (GPT-4 putting together something that looks like a script) to “filming” that script with these tools to resemble anything like our movies and television.

Writers, directors, actors and crew have a sense of why they’re doing what they’re doing, and what makes sense in this fictitious reality they’re creating. I don’t think you can do that without consciousness, without self-awareness, and if/when AI gets there, stuff like Sora will be the least of our concerns.

With a night to sleep on it, I think there are a few larger, more immediate concerns. Writers (and humans in general) should be aware of but not petrified by some of the implications of this technology beyond the obvious ones like deepfakes and disinformation.

Video as input. Like image generators, this technology can work off of a text prompt. But you can also feed it video and have it change things. Do you want A Few Good Men, but with Muppets? Done. Need to replace Kevin Spacey in a movie? No need to reshoot anything. Just let Sora do it.
Remake vs. refresh. Similarly, any existing film or television episode could be “redone” with this technology. In some cases, that could mean a restoration or visual effects refresh, like George Lucas did with Star Wars. Or it could be what we’d consider a remake, where the original writer gets paid. What’s the difference between a refresh and a remake, and who decides?
Animation vs. live action. How do we define the video material that comes out of Sora? It can look like live action, but wasn’t filmed with cameras. It can look like animation, but it didn’t come out of an animation process. This matters because while the WGA represents writers of both live action and animation, studios are not currently required to use WGA writers in animation. We can’t let this technology be used as an end-run around WGA (and other guild) jurisdiction.
Reality engines. In a second paper, OpenAI notes that Sora could point to “general purpose simulators of the physical world.” The implications go far beyond any disruptive effects on Hollywood, and are worth a closer look.

It seems like a long way to go from videos of cute paper craft turtles to The Matrix, but it’s worth taking the progress they’ve made here seriously. In generating video, Sora does a few things that are really difficult, and resemble human developmental milestones.

Like all models, Sora is predictive, making guesses about what just happened and what happens next. But it feels different because it’s doing this in a 3D space that largely tracks with our lived experience. It remembers objects, even if they’re not on screen at the moment, and recognizes interactions between objects, such as paintbrushes leaving marks on the canvas.²

Sora makes mistakes, but the results are surprisingly good for a system that wasn’t explicitly trained to do anything other than generate video. Those capabilities could be used to do other things. In a jargon-heavy paragraph, OpenAI notes:

Sora is also able to simulate artificial processes — one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

Sora “gets” Minecraft because it’s ingested countless hours of Minecraft videos. If it’s able to create a simulation of the game that is indistinguishable from the original, is there really a difference? If it’s able to create a convincing simulation of reality based on the endless video it scrapes, what are the implications for “our” reality?

These are questions for philosophers, sure, but we’re all going to be faced with them sooner than we’d like. Sora and its descendants are going to have an impact beyond the cool video they generate.

¹ Sora is a great name, btw. It doesn’t mean anything, and doesn’t have any specific connotation, yet feels like something that should exist.
² Not to dive too deeply into theories of human consciousness, but the ability to internally model reality and predict things feels like table stakes.
