AI Castaway

OpenAI's Sora: Did That Really Happen?

Seeing is no longer believing, thanks to OpenAI's latest generative AI model.

Welcome to the inaugural installment of our new column, AI Castaway! Written by Microsoft Data Platform MVP, international speaker and technical consultant Ginger Grant, AI Castaway will bring a technologist's lens to AI's use and development, including how and when to use it, different implementations, and the impact of AI on applications and processes. If you would like to engage directly with Ginger around the topic of this and future installments, you can find her on X (formerly Twitter) @desertislesql.

Last week brought another announcement from OpenAI that will once again make it harder for people to determine whether what they are seeing is real or computer-generated.

OpenAI is giving people a new way to use AI: creating videos. This announcement takes CGI to a whole new level, as the computer-generated imagery can be created from text prompts, much like the ones you would type into ChatGPT. The new model is called "Sora."

The 'Sky' Is the Limit
Sora is an amazing new model that generates video from a text description. For those of you wondering whether Sora is an acronym, it is not; Sora is the Japanese word for "sky."

The videos Sora creates look incredible. OpenAI states that it can generate videos up to a minute in length and up to 2048x2048 in resolution. The video examples OpenAI included in its announcement show an outstanding level of detail.

What is truly amazing when looking at the sample videos in OpenAI's Sora unveiling is the variety of styles that are possible. The videos include cartoons with a Pixar look and feel, as well as detailed, realistic views of people, landscapes and environments, including one that looks like footage shot by a scuba diver.

You are not constrained by the real world; if you like, you can send fish swimming through a city or create a video of a digital Minecraft-style scene.

Sora includes a number of different background elements, so you can set your video in a desert or a forest. The model has also been built to handle different focal lengths and occlusion. This means people can walk in front of a dog sitting in a window while the focus stays on the dog and the passersby appear blurry. It also means the dog persists, looking the same before and after the people walk by.

Sora starts by creating a grainy, noisy-looking video and then sharpens it over successive steps, a process known as diffusion. Those of you who have created images with OpenAI's image generator, DALL-E, will find the workflow familiar, as Sora draws heavily on techniques developed for that model.
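To make that idea concrete, here is a toy sketch of diffusion-style refinement: start from pure noise and nudge it toward a clean result over many small steps. This is purely illustrative -- plain NumPy, with a fixed target standing in for the learned denoising network a real model uses -- and is not OpenAI's code or API.

```python
# Toy, conceptual sketch of diffusion-style generation (not Sora's actual code):
# begin with pure noise and repeatedly "denoise" toward a target, the way a
# diffusion model refines a grainy video into a sharp one. The blend schedule
# and the fixed target stand in for a learned denoising network.
import numpy as np

rng = np.random.default_rng(0)

target = rng.random((8, 8))          # stand-in for the "clean" frame a model would predict
frame = rng.normal(size=(8, 8))      # start from pure noise, like a grainy first pass

steps = 50
for t in range(steps):
    blend = (t + 1) / steps          # move a little closer to the clean estimate each step
    frame = (1 - blend) * frame + blend * target

print(np.abs(frame - target).max())  # ~0: the noise has been refined away
```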

Sora is a text-to-video generative model that relies on ChatGPT-style prompts to describe scenes. In addition to creating movies from descriptions, Sora can take a still picture, animate its contents and extend it by generating additional frames. It is also designed to extend a video forward and backward in time, which is all you need to create an endless loop like you might see in a GIF. Sora can even merge two videos on different topics into one longer video by gradually blending elements of one into the other until the contents are combined, as sketched below.
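That last capability, merging two videos, can be pictured as a gradual cross-fade. The toy sketch below does exactly that with two random NumPy "clips"; keep in mind this is only an analogy, since Sora's interpolation is learned inside the model rather than performed as a pixel blend.

```python
# Toy illustration of blending one clip into another, in the spirit of the
# video-to-video merging described above. This is a simple pixel cross-fade;
# Sora's actual interpolation happens inside the model, not as a pixel blend.
import numpy as np

rng = np.random.default_rng(1)

clip_a = rng.random((30, 16, 16, 3))   # 30 frames of a 16x16 RGB "video"
clip_b = rng.random((30, 16, 16, 3))   # a second clip of the same shape

merged = []
for i in range(30):
    w = i / 29                         # blend weight grows from 0 to 1 across the clip
    merged.append((1 - w) * clip_a[i] + w * clip_b[i])
merged = np.stack(merged)              # starts as clip_a, ends as clip_b

print(merged.shape)                    # (30, 16, 16, 3)
```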

Creatives at a Crossroads
The data used to train Sora consisted of actual videos recorded in their original aspect ratios. To be sure, Hollywood and the gaming industry will be investigating this tool heavily to determine whether their copyrights were violated when Sora was trained. In fact, as soon as OpenAI unveiled Sora last week, studios raised questions about the source of the training videos.

On the flip side, studios will also likely investigate how they can use Sora to cut the cost of CGI, the mainstay of modern special effects. It isn't far-fetched to expect Sora to lower the cost of making movies and games -- as well as the salaries of those who currently create them.

What CAN'T Sora Do?
There are some limitations to this initial version of Sora. OpenAI admits that the model has problems with physics; for example, it cannot realistically render glass breaking.

Eating is another action the AI does not understand: in a video of a person eating, the amount of food left on their plate does not decrease. If you generate a video of someone eating a sandwich, the sandwich may still look whole after they take a bite.

Unfortunately, the general public cannot try out Sora just yet, or even see a working demo. OpenAI is still working on safety measures before it is ready to release the model. For instance, it wants to make sure Sora cannot be used to generate sexual content, create videos with a celebrity as the subject, or reproduce someone else's movie or other copyrighted material.

OpenAI is currently testing and refining the model with a small group of testers, including MIT. As soon as it is available, I am guessing we will see the output on TikTok.

What Now?
When this technology is released to the public, it will be even harder to separate truth from fiction. Video generated by Sora can be so realistic that you could easily believe what you are seeing really happened.

Malformed hands are one of the few dead giveaways that a picture is AI-generated, and it will be interesting to see whether OpenAI ever figures out how to render hands properly in video. Right now, unless a video shows glass breaking or someone eating -- the two things OpenAI has admitted Sora struggles to depict -- it may be impossible to tell reality from AI.

About the Author

Ginger Grant is a Data Platform MVP who provides consulting services in advanced analytic solutions, including machine learning, data warehousing, and Power BI. She is the author of articles and books, blogs at DesertIsleSQL.com, and uses her Microsoft Certified Trainer (MCT) credential to provide data platform training on topics such as Azure Synapse Analytics, Python and Azure Machine Learning. You can find her on X/Twitter at @desertislesql.
