Meta announces Make-A-Video, which generates video from text

Still image from an AI-generated video of a teddy bear painting a portrait.

Today, Meta announced Make-A-Video, an AI-powered video generator that can create novel video content from text or image prompts, similar to existing image synthesis tools like DALL-E and Stable Diffusion. It can also make variations of existing videos, though it’s not yet available for public use.

On Make-A-Video’s announcement page, Meta shows example videos generated from text, including “a young couple walking in heavy rain” and “a teddy bear painting a portrait.” It also showcases Make-A-Video’s ability to take a static source image and animate it. For example, a still photo of a sea turtle, once processed through the AI model, can appear to be swimming.

The key to Make-A-Video, and the reason it has arrived sooner than some experts anticipated, is that it builds on existing text-to-image synthesis work of the kind used in image generators like OpenAI’s DALL-E. In July, Meta announced its own text-to-image AI model called Make-A-Scene.

Rather than training the Make-A-Video model on labeled video data (for example, clips with captioned descriptions of the actions depicted), Meta started from image synthesis data (still images paired with captions) and added unlabeled video training data so that the model learns how the subject of a text or image prompt might exist in time and space. It can then predict what comes after the image and display the scene in motion for a short period.

A video of a teddy bear painting a portrait, created with Meta’s Make-A-Video AI model (converted to GIF for display here).
A video of “a young couple walking in heavy rain,” created with Make-A-Video.
Video of a sea turtle, animated from a still image with Make-A-Video.

“Using function-preserving transformations, we extend the spatial layers at the model initialization stage to include temporal information,” Meta wrote in a white paper. “The extended spatial-temporal network includes new attention modules that learn temporal world dynamics from a collection of videos.”
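
To make the phrase “function-preserving” more concrete, here is a minimal Python sketch of one way such an extension could work. It is an illustration under stated assumptions, not Meta’s code: the toy `spatial_layer`, the identity-initialized temporal kernel, and every other name here are hypothetical. The idea shown is that a new time-mixing step can be added after a per-frame layer and initialized so that it changes nothing at first, meaning the extended network starts out computing exactly what the image-only network did.

```python
# Toy illustration only: all names and values are hypothetical, not from Meta's paper.

def spatial_layer(frame, weight=2.0, bias=1.0):
    """Stand-in for a pretrained image layer, applied to each frame on its own."""
    return [weight * pixel + bias for pixel in frame]

def identity_temporal_kernel(size=3):
    """Temporal kernel with a 1 at the centre tap and 0 elsewhere (acts as identity)."""
    kernel = [0.0] * size
    kernel[size // 2] = 1.0
    return kernel

def temporal_conv(frames, kernel):
    """1D convolution along the time axis, per pixel, zero-padded at the clip edges."""
    half = len(kernel) // 2
    num_frames = len(frames)
    num_pixels = len(frames[0])
    out = []
    for t in range(num_frames):
        mixed = [0.0] * num_pixels
        for k, w in enumerate(kernel):
            src = t + k - half  # index of the frame this kernel tap reads from
            if 0 <= src < num_frames:
                for p in range(num_pixels):
                    mixed[p] += w * frames[src][p]
        out.append(mixed)
    return out

def extended_layer(video, kernel):
    """Spatial layer on every frame, followed by the new temporal mixing step."""
    spatial_out = [spatial_layer(frame) for frame in video]
    return temporal_conv(spatial_out, kernel)

# A three-frame "video" with two pixels per frame.
video = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

per_frame = [spatial_layer(frame) for frame in video]
at_init = extended_layer(video, identity_temporal_kernel())
assert at_init == per_frame  # the extension preserves the original function

# After training, the kernel may blend neighbouring frames (values assumed here).
trained_kernel = [0.25, 0.5, 0.25]
after_training = extended_layer(video, trained_kernel)
```

With the identity kernel, the extended layer returns the same values as the per-frame layer, so nothing learned from still images is lost at the start. Once the kernel drifts away from the identity during training on video, each output frame begins to depend on its neighbors in time.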

Meta has not announced how or when Make-A-Video might become available to the public or who would have access to it. It does provide a sign-up form for people interested in trying it in the future.

Meta acknowledges that the ability to create photorealistic videos on demand presents certain social hazards. At the bottom of the announcement page, Meta says that all AI-generated video content from Make-A-Video contains a watermark to “help ensure viewers know the video was generated with AI and is not a captured video.”

If history is any guide, competitive open source text-to-video models may follow (some, like CogVideo, already exist), which could make Meta’s watermark safeguard irrelevant.
