This Friday, September 30, Meta unveiled an artificial intelligence (AI) system that generates short videos from simple text instructions. Concretely, “Make-a-Video” lets you type in a prompt like “A dog wearing a superhero outfit with a red cape flies through the sky”, and it then generates a five-second video clip that, while accurate enough, takes on the aesthetics of an old amateur video.
Even if the result is rather rudimentary, the system offers a first glimpse of what generative artificial intelligence could deliver in the future. Clearly, this will be the next step beyond the text-to-image AI systems that have generated so much buzz this year.
Meta’s announcement about “Make-a-Video” – which is not yet available to the general public – will likely spur other AI labs to develop their own versions. This technology also raises major ethical questions.
In the past month alone, AI lab OpenAI has made its latest text-to-image artificial intelligence system, “DALL-E”, available to everyone, while the start-up Stability.AI has launched “Stable Diffusion”, an open-source image generation system.
However, text-to-video AI faces significant challenges. First of all, these models require enormous computing power. They demand an even greater computational load than large text-to-image AI models – which are trained on millions of images – because generating a single video requires hundreds of frames. This means that, in the near future, only big tech companies will be able to afford to develop these systems. They are also harder to train, since large-scale datasets of high-quality video paired with text do not exist.
To get around this problem, Meta combined data from three open-access image and video datasets to train its model. Standard datasets of still images helped the AI learn the names of objects and what they look like, while a database of videos helped it learn how those objects are supposed to move. In a paper that has not yet been peer-reviewed, Meta researchers explain that combining these two approaches allowed “Make-a-Video” to generate videos from text at scale.
Tanmay Gupta, a computer vision researcher at the Allen Institute for Artificial Intelligence, thinks the results displayed by Meta are promising. Videos shared by the company show that the model can capture 3D shapes when the camera is rotated. The model also has some notion of depth and lighting. According to Tanmay Gupta, certain details and movements are well done and are convincing.
However, “the research community still has a long way to go, especially if these systems are to be used for video editing and professional content creation,” he says. It is still difficult to model, in particular, the complex interactions between objects.
In another video generated from the simple instruction “An artist’s brush paints on a canvas”, we can see that the brush does indeed move on the canvas but the strokes drawn are not realistic. “I would like these models to be able to generate an interaction sequence such as ‘The man takes a book from the shelf, puts on his glasses and sits down to read it while drinking a cup of coffee'”, explains Tanmay Gupta.
For its part, Meta says that this technology will “open up new possibilities for creators and artists”. But as the technology develops, observers fear it will be exploited as a powerful tool to create and spread false information. With it, it could become even more difficult to tell true from false on the Internet.
Meta’s model ups the ante when it comes to generative AI both technically and creatively but also “in terms of the unique harms that could be caused by generated video as opposed to still images,” says Henry Ajder, a synthetic media expert.
“Today, creating factually inaccurate content that people might believe still takes some effort,” adds Tanmay Gupta. “In the future, it may be possible to create misleading content just by typing a few words on a keyboard.”
The developers of “Make-a-Video” have added filters to hide offensive images and words. However, with datasets consisting of millions of words and images, it is almost impossible to completely eliminate biased and harmful content.
A Meta spokesperson says the company has yet to make the model available to the general public and that “as part of this research, [it] will continue to explore ways to refine and mitigate potential risks.”
An article by Melissa Heikkilä, translated from English by Kozi Pastakia.