Dall-E, Craiyon, Midjourney… If you’ve spent any time on social networks in recent weeks, you’ve probably already heard of these artificial intelligences dedicated to creating images. Given a simple text prompt, these programs create an image from scratch that attempts to illustrate the request. Want GoPro photos of the French Revolution? No problem. A painting of Karl Marx in the style of Eugène Delacroix? Dall-E takes care of that. A disturbing drawing of Sonic seizing the One Ring? Right this way!
The possibilities offered by these new AIs are dizzying, and each has its own specialty and style. These programs are so successful that even TikTok recently added a similar feature to illustrate your videos. Google is working on not one but two AIs of the same kind, Parti and Imagen, which produce images with sometimes startling photorealism.
A surge on social networks
While the first version of Dall-E was unveiled in 2021, the explosion in popularity of these entirely computer-generated images took place when OpenAI (a company co-founded by Elon Musk) introduced its Dall-E 2 tool in April 2022. Because that software was reserved for a small elite, it was quickly imitated, and the illustrations of Dall-E Mini, which had the advantage of being accessible to everyone, flooded social networks. The service has even renamed itself Craiyon to avoid confusion with the Dall-E project, with which it is not affiliated at all.
“Many technical barriers have been broken thanks to advances in research,” explains Juliette Rengot, a computer engineer specializing in machine learning and image processing. “The ability of computers to manage and process a large amount of data simultaneously has also facilitated the emergence of this kind of tool,” adds the expert. Graphics processors, chips dedicated to artificial intelligence and the overall progress in the raw computing power of our machines have therefore helped to give birth to these tools.
“The concept is to train two models concurrently: a ‘generator’, which receives random input data and generates an output image, and a ‘discriminator’, which receives the image and judges whether it is realistic or not,” details Juliette Rengot. Over time, the generator improves by learning from the discriminator’s verdicts. Once training is over, only the generator is kept, and it becomes the AI.
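The generator-and-discriminator loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the code of Dall-E, Midjourney or any real tool: the "images" are single numbers, the "real" data is assumed to be numbers clustered around 4, and every name (Generator, Discriminator, train) is hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class Generator:
    """Toy generator: turns random noise z into the number a*z + b."""

    def __init__(self) -> None:
        self.a, self.b = 1.0, 0.0

    def sample(self, z: float) -> float:
        return self.a * z + self.b


class Discriminator:
    """Toy discriminator: estimated probability that x is 'real'."""

    def __init__(self) -> None:
        self.w, self.c = 0.1, 0.0

    def judge(self, x: float) -> float:
        return sigmoid(self.w * x + self.c)


def train(steps: int = 2000, lr: float = 0.01, seed: int = 0):
    """Train both toy models concurrently and return them."""
    rng = random.Random(seed)
    gen, disc = Generator(), Discriminator()
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)  # assumed stand-in for a real image
        z = rng.gauss(0.0, 1.0)     # random input data
        fake = gen.sample(z)

        # Discriminator step: push judge(real) toward 1, judge(fake) toward 0.
        # Loss: -log(p_real) - log(1 - p_fake)
        p_real, p_fake = disc.judge(real), disc.judge(fake)
        disc.w -= lr * (-(1.0 - p_real) * real + p_fake * fake)
        disc.c -= lr * (-(1.0 - p_real) + p_fake)

        # Generator step: learn from the discriminator's verdict.
        # Loss: -log(judge(fake)), so the generator is rewarded for fooling it.
        p_fake = disc.judge(fake)
        grad_fake = -(1.0 - p_fake) * disc.w
        gen.a -= lr * grad_fake * z
        gen.b -= lr * grad_fake
    return gen, disc


if __name__ == "__main__":
    gen, disc = train()
    sample = gen.sample(0.0)
    print("generated sample:", sample, "verdict:", disc.judge(sample))
```

Once training is done, only `gen` would be kept, which mirrors the idea that the discriminator serves purely as a training partner. Real systems replace these two-parameter models with large neural networks and the numbers with images.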
Depending on the data ingested by the generator and the discriminator, the “personality” of these AIs can change. Midjourney, for example, produces strange, spooky images, whereas Dall-E tends to lean more toward realism. This is the result not only of the particular configuration of the model, but also of the database to which it had access. If the discriminator has been trained on cartoon images, for example, it will only give the generator a green light when the generator produces images resembling what the discriminator already knows.
The other side of these AIs is of course the understanding of the input data, in our case the text descriptions provided by Internet users. By scanning the text corpora available for machine learning, the generator will learn to grasp the meaning of a text, to understand the differences between several synonyms, to know how a sentence is typically constructed. This allows the machine to have a clear idea of what users are looking for.
If you’ve ever used Dall-E, Midjourney, or other such AIs, you may have noticed that some phrasings trip up the machines more often than others. To speak well to the machine, “you have to try to be as close as possible to the training data, and at least respect the rules of grammar and spelling. In fact, you should use the most common vocabulary possible,” advises Juliette Rengot.
Limitations in understanding
This is not the only limitation of these AIs. In a paper detailing how Parti works, Google engineers explain, for example, that “in complex scenes, the model can sometimes omit certain details mentioned, duplicate them or hallucinate things that are not mentioned”. When Parti was asked to create an image containing tennis balls, for example, most results placed them on a clay court or lawn, even though the descriptions did not mention this aspect. This is simply because the most common images of tennis balls contain a background of this type.
“The beauty is that the model can find relationships in the data that humans couldn’t find,” explains Juliette Rengot. In other cases, misunderstandings caused by the wording give funny (or disturbing, depending on your point of view) results. For example, when asked to create an image of a horse riding an astronaut, Parti produced an image of a horse astronaut riding another horse, the term “riding” often being linked to being on the back of a horse. Faced with the request for a horse sitting on the shoulders of an astronaut, Parti fared much better.
“The big limitation is managing to produce full-size images. But it is not insurmountable,” says Juliette Rengot. Indeed, the more detailed the image, the greater the risk of visible errors that break the impression of photorealism. This is why human faces are so often mangled by AI. “Faces are really a particularly difficult subject. Skin is biological and living, so it’s much more variable than an object. In addition, the human brain is made to recognize faces, so we see the defects much more quickly.” A seemingly innocuous error in the representation of a face will quickly appear disturbing to the human eye. This is the principle of the uncanny valley.
Multiple uses in sight
Even if the results of Dall-E, Midjourney and others often seem stunning, we are far from the fantasy of a conscious AI that would understand everything that is said to it and be able to do everything. Firstly because, as we have seen, these AIs are limited by the data they are fed. But mostly because these tools are actually different bricks stacked on top of one another, all wearing one big trench coat to give the illusion that the software has only one head.
“A lot of the time, when you develop this kind of thing, you start with a specific point and then, little by little, you add bricks. This makes it possible to build proofs of concept and to integrate the work of other research teams,” explains the machine learning specialist. For example, one AI brick knows how to reproduce objects and animals perfectly well, another knows the stylistic differences between the great painters, and another specializes in reconstructing natural landscapes. By working together, they are able to produce a painting of a fox sitting in the middle of a field in the style of Claude Monet.
Beyond the technically stunning nature of these AIs, computer image generation also has very concrete advantages. “When we generate faces, they don’t belong to anyone, so they are not protected by the GDPR. This allows you to create avatars without taking risks,” points out Juliette Rengot. We can also imagine AIs that illustrate press articles, for example. One thing is certain: these artificial intelligence models are still in their infancy, and the coming years will probably hold surprises in the field. Let’s hope that this new generation of AI will have learned from the old ones and will do a little better at fighting sexist and racist clichés.