Stability AI has announced the release of Stable Diffusion version 2.0, which brings many improvements. The most important new feature is the new OpenCLIP text encoder, which significantly improves the quality of generated images.
In August 2022, the startup Stability AI, together with RunwayML, LMU Munich, EleutherAI, and LAION, released Stable Diffusion, making the first version available to researchers. Stability AI set out to build a DALL-E 2 alternative, and it may have ended up doing much more. For some analysts, Stable Diffusion embodies the best of the AI art world: it is arguably the best open-source AI art model in existence. "It's simply unheard of, and it will have enormous consequences," says one of them.
Stable Diffusion is a text-to-image latent diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, the researchers were able to train a latent diffusion model on 512×512 images from a subset of the LAION-5B database. Similar to Google's Imagen, the model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts. With its 860M-parameter UNet and 123M-parameter text encoder, the model is relatively lightweight and runs on a GPU with at least 10 GB of VRAM.
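Those parameter counts give a rough sense of why the model fits on consumer GPUs. A back-of-the-envelope sketch, assuming half-precision (fp16) weights at 2 bytes per parameter; actual VRAM use is higher because of the VAE, activations, and framework overhead:

```python
# Approximate weight memory for the two components named above,
# assuming fp16 storage (2 bytes per parameter).
BYTES_PER_PARAM_FP16 = 2

def weight_gib(num_params: int) -> float:
    """Return the approximate weight size in GiB for a parameter count."""
    return num_params * BYTES_PER_PARAM_FP16 / 1024**3

unet_gib = weight_gib(860_000_000)       # UNet, ~1.60 GiB
text_enc_gib = weight_gib(123_000_000)   # text encoder, ~0.23 GiB
print(f"UNet: {unet_gib:.2f} GiB, text encoder: {text_enc_gib:.2f} GiB")
```

Even with the other pipeline components and activation memory on top, this leaves comfortable headroom on a 10 GB card.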
Stable Diffusion can be used online, for a fee and with content filters, or downloaded for free and used locally without content restrictions. Version 2.0 continues this open-source approach, with Stability AI leading the way.
Improved text encoder and new image modes
For version 2.0, the team used OpenCLIP (an open implementation of Contrastive Language-Image Pre-training), an enhanced version of the multimodal AI system that learns visual concepts from natural language in a self-supervised way. OpenCLIP was released by LAION in three versions in mid-September and is now used in Stable Diffusion; Stability AI supported its training. CLIP models can compute representations of images and texts as embeddings and compare their similarity. This is how an AI system can generate an image that matches a text.
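The comparison step can be sketched in a few lines. A minimal illustration of embedding similarity with NumPy; the vectors below are made up for the example, whereas a real CLIP model would produce them from actual images and captions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings standing in for CLIP outputs.
image_embedding = np.array([0.9, 0.1, 0.3])            # e.g. a photo of a cat
text_embeddings = {
    "a photo of a cat": np.array([0.8, 0.2, 0.4]),
    "a photo of a dog": np.array([0.1, 0.9, 0.2]),
}

# The caption whose embedding is closest to the image embedding "matches" it.
best = max(text_embeddings,
           key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]))
print(best)  # → a photo of a cat
```

In a diffusion pipeline this works in the other direction: the text embedding conditions the image generation so that the output scores high on exactly this kind of similarity.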
Thanks to this new text encoder, Stable Diffusion 2.0 can generate significantly better images than version 1, according to Stability AI. The model can generate images at resolutions of 512×512 and 768×768 pixels, which can then be scaled up to 2048×2048 pixels by a diffusion upscaler model, which is also new. The new OpenCLIP model was trained on a high-quality dataset compiled by Stability AI from the LAION-5B dataset, from which sexual and pornographic content was filtered out beforehand.
The depth-to-image model
Also new is a depth-to-image model, which estimates the depth of an input image and then uses a text prompt to transform it into new variations that preserve the outlines of the original. Version 2.0 of Stable Diffusion also comes with an inpainting model that can be used to replace individual elements within an existing image, for example to paint a cap or a VR headset onto someone's head.
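Conceptually, the depth-to-image model conditions generation on a monocular depth map estimated from the input image, which is fed to the denoising UNet alongside the usual latents. A minimal NumPy sketch of that conditioning step, with made-up shapes and values; this illustrates the idea only, not Stability AI's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins: a 4-channel latent image and a 1-channel depth map,
# both at the UNet's working resolution (64x64 for 512x512-pixel output).
latents = rng.normal(size=(4, 64, 64))       # noisy image latents
depth_map = rng.uniform(size=(1, 64, 64))    # estimated depth of the input image

# Depth conditioning: the depth map is stacked onto the latents as an extra
# input channel, so the denoising network sees 5 channels instead of 4.
unet_input = np.concatenate([latents, depth_map], axis=0)
print(unet_input.shape)  # → (5, 64, 64)
```

Because the depth map encodes the coarse geometry of the original image, the generated variations keep its outlines while the text prompt drives the new content.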
"We've already seen that when millions of people get their hands on these models, they collectively create some truly amazing things. That's the power of open source: harnessing the vast potential of millions of talented people who may not have the resources to train a cutting-edge model, but who have the ability to do something amazing with one," says Stability AI.
Stability AI was born to create not just research models that never make it into the hands of the majority, but tools with real-world applications that are open to users. This is a departure from other tech companies such as OpenAI, which jealously guards its best systems (GPT-3 and DALL-E 2) behind private betas, or Google, which never intended to release its own systems (PaLM, LaMDA, Imagen, or Parti) at all.
Stable Diffusion: the limits of the model
Despite the many improvements, and although Stable Diffusion has state-of-the-art capabilities, there are still tasks at which it falls short of other models. Version 2.0 of Stable Diffusion should still run locally on a single powerful graphics card with sufficient memory.
Like many image-generation models, Stable Diffusion has limitations created by a number of factors, including the natural limits of the image dataset used during training, biases introduced by the developers on those images, and safeguards built into the model to prevent misuse.
Training set limits
The training data used for an image-generation model will always have a significant impact on the scope of its capabilities. Even when working with a dataset as large as the LAION-2B(en) subset used to train Stable Diffusion, it is possible to confuse the model by referring in the prompt to image types it has not seen. Features that were not included in the initial training stage will be impossible to recreate, as the model has no understanding of them. Human figures and faces are the clearest example of this: the model was not trained with the explicit aim of refining these features in its generated results.
Bias introduced by researchers
The researchers behind Stable Diffusion recognized the social biases inherent in such a task. Notably, Stable Diffusion v1 was trained on subsets of the LAION-2B(en) dataset, whose data is almost entirely in English. They note that, as a consequence, texts and images from non-English-speaking cultures and communities are largely ignored.
This choice to focus on English-language data allows for a stronger connection between English-language prompts and the results, but it simultaneously skews the output, making Western, English-speaking cultural influences the dominant features of the generated images. Likewise, the model's ability to handle prompts in languages other than English is inhibited by this training paradigm.
Source: Stability AI
What do you think of version 2 of Stable Diffusion?
Do you see in this model the capacity to dethrone OpenAI's models, GPT-3 and DALL-E 2?
See also:
Stability AI's Stable Diffusion may be the most important AI model ever; unlike GPT-3 and DALL-E 2, it brings open real-world applications to users
DALL-E Mini is said to be the internet's favorite AI meme machine; the image-generating app helps us understand how AI can distort reality