The new text-to-image AI developed by Chinese internet company Baidu can generate images of Chinese objects and celebrities more accurately than existing AIs. But a built-in censorship mechanism filters out politically sensitive words. The model, called ERNIE-ViLG, rejects politically sensitive prompts such as Tiananmen Square, the country’s second largest square and a symbolic political hub. Instead of an image, it returns a notice that the user’s input does not meet the relevant rules.
Image synthesis has recently proven popular (and very controversial) on social media and in online art communities. Tools like Stable Diffusion and DALL-E 2 let users create images of almost anything they can imagine by entering a text description called a “prompt”. In 2021, the Chinese technology company Baidu developed its own image synthesis model, ERNIE-ViLG, which can generate images that capture the cultural specificity of China. It reportedly produces better images of such subjects than DALL-E 2 or other Western AIs of the same type.
When a demo version of the software was released in late August, users quickly found that certain words, both explicit mentions of political leaders’ names and words that are controversial only in a political context, were flagged as “sensitive” and blocked from generating any result. It would seem that China’s sophisticated online censorship system has caught up with the latest trend in AI. According to a test report published by MIT Technology Review, it is not uncommon for similar AIs to block users from generating certain types of content.
DALL-E 2, for example, prohibits sexual content, faces of public figures, and images of medical treatment. But the case of ERNIE-ViLG raises the question of where exactly the line lies between moderation and political censorship. According to technical data provided by Baidu, the ERNIE-ViLG model is part of Wenxin, a large-scale natural language processing (NLP) project conducted by Baidu. It was trained on a dataset of 145 million image-text pairs and contains 10 billion parameters. ERNIE-ViLG has a smaller training dataset than DALL-E 2 (650 million pairs) and Stable Diffusion (2.3 billion pairs).
But it contains more parameters than either of them (DALL-E 2 has 3.5 billion parameters and Stable Diffusion has 890 million). The report says the main difference between ERNIE-ViLG and Western models is that Baidu’s model understands prompts written in Chinese and is less likely to make mistakes with culturally specific words. For example, after the model was released in August, a Chinese video creator compared the results of different models for prompts that included Chinese historical figures, pop culture celebrities, and food.
He found that ERNIE-ViLG produced more accurate images than DALL-E 2 or Stable Diffusion. After its release, ERNIE-ViLG was also embraced by members of the Japanese anime community. They found that the AI could generate more satisfying anime art than the other models, probably because it included more anime in its training data. But unlike the makers of DALL-E 2 (developed by OpenAI) or Stable Diffusion, Baidu has not published an explanation of ERNIE-ViLG’s content moderation policy, and it declined to comment for this story.
When the ERNIE-ViLG demo was first posted on Hugging Face, an international AI community, users who entered certain words received the message “Sensitive words found. Please re-enter”, a surprisingly honest admission that a filtering mechanism was at work. However, since at least September 12, the message has read: “The content entered does not meet the relevant rules. Please try again after adjusting it”. The tests revealed that several Chinese words were blocked, including the names of prominent Chinese political leaders such as Xi Jinping and Mao Zedong.
Other terms that could be considered politically sensitive were also blocked, including “revolution”, “climbing the walls” (a metaphor for using a VPN service in China), and the name of Baidu’s founder and CEO, Yanhong (Robin) Li. While words like “democracy” and “government” are allowed on their own, prompts that combine them with other words, like “democracy Middle East” or “UK government”, are blocked. Beijing’s Tiananmen Square cannot be generated with ERNIE-ViLG either, possibly because of its association with the Tiananmen Massacre, a topic heavily censored in China.
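Baidu has not published how its filter works, so the following is only a minimal sketch, in Python, of one rule set that would reproduce the behavior the testers reported: some terms are rejected wherever they appear, while others pass alone but are rejected once combined with other words. The names `BLOCKED_TERMS`, `STANDALONE_ONLY_TERMS`, and `check_prompt` are hypothetical, the term lists contain only the examples cited in the tests, and English is used for readability even though the tests were run with Chinese prompts.

```python
from typing import Optional

# Hypothetical rule lists built only from the examples reported in the tests.
# Baidu's real lists and matching logic are not public.
BLOCKED_TERMS = {"xi jinping", "mao zedong", "revolution", "tiananmen square"}
STANDALONE_ONLY_TERMS = {"democracy", "government"}

# Wording of the notice shown by the demo since at least September 12.
REJECTION_MESSAGE = (
    "The content entered does not meet the relevant rules. "
    "Please try again after adjusting it"
)


def check_prompt(prompt: str) -> Optional[str]:
    """Return the rejection notice, or None if the prompt passes this toy filter."""
    # Lowercase and collapse whitespace so matching is case-insensitive.
    normalized = " ".join(prompt.lower().split())

    # Rule 1: reject any prompt that contains an always-blocked term as a substring.
    if any(term in normalized for term in BLOCKED_TERMS):
        return REJECTION_MESSAGE

    # Rule 2: some words pass alone but are rejected when other words accompany them.
    words = normalized.split()
    if len(words) > 1 and any(word in STANDALONE_ONLY_TERMS for word in words):
        return REJECTION_MESSAGE

    return None


if __name__ == "__main__":
    for example in ["democracy", "democracy Middle East", "UK government", "a bowl of noodles"]:
        print(repr(example), "->", "blocked" if check_prompt(example) else "allowed")
```

Under these assumed rules, “democracy” alone is allowed while “democracy Middle East” and “UK government” are blocked, which matches the reported results. Other rule sets could produce the same outcomes.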
China is not alone in restricting image synthesis, even if, until now, restrictions elsewhere have taken a different form from state censorship. In the case of DALL-E 2, OpenAI’s content policy restricts certain forms of content such as nudity, violence, and political content. But this is a voluntary choice by OpenAI, not the result of government pressure. Midjourney also voluntarily filters certain content by keyword. Stable Diffusion, from London-based Stability AI, comes with a built-in “safety filter”, which can be disabled because the model is open source.
So almost anything is possible with this model, depending on where you run it. The head of Stability AI, Emad Mostaque, has said that he wants to avoid censorship of image synthesis models by governments or companies. “I think people should be free to do what they think is best to create these models and services,” he wrote in a response to an AMA on Reddit last week. It’s unclear whether Baidu is censoring its ERNIE-ViLG model voluntarily to avoid potential issues with the Chinese government or in response to potential regulation.
In the latter case, ERNIE-ViLG’s built-in censorship could be a response to a government regulation on deepfakes proposed at the beginning of the year. In January, the Chinese government proposed new regulations prohibiting any form of AI-generated content that “endangers national security and social stability”, which would cover AIs like ERNIE-ViLG. Critics say it would help if Baidu published a document explaining its moderation decisions.
Despite the built-in censorship, Baidu’s ERNIE-ViLG model is expected to remain an important player in the development of large-scale image synthesis systems. The emergence of AI models trained on specific linguistic datasets compensates for some of the limitations of common English-based models. It will especially help users who need an AI that understands the Chinese language and can generate accurate images accordingly.
Just as Chinese social media platforms have thrived in the face of harsh censorship, ERNIE-ViLG and other Chinese AI models may end up having the same experience: they are too useful to be abandoned. In today’s China, social media companies usually have proprietary lists of sensitive words, based on government instructions and their own operational decisions. This means that the filter used by ERNIE-ViLG may differ from that used by WeChat, owned by Tencent, or Weibo, which is operated by Sina Corporation.
Sources: ERNIE-ViLG (1, 2), the Wenxin project