Mark Zuckerberg, CEO of Meta, announced on November 1 that Meta AI has created a model that predicts protein folding 60 times faster than state-of-the-art models: ESM-2, which has 15 billion parameters. The company published the model along with the “ESM (Evolutionary Scale Modeling) Metagenomic Atlas”, a database of 617 million metagenomic protein structures. By comparison, the database of AlphaFold 2, the DeepMind model that also predicts the three-dimensional structure of a protein from its amino acid sequence, contains “only” 200 million.
Proteins are present in all living cells, where they perform essential functions. The rods and cones in our eyes that detect light and allow us to see, the molecular sensors that underpin hearing and our sense of touch, the complex molecular machines that convert sunlight into chemical energy in plants, the enzymes that break down plastic and the antibodies that protect us against disease are all examples of proteins.
Metagenomic proteins, which are found in microbes living in environments such as the soil, the air, the ocean floor and even our gut, far outnumber those that make up animal and plant life, but they are still poorly understood.
Metagenomics uses genetic sequencing to discover proteins in samples from these complex environments. It has revealed the incredible breadth and diversity of these proteins, uncovering billions of novel protein sequences. These are cataloged in large databases compiled by public initiatives such as the NCBI, the European Bioinformatics Institute and the Joint Genome Institute, which integrate studies from a global community of researchers.
According to Meta, the ESM Metagenomic Atlas is the first to cover metagenomic proteins comprehensively and at scale. The company says these structures provide an unprecedented view of the breadth and diversity of nature, and offer the potential for new scientific insights and for faster protein discovery for practical applications in fields such as medicine, green chemistry, environmental applications and renewable energy.
The creation of this “first overview of the ‘dark matter’ of the protein universe” was made possible by the development of ESM-2, Meta AI's protein folding model.
ESM-2, a 15 billion parameter protein language model
In 2019, Meta published a study demonstrating that language models learn the properties of proteins, such as their structure and function. Using a form of self-supervised learning known as masked language modeling, the researchers trained a language model on the sequences of millions of natural proteins. With this approach, the model must correctly fill in the blanks in a passage of text, such as “To __ or not to __, that is the ____”. In the same way, they trained a language model to fill in the blanks in a protein sequence, like “GL_KKE_AHY_G”, across millions of diverse proteins.
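The fill-in-the-blank setup described above can be illustrated with a minimal Python sketch. It only prepares one masked training example and does not train or call any model. The function names, the toy sequence and the 15% masking rate are assumptions made for illustration, not details taken from Meta's work.

```python
import random

MASK = "_"  # placeholder symbol, matching the blanks in the article's example


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the masked string and a dict mapping each hidden
    position to the residue that was removed (the training targets).
    The 15% default is an assumed, commonly used masking rate.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), targets


def fill_in(masked, targets):
    """Restore the original sequence from the masked string and its targets."""
    restored = list(masked)
    for pos, residue in targets.items():
        restored[pos] = residue
    return "".join(restored)


# Hypothetical amino acid sequence, used only as a toy input.
toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(toy_sequence)
print(masked)   # the sequence with some residues replaced by "_"
print(targets)  # the hidden residues a model would be asked to predict
```

During training, a model would see only the masked string and be scored on how well it predicts the hidden residues. Here `fill_in` simply plays the role of a perfect predictor, to show what a correct answer looks like.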
The following year, the researchers developed ESM-1b, a model with approximately 650 million parameters that is used for a variety of applications, including helping scientists predict the evolution of COVID-19 and discover genetic causes of disease.
The researchers then extended this approach to create a next-generation protein language model, ESM-2, which at 15 billion parameters is the largest protein language model to date. They found that as the model was scaled up from 8 million to 15 billion parameters, information emerged in its internal representations that made it possible to predict 3D structure at atomic resolution.
The ESM-2 neural network helped create the ESM Metagenomic Atlas by predicting 617 million structures from the MGnify90 protein database in just two weeks of operation on a cluster of 2,000 GPUs. Both are expected to accelerate the discovery of new drugs, help fight disease, and develop clean energy.
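To give a sense of scale, the figures quoted above (617 million structures, two weeks, 2,000 GPUs) can be turned into a rough throughput estimate. The short Python calculation below assumes exactly 14 days of continuous operation with every GPU in use the whole time. That is a simplification, so the result is an order of magnitude rather than a measured value.

```python
# Figures quoted in the article
structures = 617_000_000
gpus = 2_000
days = 14  # assumption: "two weeks" taken as exactly 14 full days

# Total wall-clock time and total GPU time, in seconds
wall_seconds = days * 24 * 3600
gpu_seconds = gpus * wall_seconds

# Cluster-wide throughput and average GPU time spent per structure
structures_per_second = structures / wall_seconds
gpu_seconds_per_structure = gpu_seconds / structures

print(f"{structures_per_second:.0f} structures per second across the cluster")
print(f"{gpu_seconds_per_structure:.1f} GPU-seconds per structure on average")
```

Under these assumptions, the cluster predicts roughly 500 structures per second, which corresponds to about 4 GPU-seconds per structure.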