On Nov. 15, Meta AI and Papers with Code, an autonomous team within Meta AI Research, presented Galactica, a 120-billion-parameter open-source language model trained on a large scientific corpus that can store, combine, and reason about scientific knowledge. The goal is to help researchers find useful information amid the mass of what is available. The announcement has already sparked controversy within the scientific community.
Galactica was trained on a corpus of over 360 million contextual citations and over 50 million unique references normalized across a diverse set of sources, enabling it to suggest citations and help discover related articles. Among these sources is NatureBook, a new high-quality scientific dataset that allowed it to be trained on scientific terminology, mathematical and chemical formulas, and source code.
Managing the plethora of scientific information
Information overload is a major obstacle to scientific progress. Buried under a mass of articles, researchers have trouble finding the information useful to their research.
Galactica is a large language model (LLM) trained on over 48 million articles, textbooks, reference documents, compounds, proteins, and other sources of scientific knowledge. Academic researchers can use it to explore the literature, ask scientific questions, write scientific code, and more.
The dataset was created by tokenizing information from various scientific sources. For the interface, the team used task-specific tokens to support different types of knowledge. Citations are wrapped in a special token, which lets a researcher have the model predict a citation from any input context.
Step-by-step reasoning is likewise wrapped in a special token, which mimics an internal working memory.
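To make the idea of task-specific tokens more concrete, here is a minimal Python sketch of how citations and step-by-step reasoning could be wrapped in marker strings. It is an illustration only, not the team's implementation: the article does not name the actual tokens, so the marker strings and the helper functions (`wrap_citation`, `wrap_reasoning`, `citation_prompt`, `extract_citations`) are hypothetical placeholders.

```python
import re
from typing import List

# Hypothetical marker strings, chosen for illustration only;
# the article does not name the tokens Galactica actually uses.
CITE_OPEN, CITE_CLOSE = "[START_REF]", "[END_REF]"
WORK_OPEN, WORK_CLOSE = "<work>", "</work>"


def wrap_citation(reference: str) -> str:
    """Surround a reference with the citation markers."""
    return f"{CITE_OPEN}{reference}{CITE_CLOSE}"


def wrap_reasoning(steps: List[str]) -> str:
    """Surround intermediate reasoning steps with the working-memory markers."""
    return WORK_OPEN + "\n" + "\n".join(steps) + "\n" + WORK_CLOSE


def citation_prompt(context: str) -> str:
    """End a prompt with the opening citation marker, so that a model
    trained on wrapped citations would continue with a reference."""
    return context.rstrip() + " " + CITE_OPEN


def extract_citations(text: str) -> List[str]:
    """Return every reference found between citation markers."""
    pattern = re.escape(CITE_OPEN) + r"(.*?)" + re.escape(CITE_CLOSE)
    return re.findall(pattern, text, flags=re.DOTALL)
```

Under these assumptions, `citation_prompt("Transformers rely on attention")` yields a prompt ending in the opening marker, and whatever the model generates up to the closing marker would be read as the predicted citation.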
Galactica achieved strong results on benchmarks across several scientific fields.
In tests of technical knowledge such as LaTeX equations, Galactica outperformed the latest GPT-3, scoring 68.2% versus 49.0%. It also demonstrated strong performance in reasoning, outperforming Chinchilla on mathematical MMLU with a score of 41.3% vs. 35.7%, and PaLM 540B on MATH with 20.4% vs. 8.8%.
It also sets a new state of the art on downstream tasks such as PubMedQA and MedMCQA, with scores of 77.6% and 52.9% respectively. And although it was not trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench.
For the researchers, these results demonstrate the potential of language models as a new interface for science. They released the model as open source for the benefit of the scientific community.
The Galactica site cautions that language models offer no guarantee of truthful or reliable output, and that their advice should be verified before it is followed: “Some of the text generated by Galactica may seem very authentic and very confident, but may be subtly fake in many ways. This is especially the case for highly technical content.”
Galactica should be seen as a writing aid, as Yann LeCun noted on Twitter:
“This tool is to paper writing what driving assistance is to driving. It won’t automatically write articles for you, but it will significantly reduce your cognitive load while you write them.”
However, AI scientist Gary Marcus and Michael Black, a director at the Max Planck Institute for Intelligent Systems, reacted on Twitter, warning that false information generated by Galactica could find its way into scientific submissions and mislead readers.
Meta AI and Papers with Code have yet to comment, but they have disabled the Galactica site’s demo feature.
Sources of the article:
“Galactica: A Large Language Model for Science”
arXiv:2211.09085v1, November 16, 2022
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic.