Publications

Data-optimal scaling of paired antibody language models

bioRxiv

This study explored how to build better computer models that learn from antibody sequence data, which could help in understanding and engineering antibodies for research and medicine. The authors focused on "paired" antibody sequences, in which an antibody's two chains (heavy and light) are recorded together, because these provide richer biological information than single-chain sequences. They trained a series of machine learning models of different sizes on datasets containing varying amounts of paired antibody data to see how performance changed with more data and larger model size.

From these experiments, they derived a specific relationship between the amount of training data and the size of the model that yields the best performance in this setting. Using this relationship, they estimated that optimally training a 650-million-parameter antibody model would require around 5.5 million paired antibody sequences; they note that this figure applies to their specific setting rather than serving as a general benchmark for all high-performing models. They also tested these models on tasks such as classifying antibody types and found that larger models generally performed better, but only when enough data were available.
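Relationships of this kind are typically obtained by fitting a power law between model size and the data-optimal training-set size. As a rough illustration only (the data points below, apart from the 650-million-parameter estimate mentioned above, and the resulting coefficients are hypothetical placeholders, not values from the study), such a fit can be sketched as a linear regression in log-log space:

```python
import numpy as np

# Hypothetical (parameter count, data-optimal paired sequences) pairs.
# Only the last point reflects the estimate quoted in the summary;
# the others are illustrative placeholders.
params = np.array([8e6, 35e6, 150e6, 650e6])
seqs = np.array([1.2e5, 4.5e5, 1.6e6, 5.5e6])

# Fit log(D) = b * log(N) + log(a), i.e. D = a * N**b.
b, log_a = np.polyfit(np.log(params), np.log(seqs), 1)
a = np.exp(log_a)

def data_optimal(n_params):
    """Predicted data-optimal number of paired sequences
    for a model with n_params parameters (illustrative fit)."""
    return a * n_params ** b
```

Once fitted, the exponent `b` summarizes how quickly data requirements grow with model size; an exponent near 1 would mean data needs scale roughly in proportion to parameter count.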

The findings suggest that in fields like antibody science, where training data are limited compared with natural-language text, improving model accuracy depends on scaling the amount of data and the model capacity together. This work provides practical guidance for the future development of antibody-specific machine learning models that could be used to study immune responses or design therapeutic antibodies.

SANTHE is an Africa Health Research Institute (AHRI) flagship programme funded by the Science for Africa Foundation through the DELTAS Africa programme; the Gates Foundation; Gilead Sciences Inc.; and the Ragon Institute of Mass General, MIT, and Harvard.