Creating new biology using AI

Protein design tools use the language of amino acids to train LLMs
25 July 2023

Interview with Ali Madani, Profluent


And now, the dizzying possibilities were we to get that right. How’s this for a radical idea: training language models on the vocabulary of genetics to understand the structures of the proteins that make our cellular clockwork tick. Let me explain.

Proteins are polymers assembled from chemical building blocks called amino acids. Different amino acids with different chemical characteristics produce proteins with different shapes and functions, from enzymes that digest your dinner to the muscle fibres that let you lift weights or run a marathon.

But when it comes to designing proteins from scratch, for instance to make a new drug like an antibody, or proteins that can be used to make packing cases or even pesticides, working out which amino acids to include, and in what order, to get the structure and function we want has been an impossibly big problem.

But now step forward Ali Madani, CEO of Profluent, a company bringing AI to bear on the problem. He outlined the vision to James Tytko.

Ali - The space of available proteins that we could sample is exponentially and mind-bogglingly large. An average protein is made of amino acids strung together. These are building blocks, like Lego bricks, that form a sequence. An average-length protein will have on the order of 300 to 400 of these building blocks, and for each one of these positions there are 20 different design options. Just to put that into perspective: if you were to take the total number of grains of sand on Earth and the total number of humans that have ever lived throughout human history, and multiply that by the total number of atoms in the universe, it would still pale in comparison to the combinatorial space of possible proteins. What we have essentially done for the variety of problems in front of us, whether for therapeutics, diagnostics, or industrial applications, is rely on finding needles in the haystack of nature: finding machines that have already evolved in nature and repurposing them, copying and pasting them, effectively, for problems we have in human health or otherwise. The promise of machine learning here is that we as humans can actually take control; design novel proteins from the bottom up rather than relying on searching within this massive haystack, and really build the solutions for the most pressing problems we face on the planet, whether that's human health or issues of sustainability and the environment.
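To give Ali's comparison a rough numerical footing, here is a small back-of-the-envelope sketch in Python. The figures for sand grains, humans ever born, and atoms in the universe are coarse public estimates assumed for illustration; only the orders of magnitude matter.

```python
# Back-of-the-envelope check of the scale Ali describes. The specific
# estimates below (grains of sand, humans ever born, atoms in the
# observable universe) are rough public figures assumed for
# illustration, not numbers from the interview.
import math

protein_length = 350      # "on the order of 300 to 400" building blocks
design_options = 20       # 20 amino acids per position

log10_sequence_space = protein_length * math.log10(design_options)

grains_of_sand = 7.5e18   # rough estimate for Earth's beaches and deserts
humans_ever = 1.2e11      # roughly 120 billion people ever born
atoms_in_universe = 1e80  # standard order-of-magnitude figure

log10_comparison = math.log10(grains_of_sand * humans_ever * atoms_in_universe)

print(f"possible proteins      ~ 10^{log10_sequence_space:.0f}")  # ~10^455
print(f"sand x humans x atoms  ~ 10^{log10_comparison:.0f}")      # ~10^110
```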

James - What is it that AI can offer to help achieve this? What does the technology you hope to develop have in common with the AI systems people are more familiar with: the chatbots? Is it that the language of proteins, of biology, really resembles our human languages in such a way that the technology we're more familiar with can be useful?

Ali - That's a fantastic question. What's really amazing here is that there's almost a unification of the techniques we've been developing from a sequence modelling perspective: they can be applied to many different domains, whether that's natural languages like English, programming languages like Python and C++, or the language of biology, proteins and DNA as well. The fundamental premises that have enabled this are, from a modelling perspective, advances in model architectures and attention mechanisms, and also the availability of data. And I'd really stress the latter: having a rich information source that these flexible machine learning models can use to uncover patterns that exist within the data and really learn underlying principles, whether in natural language, where that corresponds to grammatical structure and semantics, or within biology and proteins, where it corresponds to biophysical principles such as structural elements, binding sites, or other principles from a physics perspective. What's really powerful is that they can ingest large amounts of data and uncover those principles in a data-driven way.
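As a loose illustration of the parallel Ali draws between natural language and the language of proteins, the toy sketch below treats the 20 amino acids as a vocabulary, learns which residue tends to follow which, and then "writes" a new sequence one token at a time. The model and the example sequences are invented stand-ins; a real protein language model would be a large transformer trained on millions of sequences, not a bigram counter.

```python
# Toy illustration: treating protein sequences as a "language".
# A real protein language model would be a transformer trained on
# millions of sequences; this bigram model is only a stand-in to show
# the framing (amino acids as tokens, next-token prediction).
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical training sequences; real corpora hold many millions.
training_sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKVLAAGIAKQRQISFVKNHFSRQLEERLGL",
    "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
]

# Count how often each residue follows each other residue.
counts = defaultdict(lambda: defaultdict(int))
for seq in training_sequences:
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1

def sample_next(prev: str) -> str:
    """Sample the next residue given the previous one (next-token prediction)."""
    following = counts.get(prev)
    if not following:
        return random.choice(AMINO_ACIDS)
    residues, weights = zip(*following.items())
    return random.choices(residues, weights=weights)[0]

def generate(length: int = 30, start: str = "M") -> str:
    """Generate a new sequence one residue at a time, like a language model decoding."""
    seq = start
    while len(seq) < length:
        seq += sample_next(seq[-1])
    return seq

print(generate())
```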

James - Throughout the programme, we've been hearing how the source of many prominent AI models' power is also their greatest weakness: while the huge amounts of data they're trained on allow them to come out with, for example, human-like utterances, it also means that the dangerous biases and misinformation you find all over the internet slip through the net. So when you are creating a protein design tool, how do you make sure you filter the training data so that nothing dodgy makes it into your output?

Ali - I think curation and alignment is a central problem that many of us are facing across different data domains, and it applies to protein design as well. We've had millions of years of evolution, and so many different pockets of protein space have evolved over time for varying functions, some of which may be completely unrelated to the problem you have in mind. And there can be lots of noise in the data as well. If you think about it from the perspective of what we aggregate, all of the available biological data that the world's researchers have collected on proteins, there's tons of noise within that too. Being able to curate this effectively, to align the dataset to the functional prediction or functional generation task one has in mind, is a challenging problem and something we think very deeply about at Profluent, and within the academic community as well.
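To make the curation problem a little more concrete, here is a minimal sketch of what a crude first-pass filter over raw protein sequences might look like. The length cut-offs, the rejection of non-standard residues, and the exact-duplicate rule are illustrative assumptions, not a description of Profluent's pipeline.

```python
# Illustrative sketch of a crude curation pass over raw protein sequences.
# Thresholds and rules here are assumptions for illustration only.
from typing import Iterable

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def curate(sequences: Iterable[str],
           min_len: int = 50,
           max_len: int = 2000) -> list[str]:
    """Keep sequences that are plausibly clean: valid residues only,
    reasonable length, and no exact duplicates."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if not (min_len <= len(seq) <= max_len):
            continue                  # drop fragments and extreme outliers
        if set(seq) - VALID_RESIDUES:
            continue                  # drop sequences with ambiguous residues
        if seq in seen:
            continue                  # drop exact duplicates
        seen.add(seq)
        kept.append(seq)
    return kept
```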

James - As exciting as this all is, it sounds very, very complicated. What are the major bottlenecks as you see them?

Ali - Going along the same lines: alignment of datasets for given functional prediction tasks. We may have not just sequence information, but also structural information, and then information gathered from wet laboratory experiments. How to incorporate and essentially unify these different modes of data is one challenge that comes to mind. Another challenge really comes down, again, to the wet lab. Similar to what we've seen in natural language processing, where human feedback is used, the question is how to have a tight coupling between the modelling effort we do on publicly available data sources and the work we do in the wet lab for the particular problem we're trying to solve, and how to use that wet lab data effectively. That's going to be another challenge and limitation of the techniques.
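One way to picture the coupling between modelling and the wet lab that Ali describes is as a design-test-learn loop. The sketch below shows only that shape; the `model` and `lab` objects and their `propose`, `assay`, and `finetune` methods are hypothetical placeholders standing in for whatever generation, assay, and fine-tuning machinery a real effort would use.

```python
# Schematic design-test-learn loop: propose candidates with a model,
# measure a few in the wet lab, and feed the measurements back in.
# `model` and `lab` and their methods are hypothetical placeholders;
# this only illustrates the shape of the coupling, not a real system.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    sequence: str                           # designed amino-acid sequence
    predicted_score: float                  # model's estimate for the target task
    measured_score: Optional[float] = None  # filled in after a wet-lab assay

def design_test_learn(model, lab, rounds: int = 3, batch: int = 96) -> Candidate:
    """Generate, measure, fine-tune, repeat; return the best measured design."""
    history: list[Candidate] = []
    for _ in range(rounds):
        candidates = model.propose(n=batch)           # generate candidate sequences
        for c in candidates:
            c.measured_score = lab.assay(c.sequence)  # wet-lab measurement
        history.extend(candidates)
        model.finetune(history)                       # learn from the lab feedback
    return max(history, key=lambda c: c.measured_score)
```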

James - So what are the first proteins we might expect to be worked on using this method?

Ali - One is the area of antibodies. Antibodies are proteins that are very effective binders. For example, with respect to Covid, we have antibodies that bind and neutralise the virus whenever it gets introduced into our body. Being able to effectively generate antibodies and design them for multiple properties, not just binding affinity, but also how the immune response will play out in our bodies; having multiple of these parameters in mind and being able to generate sequences that work well within one or two rounds of design. I think that's going to be one big area that is already being revolutionised.
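To illustrate the multi-property design Ali mentions, the sketch below ranks candidate antibody sequences on more than binding affinity alone. Both scoring functions are toy stand-ins invented for illustration (real affinity and immunogenicity predictors are learned models), and the weighted sum is just one naive way to trade the objectives off.

```python
# Toy multi-objective ranking of candidate antibody sequences.
# The two scoring functions below are invented stand-ins; real affinity
# and immunogenicity predictors would be learned models, not heuristics.

def predicted_binding_affinity(sequence: str) -> float:
    """Toy proxy for a learned affinity predictor (higher is better)."""
    return sequence.count("Y") / max(len(sequence), 1)   # arbitrary placeholder heuristic

def predicted_immunogenicity_risk(sequence: str) -> float:
    """Toy proxy for a learned immune-response risk predictor (lower is better)."""
    return sequence.count("W") / max(len(sequence), 1)   # arbitrary placeholder heuristic

def rank_candidates(sequences: list[str],
                    affinity_weight: float = 1.0,
                    risk_weight: float = 0.5) -> list[str]:
    """Order candidates by a weighted combination of the design objectives."""
    def score(seq: str) -> float:
        return (affinity_weight * predicted_binding_affinity(seq)
                - risk_weight * predicted_immunogenicity_risk(seq))
    return sorted(sequences, key=score, reverse=True)

# Example: rank a handful of made-up candidate sequences.
print(rank_candidates(["QVQLVQSGAEVYY", "EVQLVESGGGLWQ", "QVQLQQSGAELYW"]))
```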
