Protein modeling and design
Since AlphaFold2 opened up the research avenue of studying protein structures in 2021, increasing efforts have been made to understand protein interactions. Among them, RoseTTAFold All-Atom and AlphaFold-latest expand beyond protein structure prediction to predicting how proteins interact with other biomolecules such as small molecules, other proteins, and nucleic acids. Another line of work studies the conformational space of proteins: Wayment-Steele et al. discover that clustering the multiple sequence alignment (MSA) by sequence similarity enables AlphaFold2 to sample alternative protein conformations.
Overview of RoseTTAFold All-Atom, which extends the protein structure prediction task to a variety of biomolecular interaction prediction tasks. Figure is taken from the paper (RoseTTAFold All-Atom).
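To make the MSA-clustering idea above concrete, here is a minimal sketch in Python: it one-hot encodes aligned sequences and groups them with DBSCAN so that each cluster becomes a sub-MSA that can be folded separately. The `fold_with_af2` call is a hypothetical placeholder, and the DBSCAN parameters are illustrative rather than the paper's settings.

```python
# Minimal sketch of MSA clustering for conformational sampling, in the
# spirit of Wayment-Steele et al.; parameters and fold_with_af2 are
# hypothetical placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

AA = "ACDEFGHIKLMNPQRSTVWY-"
AA_IDX = {a: i for i, a in enumerate(AA)}

def one_hot(seqs):
    """One-hot encode aligned sequences into a (n_seqs, L*21) matrix."""
    L = len(seqs[0])
    X = np.zeros((len(seqs), L * len(AA)), dtype=np.float32)
    for i, s in enumerate(seqs):
        for j, a in enumerate(s.upper()):
            X[i, j * len(AA) + AA_IDX.get(a, AA_IDX["-"])] = 1.0
    return X

def cluster_msa(seqs, eps=7.0, min_samples=3):
    """Group MSA sequences by similarity; each cluster is a sub-MSA that
    can be fed to AlphaFold2 to sample one conformational state."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(one_hot(seqs))
    clusters = {}
    for seq, lab in zip(seqs, labels):
        if lab != -1:  # skip DBSCAN noise points
            clusters.setdefault(lab, []).append(seq)
    return list(clusters.values())

# for sub_msa in cluster_msa(msa_sequences):
#     structure = fold_with_af2(query_seq, sub_msa)  # hypothetical AF2 call
```

The intuition is that each sub-MSA emphasizes one evolutionary neighborhood, which is what nudges AlphaFold2 toward a particular conformational state.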
Complementary to understanding the structure and function of proteins, protein design focuses on creating new proteins or modifying existing ones to achieve specific structures and functions. In 2023, two major advances were made in improving protein design with AI. Following progress in geometric deep learning and generative AI, especially diffusion models, RFDiffusion and Chroma devise diffusion models that respect the symmetries of three-dimensional space (rotations and translations) to generate new proteins. In addition to de novo design, they also propose techniques for flexible design and optimization of proteins, such as conditioning on a binding target or a functional motif, and optimizing structures or functions with a model that provides heuristics (e.g., gradients).
Overview of the RFDiffusion model, which learns a denoising process from Gaussian noise to protein structures. Figure is taken from the paper (RFDiffusion).
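As a rough illustration of what such a denoising process computes, below is a toy DDPM-style sampling step on C-alpha coordinates. Translation invariance is handled by centering the coordinates; rotation equivariance is assumed to come from the (hypothetical) `denoiser` network itself, as in the SE(3)-equivariant architectures these models build on. This is a generic diffusion sketch, not RFDiffusion's actual update rule.

```python
# Toy sketch of one ancestral-sampling step of a denoising diffusion
# model over protein C-alpha coordinates. `denoiser` is a hypothetical
# noise-prediction network.
import torch

def diffusion_step(x_t, t, denoiser, beta):
    """x_t: (N, 3) noisy C-alpha coordinates at timestep t.
    beta: 1-D tensor of per-step noise variances."""
    x_t = x_t - x_t.mean(dim=0, keepdim=True)       # remove global translation
    eps_hat = denoiser(x_t, t)                      # predicted noise, (N, 3)
    alpha = 1.0 - beta[t]
    alpha_bar = torch.prod(1.0 - beta[: t + 1])     # cumulative signal level
    mean = (x_t - beta[t] / torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(beta[t]) * noise       # sample x_{t-1}
```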
Foundation models for biology
The “foundation model” paradigm has demonstrated its effectiveness in natural language processing and vision. The natural question to ask is whether this paradigm can be realized in biology as well. Indeed, in the past, most machine learning methods in biology have been task-specific, with models trained from scratch. Is there foundational, transferable information that generalizes across tasks? How do we develop models to capture it?
In the past year, there have been many efforts to build foundation models for various biological modalities. In 2023, we saw updates to protein language models and their numerous applications, including protein folding and cross-species cell embedding (UCE), just to name a few. Beyond proteins, we see similar self-supervised learning ideas applied to DNA (1, 2), where they have proven effective on standard DNA sequence modeling benchmarks. A special challenge of DNA is its massive sequence length, which easily exceeds the context capacity of the transformer model. HyenaDNA inherits the Hyena framework and extends it to DNA sequences, yielding long-context foundation models. In addition to DNA, RNA foundation models have also emerged: the idea of ATOM-1 is to train a seq2seq model on chemical mapping data of RNA sequences, and it has demonstrated initial utility in structure prediction.
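One reason single-nucleotide modeling is feasible at long contexts is that the tokenization is trivially simple. Below is a minimal sketch, assuming a hypothetical long-context `model`, of character-level DNA tokenization and a next-token language-modeling loss in the spirit of HyenaDNA.

```python
# Minimal sketch of single-nucleotide tokenization plus an autoregressive
# language-modeling objective on DNA. `model` is a hypothetical
# long-context sequence model returning per-position logits.
import torch
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(dna: str) -> torch.Tensor:
    """One token per nucleotide keeps base-pair resolution even at
    contexts of hundreds of thousands of tokens."""
    return torch.tensor([VOCAB.get(b, VOCAB["N"]) for b in dna.upper()])

def lm_loss(model, dna: str):
    """Predict each nucleotide from its prefix (next-token objective)."""
    ids = tokenize(dna).unsqueeze(0)           # (1, L)
    logits = model(ids[:, :-1])                # (1, L-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
```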
Moving to cell biology, in 2023 we saw a large number of single-cell foundation models. Cui et al., Hao et al., and Theodoris et al. extend LLM pre-training objectives to single-cell gene expression data. These models are benchmarked on standard single-cell analysis tasks such as batch effect correction and integration. Biology-inspired pre-training objectives have shown a much stronger ability to capture biological signals, and Rosen et al. show zero-shot performance on par with fine-tuned models on integration tasks.
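To give a flavor of how an LLM-style objective reaches gene expression data, here is a hedged sketch of turning one cell's expression profile into a token sequence, loosely in the spirit of scGPT's value binning and Geneformer's rank ordering. The bin count, gene cap, and the `adata` usage line are illustrative assumptions, not the papers' exact recipes.

```python
# Sketch: discretize a single-cell expression vector into (gene, bin)
# token pairs that a transformer can consume. All parameters are
# illustrative.
import numpy as np

def cell_to_tokens(gene_names, counts, n_bins=51, max_genes=2048):
    """Rank genes by expression, then quantile-bin the values."""
    counts = np.log1p(np.asarray(counts, dtype=np.float64))
    order = np.argsort(-counts)[:max_genes]       # highest expression first
    nonzero = counts[counts > 0]
    edges = (np.unique(np.quantile(nonzero, np.linspace(0, 1, n_bins)))
             if nonzero.size else np.array([0.0]))
    bins = np.digitize(counts[order], edges)
    return [(gene_names[i], int(b)) for i, b in zip(order, bins)]

# tokens = cell_to_tokens(adata.var_names, adata.X[cell_idx])  # hypothetical
```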
Lastly, foundation models in biology have predominantly been unimodal (focused on proteins, molecules, diseases, etc.), primarily due to the scarcity of paired data. Bridging modalities to answer multi-modal queries is an exciting frontier. BioBridge leverages biological knowledge graphs to learn transformations across unimodal foundation models, enabling multi-modal behaviors.
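A minimal sketch of the bridging idea follows, assuming frozen unimodal encoders whose output embeddings `h_head` and `h_tail` are precomputed: a small relation-conditioned projection is trained on knowledge-graph triples with an in-batch contrastive loss. The module shapes and the InfoNCE choice are illustrative, not BioBridge's exact design.

```python
# Sketch of learning a bridge between two frozen unimodal embedding
# spaces (e.g., protein -> disease), supervised by KG triples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bridge(nn.Module):
    def __init__(self, d_src, d_tgt, n_relations):
        super().__init__()
        self.rel = nn.Embedding(n_relations, d_src)   # relation conditioning
        self.proj = nn.Sequential(nn.Linear(d_src, d_src), nn.ReLU(),
                                  nn.Linear(d_src, d_tgt))

    def forward(self, h_src, rel_ids):
        return self.proj(h_src + self.rel(rel_ids))

def triple_loss(bridge, h_head, rel_ids, h_tail):
    """Pull the bridged head toward its true tail, contrasting against
    the other tails in the batch (InfoNCE)."""
    pred = F.normalize(bridge(h_head, rel_ids), dim=-1)
    tgt = F.normalize(h_tail, dim=-1)
    logits = pred @ tgt.T / 0.07                  # temperature is illustrative
    labels = torch.arange(len(pred))
    return F.cross_entropy(logits, labels)
```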
While there is excitement around these models, we have not yet observed a foundation model that consistently outperforms task-specific models across tasks.
Illustration of a foundation model for single-cell genomics and how it enables various downstream tasks. Figure is taken from scGPT.
UCE (Rosen et al.) is a single-cell foundation model with zero-shot ability. Figure is taken from the paper (UCE).
LLMs for biology
Biology, in its natural format, consists of biochemical entities such as DNA, RNA, and proteins, which have different inductive biases than human language. However, the exploration and understanding of biology heavily rely on human language: scientists write, communicate, and come up with ideas described in human language, and experiments are conducted following text descriptions. In 2023, with the advent of powerful LLMs, we have seen the potential for LLMs to revolutionize biological science by introducing novel ways to study biology and even discover new biology.
The Microsoft Research AI4Science team conducted a comprehensive study of LLMs in biology. It first benchmarked LLMs across numerous biological tasks such as sequence annotation, identifying functional domains in proteins, analyzing signaling pathways, and designing sequences, although the results remain suboptimal compared with task-specific models. They also show LLMs’ ability to help with experiment design by processing data and producing code for liquid-handling robots.
Another example of accessing an LLM’s internal knowledge is MedPaLM-2, which utilizes the knowledge the model contains about each gene to prioritize genes. Another use case of MedPaLM-2, in clinical medicine, is detailed in the Medicine section.
Another exciting thread is the LLM agent framework for autonomous discovery. WikiCrow from Future House is the first of its kind to go through the vast public literature and synthesize cited, Wikipedia-style summaries, generating draft articles for the 15,616 human protein-coding genes that currently lack Wikipedia articles. LLM agents can serve as powerful information extractors from sources at a scale that surpasses human manual curation. We expect to see more of agents’ abilities in biology: as research assistants, as reasoning machines, and more.
WikiCrow LLM agent framework for question answering in biology. Figure is taken from the introduction page.
Graph AI for biology
Biology is an interconnected, multi-scale, and multi-modal system. Effective modeling of this system can not only help answer fundamental biological questions but also significantly impact therapeutic discovery. The most natural data format for encapsulating this system is a relational database or a heterogeneous graph. Such a graph stores data from decades of wet-lab experiments across various biological modalities, scaling up to billions of data points.
In 2023, we witnessed a range of innovative applications using GNNs on these biological system graphs. These applications have unlocked new biomedical capabilities and answered critical biological queries.
One particularly exciting field is perturbative biology. Understanding the outcomes of perturbations can lead to advancements in cell reprogramming, target discovery, and synthetic lethality, among others. In 2023, GEARS applied a GNN to gene perturbation relational graphs, predicting outcomes of genetic perturbations that have never been observed before.
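A minimal sketch of this setup, assuming `torch_geometric` and illustrative layer sizes, is below: a perturbation flag is injected into learned gene embeddings, propagated over a gene-gene relational graph, and read out as a per-gene expression change. This is a simplified stand-in for GEARS, not its actual architecture.

```python
# Sketch: propagate a perturbation signal over a gene-gene graph with a
# GNN and predict per-gene expression changes.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class PerturbGNN(nn.Module):
    def __init__(self, n_genes, hidden=64):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, hidden)   # learned gene identity
        self.pert_emb = nn.Embedding(2, hidden)         # perturbed vs. not
        self.conv1 = GCNConv(hidden, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)             # delta expression

    def forward(self, edge_index, pert_mask):
        # pert_mask: (n_genes,) boolean flag marking the perturbed gene(s).
        x = self.gene_emb.weight + self.pert_emb(pert_mask.long())
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        return self.readout(x).squeeze(-1)  # predicted change per gene
```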
Another notable application is contextualizing protein representations. While current protein representations are fixed and static, we recognize that the same protein can exhibit different functions in varying cellular contexts. PINNACLE uses a GNN on protein interaction networks to contextualize protein embeddings. This approach has been shown to enhance 3D structure-based protein representations and outperform existing context-free models in identifying therapeutic targets.
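As a hedged illustration of context dependence, the sketch below runs a shared GNN over the protein-interaction subgraph active in each cell-type context, so the same protein receives a different embedding per context. The `torch_geometric` layers and all sizes are illustrative assumptions, not PINNACLE's actual model.

```python
# Sketch: context-specific protein embeddings from a shared encoder
# applied to context-restricted PPI subgraphs.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv
from torch_geometric.utils import subgraph

class ContextualEncoder(nn.Module):
    def __init__(self, n_proteins, n_contexts, dim=64):
        super().__init__()
        self.protein = nn.Embedding(n_proteins, dim)
        self.context = nn.Embedding(n_contexts, dim)   # cell-type token
        self.conv = SAGEConv(dim, dim)

    def forward(self, edge_index, context_nodes, context_id):
        # Restrict the global PPI network to proteins active in this context.
        sub_edges, _ = subgraph(context_nodes, edge_index, relabel_nodes=False)
        x = self.protein.weight + self.context(context_id)  # broadcast context
        h = torch.relu(self.conv(x, sub_edges))
        return h[context_nodes]  # context-specific protein embeddings
```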
Moving beyond predictions, understanding the underlying mechanisms of biological phenomena is crucial. Graph XAI applied to system graphs is a natural fit for identifying mechanistic pathways. TxGNN, for example, grounds drug-disease relation predictions in the biological system graph, generating multi-hop interpretable paths that rationalize the potential of a drug in treating a specific disease. TxGNN designed visualizations for these interpretations and conducted user studies, demonstrating their usefulness for decision-making by clinicians and biomedical scientists.
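To show what a multi-hop interpretable path looks like mechanically, here is a small illustration using `networkx` to enumerate drug-to-disease paths in a toy knowledge graph. The node and relation names are made up for the example, and TxGNN's actual explanations come from a learned model rather than exhaustive path enumeration.

```python
# Toy illustration: enumerate short drug -> ... -> disease paths in a
# biomedical knowledge graph as candidate mechanistic rationales.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("drug:metformin", "protein:PRKAA1", {"rel": "targets"}),
    ("protein:PRKAA1", "pathway:AMPK", {"rel": "member_of"}),
    ("pathway:AMPK", "disease:T2D", {"rel": "implicated_in"}),
])

def explain(graph, drug, disease, max_hops=3):
    """Yield each short path as a list of (node, outgoing relation) pairs."""
    for path in nx.all_simple_paths(graph, drug, disease, cutoff=max_hops):
        rels = [graph.edges[u, v]["rel"] for u, v in zip(path, path[1:])]
        yield list(zip(path, rels + [None]))

for p in explain(G, "drug:metformin", "disease:T2D"):
    print(p)
```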
GEARS is a graph neural network model for predicting transcriptional outcomes of perturbations. Figure is taken from the paper (GEARS).
Computer and Mathematical Science