Probabilistic Machine Learning in Bioinformatics

Wouter Boomsma,
University of Copenhagen (DIKU)
Center for Basic Machine Learning Research in Life Science (MLLS)

ProbAI, June 2024

Probabilistic ML in Bioinformatics

[Overview figure: application areas of bioinformatics]
  • Sequence analysis: DNA sequencing, sequence assembly, genome annotation, computational evolutionary biology, comparative genomics, pan-genomics, genetics of disease, analysis of mutations in cancer
  • Gene and protein expression: analysis of gene expression, analysis of protein expression, analysis of regulation
  • Analysis of cellular organization: microscopy image analysis, protein localization, nuclear organization of chromatin
  • Structural bioinformatics: amino acid sequence, homology
  • Network and systems biology: molecular interaction networks
  • Biodiversity informatics
  • Others: literature analysis, high-throughput image analysis, high-throughput single-cell data, ontologies and data integration

Probabilistic ML in Bioinformatics

Probabilistic ML techniques applied across the application areas above:
  • Variational inference
  • Deep generative models
  • Diffusion models
  • Monte Carlo methods
  • Probabilistic circuits
  • Gaussian processes
  • Causal inference


Bio-tidbit 1: Proteins are chains of amino acids

Source: https://biopharmaspec.com/

We represent the 20 amino acids with letters.

LPICPGGAARCQVTLRDLFDRAVVLSHY...

Bio-tidbit 2: sequence encodes structure

CPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN \(\rightarrow\) [folded 3D structure]

Bio-tidbit 3: related through evolution

Source: PDB: Molecule of the month - Globin evolution

Part 1

Probabilistic modeling of sequences of amino acids

Modelling protein sequences

How do we model the distribution of amino acid sequences observed in nature? \[P(\boldsymbol A) = P(A_1, A_2, A_3, \ldots, A_L)\]
LPICPGGAARCQVTLRDLFDRAVVLSHYIHNLSSEMFSEFDKRYTHGRGFITKAINSCHTSSLATPEDKEQAQQMNQKDFLSLIVSILRSWNEPLYHLVTEVRGMQEAPEAILSKAVEIEEQTKRLLEGMELIVSQVHPETKENEIYPVWSGLPSLQMADEESRLSAYYNLLHCLRRDSHKIDNYLKLLKCRIIHNNNC

Trick: aligning sequences

Source: https://tcoffee.org

Aligned sequences: factorising by position

Source: https://tcoffee.org

Simple model: assume all positions are independent

\[\begin{align*}P(\boldsymbol A) &= P(A_1, A_2, \ldots, A_L)\\ &\approx P(A_1)P(A_2) \ldots P(A_L)\end{align*}\]

Estimation: simple counting
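As a concrete illustration, here is a minimal sketch of that counting estimator in Python. The toy alignment and the pseudocount value are placeholders for illustration, not data from the lecture.

```python
from collections import Counter
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment, pseudocount=1.0):
    """Per-column amino acid distributions P(A_i), estimated by counting."""
    length = len(alignment[0])
    profiles = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = len(alignment) + pseudocount * len(ALPHABET)
        profiles.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profiles

def log_likelihood(seq, profiles):
    """log P(A) under the independent-sites model."""
    return sum(math.log(profiles[i][a]) for i, a in enumerate(seq))

# Toy aligned sequences (hypothetical, no gaps, for illustration only)
alignment = ["CPSIVARS", "CPAIVAKS", "CPSLVARS"]
profiles = column_frequencies(alignment)
print(log_likelihood("CPSIVARS", profiles))
```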

Are correlations between sites important?

Locally? Yes: correlations between neighboring amino acids encode secondary structure.
Globally? Yes: residues that are in contact in the folded structure co-evolve, producing correlations between positions that are far apart in the sequence.
Source: Marks, Hopf, Sanders, Nature biotechnology, 2012

A better probabilistic model

Insist on getting the column-wise frequencies AND the pairwise frequencies right \[\begin{align*}P(\boldsymbol A) &= P(A_1, A_2, \ldots, A_L)\\ &=\frac{1}{Z} \exp \left ( \sum_i^L \phi_i(A_i) + \sum_{i \lt j}^L \psi_{i,j}(A_i, A_j)\right)\end{align*}\] This is an energy-based model.

Problem: Z is intractable:

\(Z = \sum_A \exp \left ( \sum_i^L \phi_i(A_i) + \sum_{i \lt j}^L \psi_{i,j}(A_i, A_j)\right)\)

The parameters can instead be estimated by maximizing a pseudo-likelihood.

Balakrishnan, Kamisetty, Carbonell, Lee, Langmead, Proteins: Structure, Function, and Bioinformatics, 2011
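Below is a sketch of what the pseudo-likelihood objective computes for this pairwise model: each position is scored by its conditional \(P(A_i \mid A_{-i})\), whose normalizer only sums over the 20 amino acids at one site. The parameter arrays and their indexing convention are illustrative assumptions, not the cited paper's implementation.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(ALPHABET)}

def pseudo_log_likelihood(seq, phi, psi):
    """sum_i log P(A_i | A_{-i}) for phi of shape (L, 20), psi of shape (L, L, 20, 20)."""
    idx = np.array([AA_INDEX[a] for a in seq])
    L = len(idx)
    total = 0.0
    for i in range(L):
        # Energy of every possible amino acid at position i,
        # with all other positions fixed to the observed sequence.
        energies = phi[i].copy()
        for j in range(L):
            if j != i:
                energies += psi[i, j, :, idx[j]]
        log_z_i = np.log(np.exp(energies).sum())  # tractable: only 20 terms
        total += energies[idx[i]] - log_z_i
    return total

L = 8
phi = np.zeros((L, 20))                # placeholder single-site fields
psi = np.zeros((L, L, 20, 20))         # placeholder couplings (symmetric in practice)
print(pseudo_log_likelihood("CPSIVARS", phi, psi))
```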

The importance of good modelling

\[\begin{align*}P(\boldsymbol A) =\frac{1}{Z} \exp \left ( \sum_i^L \phi_i(A_i) + \sum_{i \lt j}^L \psi_{i,j}(A_i, A_j)\right)\end{align*}\]
Source: Marks, Hopf, Sanders, Nature biotechnology, 2012

Can we do even better?

Source: Riesselman, Ingraham, Marks, Nature methods, 2018

Capturing correlations with a VAE

Source: Frazer, Notin, ..., Gal, Marks, Nature, 2021

Are likelihoods useful?

What does it mean to be a high probability sequence?

High probability \(\approx\) high evolutionary fitness

Does this mean we see a correlation between likelihood and experimentally measured fitness? Yes!

Note the difference in performance between the three models

Source: Riesselman, Ingraham, Marks, Nature methods, 2018
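The comparison is typically quantified as a rank correlation between model log-likelihoods and assay measurements; a minimal sketch is below, with placeholder numbers standing in for actual model scores and experimental data.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder values: one log-likelihood and one measured fitness per variant
log_likelihoods = np.array([-210.3, -215.8, -208.1, -220.4])
measured_fitness = np.array([0.92, 0.40, 1.05, 0.10])

rho, pvalue = spearmanr(log_likelihoods, measured_fitness)
print(f"Spearman rho = {rho:.2f}")
```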

Are likelihoods useful?

Does this also mean that we can detect disease-related variants using the likelihood?

Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS. Nature. 2021 Nov;599(7883):91-5.

Perspective 1: Hierarchical models (1)

The different positions in a protein are conditionally independent given the latent variable \(Z\).

\[\begin{align*}P(\boldsymbol A) &= \int_Z P(A_1|Z)P(A_2|Z) \ldots P(A_L|Z) P(Z) dZ\end{align*}\]

This means we are asking a single latent variable to capture all relevant covariances. That's asking a lot!
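To make the factorization concrete, here is a minimal sketch of a single-latent-variable decoder and a Monte Carlo estimate of the marginal. The linear decoder and its random weights are placeholders; in an actual VAE they would be learned jointly with an encoder by maximizing the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, D = 8, 20, 4                      # sequence length, alphabet size, latent dim
W = rng.normal(size=(L, K, D)) * 0.1    # placeholder decoder weights
b = np.zeros((L, K))

def decode(z):
    """Per-position categorical probabilities P(A_i | z)."""
    logits = np.einsum("lkd,d->lk", W, z) + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def marginal_likelihood(a_idx, n_samples=1000):
    """Monte Carlo estimate of P(A) = E_{z ~ N(0, I)} [ prod_i P(A_i | z) ]."""
    vals = []
    for _ in range(n_samples):
        z = rng.normal(size=D)
        p = decode(z)
        vals.append(np.prod(p[np.arange(L), a_idx]))
    return np.mean(vals)

print(marginal_likelihood(rng.integers(0, K, size=L)))
```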

Perspective 1: Hierarchical models (2)

Source: Marloes Arts

Hierarchical VAEs: multiple layers of latent variables facilitate modeling of complex correlations.

How deep can we go?

\(\Rightarrow\) Diffusion models
Exciting recent results:
Alamdari, Thakkar, van den Berg, Lu, Fusi, Amini, Yang, Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, 2023

Perspective 2: The Path to AlphaFold

Senior, ..., Hassabis, Improved protein structure prediction using potentials from deep learning, Nature, 2020.

Perspective 2: The Path to AlphaFold

Jumper, ..., Hassabis, Highly accurate protein structure prediction with AlphaFold, Nature, 2021.

Part 2

Optimization of proteins

Can we use prob. ML to improve proteins?

Motivating example: Green Fluorescent Protein (GFP)

Source: Rodriguez, Campbell, ..., Tsien, Trends in biochemical sciences, 2017
Hunt, Scherrer, Ferrari, Matz. PloS one, 2010

Protein engineering: experimental setup

Yang, Wu, Arnold, Nature Methods, 2019

Bayesian Optimization

Bayesian optimization chooses the next experiment by combining a surrogate model's predictions with its uncertainty. We thus need a good way to model distributions over functions.

Source: https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
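As a sketch of how such a surrogate is used, the snippet below scores candidate variants with the expected-improvement acquisition function, given the surrogate's predictive mean and standard deviation. The candidate names and numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    """EI for maximization; mu and sigma come from the surrogate's posterior."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

candidates = ["A42G", "L97F", "S65T"]   # hypothetical variants
mu = np.array([0.8, 1.1, 1.4])          # posterior means (placeholders)
sigma = np.array([0.3, 0.6, 0.2])       # posterior std devs (placeholders)

ei = expected_improvement(mu, sigma, best_so_far=1.2)
print(candidates[int(np.argmax(ei))])   # variant to test next
```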

Gaussian Processes

Source: http://smlbook.org/GP/. Code written by Johan Wågberg, 2019.
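For reference, here is a from-scratch sketch of GP regression with a squared-exponential kernel, along the lines of the 1D illustration referenced above; the training points are arbitrary. In the protein case, the scalar input is replaced by a high-dimensional representation and the kernel by one suited to that representation.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for 1D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and pointwise std of a GP with an RBF kernel."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.sqrt(np.clip(np.diag(cov), 0, None))

X_train = np.array([-2.0, 0.0, 1.5])    # arbitrary observed inputs
y_train = np.sin(X_train)               # arbitrary observed outputs
X_test = np.linspace(-3, 3, 5)
mean, std = gp_posterior(X_train, y_train, X_test)
print(mean, std)
```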

Challenges for the protein case

We are not in 1D.

Depending on the representation, we often have hundreds of dimensions.

Questions:
  • Can we fit surrogate models with reasonable levels of accuracy?
  • Do we get reasonable uncertainties?

Trying on real data: Illustrative case

Trying on real data: The problem

Commonly employed strategy:
  1. Run first batch of experiments
  2. Construct next batch from combinations of "greatest hits" from batch 1
    ...

In our case, this is a fatal source of distribution shift.

Trying on real data: Not looking so great

Observations:
  • Large fluctuations from week to week.
  • No "learning" over the course of the campaign.

Trying on real data: Uncertainty?

Uncertainty estimates are also not very reasonable.

Designing a new kernel

Composite kernel: \[\begin{align*}k(\mathbf{x},\mathbf{x'}) = \pi k_{\text{struct}}(\mathbf{x},\mathbf{x'}) + (1-\pi)k_{\text{seq}}(\mathbf{x},\mathbf{x'})\end{align*} \]

where

\[k_{\text{struct}}(\mathbf{x},\mathbf{x'}) = \sum_{i \in M} \sum_{j \in M'}\lambda k_H(\mathbf{x^i},\mathbf{x'^j})k_p(\mathbf{x^i},\mathbf{x'^j})k_d(\mathbf{x^i},\mathbf{x'^j})\]
Groth, Kerrn, Olsen, Salomon, Boomsma, Kermut: Composite kernel regression for protein variant effects. bioRxiv, 2024
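A sketch of the convex-combination structure of such a composite kernel is below. The two component kernels are simple RBFs over placeholder feature vectors; they stand in for, and do not reproduce, the Kermut structure and sequence kernels.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """RBF kernel between rows of two feature matrices."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def composite_kernel(struct1, struct2, seq1, seq2, pi=0.5):
    """k(x, x') = pi * k_struct(x, x') + (1 - pi) * k_seq(x, x')."""
    return pi * rbf(struct1, struct2) + (1 - pi) * rbf(seq1, seq2)

rng = np.random.default_rng(0)
struct_emb = rng.normal(size=(4, 16))   # placeholder structure features
seq_emb = rng.normal(size=(4, 32))      # placeholder sequence embeddings

K = composite_kernel(struct_emb, struct_emb, seq_emb, seq_emb, pi=0.6)
print(K.shape)  # (4, 4) Gram matrix, usable inside a GP as sketched above
```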

Designing a new kernel: SOTA performance

Groth, Kerrn, Olsen, Salomon, Boomsma, Kermut: Composite kernel regression for protein variant effects. bioRxiv, 2024

Designing a new kernel: Reasonable uncertainties

Groth, Kerrn, Olsen, Salomon, Boomsma, Kermut: Composite kernel regression for protein variant effects. bioRxiv, 2024

Takeaways

Overall:
  • The topics taught here are relevant 😊
Part 1:
  • Modeling choices are important. We are still making progress on decades-old problems.
Part 2:
  • Gaussian Processes are useful, not only academically interesting.
  • Bayesian Optimization is a compelling idea, but it requires good surrogate models, which can be difficult to obtain in high dimensions.