| | | Variational Inference | Deep Generative Models | Diffusion Models | Monte Carlo Methods | Probabilistic Circuits | Gaussian Processes | Causal Inference |
|---|---|---|---|---|---|---|---|---|
| Sequence analysis | DNA sequencing | | | | | | | |
| | Sequence assembly | | | | | | | |
| | Genome annotation | | | | | | | |
| | Comput. evol. biology | | | | | | | |
| | Comparative genomics | | | | | | | |
| | Pan genomics | | | | | | | |
| | Genetics of disease | | | | | | | |
| | Analysis of mut. in cancer | | | | | | | |
| Gene and protein expression | Analysis of gene express. | | | | | | | |
| | Analysis of protein expr. | | | | | | | |
| | Analysis of regulation | | | | | | | |
| Analysis of cell. organization | Microscopy image analysis | | | | | | | |
| | Protein localization | | | | | | | |
| | Nucl. organ. of chromatin | | | | | | | |
| Structural bioinformatics | Amino acid sequence | | | | | | | |
| | Homology | | | | | | | |
| Network and syst. biology | Mol. interaction networks | | | | | | | |
| Biodiversity informatics | | | | | | | | |
| Others | Literature analysis | | | | | | | |
| | High-thro. image analysis | | | | | | | |
| | High-thro. single-cell data | | | | | | | |
| | Ontologies and data int. | | | | | | | |
We represent the 20 amino acids with letters.
LPICPGGAARCQVTLRDLFDRAVVLSHY...
CPSIVARSNFNVCRLPGTPEA
LCATYTGCIIIPGATCPGDYAN
LPICPGGAARCQVTLRDLFDRAVVLSHYIHNLSSEMFSEFDKRYTHGRGFITKAINSCHTSSLATPEDKEQAQQMNQKDFLSLIVSILRSWNEPLYHLVTEVRGMQEAPEAILSKAVEIEEQTKRLLEGMELIVSQVHPETKENEIYPVWSGLPSLQMADEESRLSAYYNLLHCLRRDSHKIDNYLKLLKCRIIHNNNC
Simple model: assume all positions are independent
\[\begin{align*}P(\boldsymbol A) &= P(A_1, A_2, \ldots, A_L)\\ &\approx P(A_1)P(A_2) \ldots P(A_L)\end{align*}\]
Estimation: simple counting
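A minimal sketch of the counting estimate for the independent-site model. It assumes the sequences have already been aligned to a common length, which the raw sequences above are not; the toy alignment, the function names and the pseudocount are illustrative assumptions, not part of the original material.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def fit_site_frequencies(alignment, pseudocount=1.0):
    """Estimate P(A_i = a) for every position i by counting letters per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same (aligned) length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        # The pseudocount (an assumption) keeps unseen letters at non-zero probability.
        freqs.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return freqs

def log_prob(seq, freqs):
    """log P(A) under independence: the sum of per-position log-probabilities."""
    return sum(math.log(freqs[i][a]) for i, a in enumerate(seq))

# Toy alignment (hypothetical data, for illustration only).
alignment = ["ACDA", "ACDC", "ACEA", "GCDA"]
freqs = fit_site_frequencies(alignment)
print(log_prob("ACDA", freqs), log_prob("WWWW", freqs))
```

A sequence that matches the common letter in each column scores higher than one made of letters never seen in the alignment, which is all this model can express: it has no way to represent dependencies between positions.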
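Problem: the normalization constant \(Z\) is intractable, because it sums over all \(20^L\) possible sequences:
\(Z = \sum_A \exp \left ( \sum_{i=1}^L \phi_i(A_i) + \sum_{i \lt j}^L \psi_{i,j}(A_i, A_j)\right)\)
But the model can instead be optimized using a pseudo-likelihood.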
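To illustrate why the pseudo-likelihood avoids \(Z\), here is a small sketch that scores a sequence by the sum of per-position conditionals \(\log P(A_i \mid A_{-i})\). Each conditional only normalizes over the letters at one position. The two-letter alphabet, the parameter values and the function names are assumptions made for the example; the brute-force likelihood is included only as a check and is feasible only for tiny \(L\).

```python
import math
import itertools

def site_score(seq, i, a, phi, psi):
    """phi_i(a) plus every pairwise term involving position i, with A_i set to a."""
    total = phi[i][a]
    for j, b in enumerate(seq):
        if j < i:
            total += psi[(j, i)][(b, a)]
        elif j > i:
            total += psi[(i, j)][(a, b)]
    return total

def log_pseudo_likelihood(seq, phi, psi, alphabet):
    """Sum over positions of log P(A_i | all other positions); Z is never computed."""
    logp = 0.0
    for i, a in enumerate(seq):
        scores = {c: site_score(seq, i, c, phi, psi) for c in alphabet}
        m = max(scores.values())
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        logp += scores[a] - log_norm
    return logp

def exact_log_likelihood(seq, phi, psi, alphabet):
    """Brute-force log P(A): enumerates all len(alphabet)**L sequences to get Z."""
    def score(s):
        fields = sum(phi[i][a] for i, a in enumerate(s))
        pairs = sum(psi[(i, j)][(s[i], s[j])]
                    for i in range(len(s)) for j in range(i + 1, len(s)))
        return fields + pairs
    log_z = math.log(sum(math.exp(score(s))
                         for s in itertools.product(alphabet, repeat=len(seq))))
    return score(seq) - log_z

# Hypothetical toy parameters: 3 positions, 2 letters.
alphabet = "AC"
phi = [{"A": 0.5, "C": -0.5}, {"A": 0.0, "C": 0.0}, {"A": -1.0, "C": 1.0}]
pairs = [(0, 1), (0, 2), (1, 2)]
psi = {p: {(a, b): (0.8 if a == b else -0.8) for a in "AC" for b in "AC"} for p in pairs}
psi_zero = {p: {(a, b): 0.0 for a in "AC" for b in "AC"} for p in pairs}

print(log_pseudo_likelihood("AAC", phi, psi, alphabet))
print(exact_log_likelihood("AAC", phi, psi, alphabet))
```

With all couplings set to zero (`psi_zero`) the positions are independent and the two quantities coincide; with non-zero couplings they generally differ, since the pseudo-likelihood is a surrogate objective rather than the likelihood itself.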
What does it mean to be a high probability sequence?
High probability \(\approx\) high evolutionary fitness
Does this mean we see a correlation between likelihood and experimentally measured fitness? Yes!
Note the difference in performance between the three models
Does this also mean that we can detect disease-related variants using the likelihood?
The different positions in a protein are conditionally independent given the latent variable \(Z\).
\[\begin{align*}P(\boldsymbol A) &= \int_Z P(A_1|Z)P(A_2|Z) \ldots P(A_L|Z) P(Z) dZ\end{align*}\]
This means we are asking a single latent variable to capture all relevant covariances. That's asking a lot!
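The integral above can be approximated by sampling the latent variable from its prior and averaging the product of per-position conditionals. The sketch below assumes a one-dimensional standard normal prior and a hand-written linear-softmax "decoder" on a two-letter alphabet; all of these are illustrative stand-ins, not the model used in the material, and a trained VAE would normally use an encoder-based estimate rather than naive prior sampling.

```python
import math
import random
import itertools

ALPHABET = "AC"  # toy two-letter alphabet; real proteins use 20 letters

def decoder(z, weights, biases):
    """Toy stand-in for a decoder: returns P(A_i = a | Z = z) for each position i."""
    dists = []
    for w, b in zip(weights, biases):
        logits = {a: w[a] * z + b[a] for a in ALPHABET}
        m = max(logits.values())
        norm = sum(math.exp(v - m) for v in logits.values())
        dists.append({a: math.exp(v - m) / norm for a, v in logits.items()})
    return dists

def marginal_prob(seq, z_samples, weights, biases):
    """Monte Carlo estimate of P(A) = E_{z ~ P(Z)}[ prod_i P(A_i | z) ]."""
    total = 0.0
    for z in z_samples:
        dists = decoder(z, weights, biases)
        total += math.prod(dists[i][a] for i, a in enumerate(seq))
    return total / len(z_samples)

rng = random.Random(0)
z_samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # assumed prior P(Z) = N(0, 1)

# Hypothetical decoder parameters: both positions prefer "A" when z > 0, "C" when z < 0.
weights = [{"A": 2.0, "C": -2.0}, {"A": 2.0, "C": -2.0}]
biases = [{"A": 0.0, "C": 0.0}, {"A": 0.0, "C": 0.0}]

probs = {"".join(s): marginal_prob(s, z_samples, weights, biases)
         for s in itertools.product(ALPHABET, repeat=2)}
print(probs)
```

Although the positions are independent given \(Z\), the marginal is not a product of per-position terms: in this toy setup "AA" and "CC" come out more probable than "AC" and "CA", so the shared latent variable is what carries the covariance between positions.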
Hierarchical VAEs: multiple latent states to facilitate modeling of complex correlations.
How deep can we go?
\(\Rightarrow\) Diffusion models
Motivating example: Green Fluorescent Protein (GFP)
We thus need a good way to model distributions over functions.
We are not in 1D.
Depending on the representation, we often have hundreds of dimensions.
Questions:
In our case, this is a fatal source of distribution shift.
Uncertainty estimates are also not very reasonable
where
\[k_{\text{struct}}(\mathbf{x},\mathbf{x'}) = \sum_{i \in M} \sum_{j \in M'}\lambda k_H(\mathbf{x^i},\mathbf{x'^j})k_p(\mathbf{x^i},\mathbf{x'^j})k_d(\mathbf{x^i},\mathbf{x'^j})\]
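The double sum in \(k_{\text{struct}}\) can be sketched as follows. The text here does not define \(k_H\), \(k_p\), \(k_d\) or the site features, so the three kernels below are hypothetical squared-exponential stand-ins on made-up per-site features (`aa_dist`, `mut_prob`, `coord`); \(M\) and \(M'\) are assumed to be the lists of sites over which the two variants are compared. Only the structure of the computation, a \(\lambda\)-weighted sum over site pairs of a product of three kernels, is taken from the formula.

```python
import math

def rbf(u, v, scale=1.0):
    """Squared-exponential similarity between two equal-length feature tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * scale ** 2))

# Hypothetical stand-ins for the three site kernels in the formula.
def k_H(si, sj):
    return rbf(si["aa_dist"], sj["aa_dist"])

def k_p(si, sj):
    return rbf((si["mut_prob"],), (sj["mut_prob"],))

def k_d(si, sj):
    return rbf(si["coord"], sj["coord"], scale=10.0)

def k_struct(sites_x, sites_y, lam=1.0):
    """Sum over all site pairs (i in M, j in M') of lam * k_H * k_p * k_d."""
    return sum(lam * k_H(si, sj) * k_p(si, sj) * k_d(si, sj)
               for si in sites_x for sj in sites_y)

# Made-up site features for two variants (illustration only).
x_sites = [
    {"aa_dist": (0.7, 0.2, 0.1), "mut_prob": 0.10, "coord": (0.0, 0.0, 0.0)},
    {"aa_dist": (0.1, 0.8, 0.1), "mut_prob": 0.30, "coord": (5.0, 1.0, 0.0)},
]
y_sites = [
    {"aa_dist": (0.6, 0.3, 0.1), "mut_prob": 0.20, "coord": (1.0, 0.0, 2.0)},
]

print(k_struct(x_sites, y_sites, lam=0.5))
```

Because each factor is symmetric in its two arguments, swapping the two variants leaves the value unchanged, which is a basic sanity check for any kernel built this way.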