Chapter 2: Generative AI Fundamentals
Introduction: The Building Blocks of Generation
To harness generative AI for scientific discovery, we must understand how these models work, not as black boxes but as computational systems with clear mathematical foundations and architectural principles. This chapter demystifies, in plain English, the core technologies powering modern generative AI: transformers, diffusion models, flow matching, and variational approaches. We'll explore their mechanisms, understand when to use each, and see how they can be adapted for scientific applications.
While the mathematics can become complex, our focus is on building intuition. Scientists don't need to be machine learning experts to use these tools effectively, but understanding the fundamentals will help you choose the right model architecture, diagnose failures, adapt pre-trained models, collaborate with computational researchers, and evaluate limitations.
The Three Pillars of Generative AI
Modern generative AI rests on three major architectural families:
| Architecture | Core Mechanism | Best For | Example Applications |
|---|---|---|---|
| Transformers & LLMs | Self-attention over sequences | Text, code, sequences | Literature synthesis, protein sequences, code generation |
| Diffusion & Flow Models | Iterative denoising / flow matching | Structured outputs, images | Molecular structures, protein folding, climate data |
| VAEs & GANs | Latent space learning | Data generation, interpolation | Synthetic data, anomaly detection, compression |
Part I: Transformers and Large Language Models
The Attention Revolution
Traditional neural networks process sequences step-by-step. Transformers changed this with attention: let every element directly attend to every other element simultaneously [1]. The breakthrough "Attention Is All You Need" paper introduced self-attention mechanisms that model long-range dependencies in parallel, eliminating the sequential bottleneck of recurrent networks.
The Attention Mechanism
Attention computes three quantities for each sequence element [1]:
| Component | Role | Intuition |
|---|---|---|
| Query (Q) | What am I looking for? | The question each element asks |
| Key (K) | What do I contain? | How each element describes itself |
| Value (V) | What do I contribute? | The information each element provides |
Attention formula:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
Here, "softmax" turns the raw similarity scores into a probability distribution that tells the model how much attention to pay to each token.
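As a concrete illustration, the formula above can be sketched in a few lines of NumPy. This is a minimal, unbatched, single-head version; the toy shapes and random inputs are assumptions for the demo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise similarity of queries and keys
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per token
```

Each of the 4 tokens ends up with an output that mixes information from all other tokens, weighted by attention.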
Why This Works for Science:
- Protein Sequences: Connect distant amino acids that interact in 3D
- Scientific Literature: Link concepts mentioned paragraphs apart
- Chemical Reactions: Identify relationships between reactants and products
The Transformer Architecture
The original encoder-decoder architecture [1] consists of stacked attention and feed-forward layers:
Input → Embedding → Positional Encoding
        ↓
[ Multi-Head Attention → Add & Norm
        ↓
  Feed-Forward → Add & Norm ] × N layers
        ↓
Output Probabilities
Key Components:
| Component | Purpose | Scientific Benefit |
|---|---|---|
| Multi-Head Attention | Learn different patterns | Capture multiple relationship types |
| Residual Connections | Enable deep networks | Scale to billions of parameters |
| Layer Normalization | Stabilize training | Faster convergence |
| Positional Encoding | Track sequence order | Understand structure (DNA direction, time series) |
From Transformers to Large Language Models
1. Pre-Training: Learning General Patterns
Objective: Predict next token given context.
Input: "The protein binds to the"
Target: "receptor"
The GPT series demonstrated the power of autoregressive language modeling at scale [2, 3], while BERT showed bidirectional pre-training for understanding tasks [4]. More recent models like LLaMA [5], Gemini [6], and Claude push the boundaries with trillions of tokens and multimodal capabilities.
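To make the next-token objective concrete, here is a toy sketch of how a sentence becomes (context, target) training pairs scored with cross-entropy. The five-word vocabulary and the uniform "model" are stand-ins for illustration only:

```python
import numpy as np

# Hypothetical 5-token vocabulary for the demo.
vocab = ["the", "protein", "binds", "to", "receptor"]
tokens = [0, 1, 2, 3, 0, 4]  # "the protein binds to the receptor"

# Each position t yields a training pair: context tokens[:t+1] -> target tokens[t+1].
pairs = [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]

def cross_entropy(probs, target):
    # The pre-training loss is -log p(target | context).
    return -np.log(probs[target])

# Pretend the model assigns uniform probability to every token.
uniform = np.full(len(vocab), 1 / len(vocab))
loss = np.mean([cross_entropy(uniform, tgt) for _, tgt in pairs])
print(round(loss, 3))  # -log(1/5) ≈ 1.609
```

Training lowers this loss by shifting probability mass toward the actual next tokens, which is all "predict the next token" means in practice.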
Scientific Pre-Trained Models:
- SciBERT [7]: Scientific papers
- ESM-2 [8]: Protein sequences (750M sequences)
- ESM-3 [9]: Multimodal protein model (sequence, structure, function) with 98B parameters (2024/2025)
- Llama 3.1/3.2/3.3 [10]: Open foundation models with 128K context (2024)
2. Fine-Tuning: Domain Specialization
Full fine-tuning updates all model parameters, but parameter-efficient methods like LoRA [11] enable adaptation with minimal computational cost by learning low-rank updates to weight matrices.
| Method | Trainable Params | Data Needed | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | 10K-100K | Maximum adaptation |
| LoRA [11] | 0.1-1% | 100-10K | Limited compute |
| Adapters [12] | 1-5% | 1K-10K | Task-specific layers |
| QLoRA [13] | 0.1-1% | 100-10K | Quantized + LoRA (fine-tune 70B on consumer GPU) |
3. Prompting: Zero-Shot and Few-Shot
Large language models demonstrate remarkable few-shot learning capabilities [3], enabling scientific applications without task-specific training.
Zero-Shot:
Summarize this oceanography paper: [text]
Few-Shot:
Examples:
SMILES: CCO → Name: Ethanol
SMILES: CC(C)O → Name: Propan-2-ol
SMILES: CCCO → Name: ?
Reasoning Models: A New Paradigm (2024–2025)
A significant development in 2024–2025 is the emergence of reasoning models that "think before they respond" [14, 15]. Unlike standard LLMs that generate answers in a single pass, reasoning models like OpenAI's o1/o3 series and Gemini Deep Think produce an internal chain of thought before responding.
Key Characteristics:
- Extended reasoning time for complex problems
- Multi-step planning and self-verification
- Superior performance on math, coding, and scientific reasoning
- Configurable "reasoning effort" levels
Scientific Applications:
- Complex mathematical proofs
- Multi-step experimental design
- Code debugging and algorithm implementation
- Hypothesis evaluation requiring logical chains
Limitations for Science
| Challenge | Impact | Mitigation |
|---|---|---|
| Hallucination | False information | RAG [16], verification, reasoning models |
| Lack of Uncertainty | Overconfidence | Ensemble methods [17] |
| Data Cutoff | Outdated knowledge | Fine-tuning, RAG [16] |
| Context Limits | Long document handling | Extended context models (128K+ tokens) |
Part II: Diffusion Models and Flow Matching
The Denoising Paradigm
Diffusion models learn to generate by reversing a noise-addition process [18, 19]. The foundational work by Sohl-Dickstein et al. [18] established the thermodynamic interpretation, while Ho et al. [19] simplified training with the denoising objective.
Forward Process: Adding Noise
x_0 (real data) → x_1 → x_2 → ... → x_T (pure noise)
At each step: x_t = √(1−β_t)·x_{t−1} + √β_t·ε, where ε ~ N(0, I)
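The forward process can be simulated directly. This NumPy sketch (with an assumed linear noise schedule and toy 1-D "data") shows samples being driven from the data distribution toward a standard normal:

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t)·x_{t-1} + sqrt(beta_t)·eps step by step."""
    x = x0
    for beta in betas:
        eps = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, size=1000)   # "data": samples centred at 3
betas = np.linspace(1e-4, 0.2, 200)   # noise schedule (illustrative values)
xT = forward_diffusion(x0, betas, rng)
# After enough steps the data signal is gone: mean ≈ 0, std ≈ 1.
print(round(float(xT.mean()), 1), round(float(xT.std()), 1))
```

Whatever structure x_0 had, x_T is indistinguishable from pure Gaussian noise, which is exactly what makes the reverse process a valid generator.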
Reverse Process: Learning to Denoise
Train a network to predict the noise [19]:
Loss = E[||ε − ε_θ(x_t, t)||²]
Generation: Sampling
- Start with noise: x_T ~ N(0, I)
- Iteratively denoise: x_T → x_{T−1} → ... → x_0
- Result: a novel sample
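Putting the steps together, here is a sketch of the ancestral sampling loop from [19]. The zero-predicting "network" is a placeholder standing in for a trained noise-prediction model, so the output is structure, not a meaningful sample:

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """Ancestral sampling loop: start from pure noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=shape)  # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)  # predicted noise at step t
        # Posterior mean of x_{t-1} given x_t and the predicted noise [19].
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)  # stochastic term
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)  # assumed schedule for the demo
# Placeholder "network" that always predicts zero noise (demo only).
sample = ddpm_sample(lambda x, t: np.zeros_like(x), betas, (5,), rng)
print(sample.shape)  # (5,)
```

In a real model, eps_model would be a neural network trained with the loss above; the loop itself is unchanged.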
Flow Matching: A Powerful Alternative (2023–2025)
Flow Matching [20] has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in scientific applications [21]. Rather than learning to reverse a noising process, flow matching learns a velocity field that transports samples from noise to data along continuous paths.
Key Advantages:
- Straighter trajectories: Fewer sampling steps required
- Stable training: Simpler objectives, less hyperparameter tuning
- Faster inference: Often 2-4x speedup over diffusion
- Flexible conditioning: Natural incorporation of constraints
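A minimal sketch of how flow-matching training targets are built, assuming the simple linear (rectified-flow-style) paths: points are interpolated between noise and data, and the regression target for the velocity field is the constant displacement x1 − x0. The toy 2-D "data" is an assumption for the demo:

```python
import numpy as np

def flow_matching_targets(x0, x1, t):
    """Linear paths: x_t = (1-t)·x0 + t·x1, with target velocity v = x1 - x0."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 2))            # noise samples
x1 = rng.normal(loc=5.0, size=(64, 2))   # toy "data" samples
t = rng.uniform(size=64)                 # random times in [0, 1]
x_t, v_target = flow_matching_targets(x0, x1, t)

# A network v_theta(x_t, t) would be regressed onto v_target:
#   loss = mean ||v_theta(x_t, t) - v_target||^2
print(x_t.shape, v_target.shape)  # (64, 2) (64, 2)
```

Because the target paths are straight lines, the learned flow tends to need far fewer integration steps at sampling time than a diffusion reverse process.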
Flow Matching in Biology (2024–2025):
- Molecule generation (NeurIPS 2023) [21]
- Protein backbone generation with SE(3)-equivariant flows (ICLR 2024) [21]
- Antibody design with IgFlow and dyAb [21]
- Biological sequence and peptide generation (ICML 2024) [21]
Why Diffusion/Flow Works for Science
Score-based generative modeling [22] provides a continuous-time perspective, while empirical results show diffusion models produce higher-quality samples than GANs [23].
| Feature | Scientific Benefit |
|---|---|
| High Quality | Realistic structures |
| Stable Training | Easier than GANs [23] |
| Interpretable | Visualize generation |
| Conditional | Incorporate constraints [24] |
| Uncertainty | Multiple samples |
Applications:
- Molecular conformations with valency constraints [25]
- Protein structures respecting physics [26]
- Climate data filling gaps while conserving energy
- High-resolution image synthesis with latent diffusion [27]
- Probabilistic weather forecasting with GenCast [28]
Conditional Diffusion
Generate data with specific properties using classifier guidance or classifier-free guidance [24]:
p(x_{t−1} | x_t, condition)
Conditioning Methods:
| Method | Example |
|---|---|
| Classifier Guidance [23] | "Molecule binds to protein X" |
| Classifier-Free [24] | Train with/without conditions |
| Inpainting | Fill missing climate data |
Part III: VAEs and GANs
Variational Autoencoders (VAEs)
VAEs learn probabilistic latent representations through variational inference [29]:
Architecture:
Encoder: x → [μ(x), σ(x)]
Sample: z ~ N(μ, σ²)
Decoder: z → x̂
Loss:
Total = Reconstruction + KL_Divergence
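The two loss terms can be sketched directly. This toy example (with assumed encoder outputs and the reconstruction term stubbed out) shows the reparameterization trick and the closed-form KL divergence against a standard normal prior:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # z = mu + sigma·eps keeps sampling differentiable w.r.t. mu and sigma.
    return mu + sigma * rng.normal(size=mu.shape)

def kl_divergence(mu, sigma):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -0.3]])   # toy encoder outputs for one input
sigma = np.array([[0.8, 1.2]])
z = reparameterize(mu, sigma, rng)

recon_loss = 0.0               # stands in for ||x - decoder(z)||^2
total = recon_loss + kl_divergence(mu, sigma)
print(z.shape, total.shape)    # (1, 2) (1,)
```

The KL term pulls the latent distribution toward N(0, I), which is what makes the latent space smooth enough to interpolate in.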
Scientific Uses:
- Chemistry: Interpolate between molecules [30]
- Materials: Optimize in latent space
- Genomics: Compress gene expression data
Generative Adversarial Networks (GANs)
GANs introduced adversarial training between generator and discriminator [31]:
Two Networks in Competition:
| Network | Role | Goal |
|---|---|---|
| Generator | Create fakes | Fool discriminator |
| Discriminator | Judge real/fake | Detect fakes |
Challenges:
- Mode collapse (limited diversity)
- Training instability
- Difficult evaluation
Scientific Uses:
- Data augmentation
- Super-resolution
- Image-to-image translation
- Molecular graph generation [32]
Comparison
| Criterion | VAE | GAN | Diffusion | Flow Matching |
|---|---|---|---|---|
| Quality | Good | Excellent | Excellent | Excellent |
| Stability | Stable | Unstable | Stable | Very Stable |
| Diversity | Good | Poor | Excellent | Excellent |
| Speed | Fast | Fast | Slow | Moderate |
| Latent Space | Interpretable | Opaque | Implicit | Implicit |
Part IV: Pre-Training and Fine-Tuning
The Transfer Learning Pipeline
Pre-Training (general data, millions of examples)
        ↓
Fine-Tuning (domain data, thousands of examples)
        ↓
Prompting (task-specific, zero examples)
Foundation models [33] trained on massive corpora can be adapted to downstream scientific tasks with limited data.
Parameter-Efficient Fine-Tuning (PEFT)
LoRA Example [11]:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,                                 # adaptation rank
    lora_alpha=32,                        # scaling factor for the low-rank update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # typically well under 1% of parameters trainable
QLoRA [13] combines quantization with LoRA, enabling fine-tuning of 70B models on consumer GPUs, a game-changer for scientific labs with limited compute budgets.
Benefits:
- Train on single GPU
- Fast iteration
- Multiple task-specific versions
Open vs. Closed Models for Science
The 2024–2025 landscape offers scientists unprecedented choice:
| Model Family | Access | Parameters | Best For |
|---|---|---|---|
| Llama 3.1 / 3.2 / 3.3 | Open weights | 1Bβ405B (text); 11B / 90B (vision variants) | Local deployment, fine-tuning, on-prem and private customization |
| OpenAI GPT-4/4o/5.1/5.2 | API (and ChatGPT) | Not disclosed | Advanced reasoning, multimodal understanding, coding, tool-using agents |
| Gemini 2.0/3.0 (Pro, Flash) | API (Gemini API, Vertex AI) | Not disclosed | Multimodal tasks, very long context (up to ~1M tokens on supported models) |
| Claude (Claude 3.5, Sonnet/Opus 4.5) | API | Not disclosed | Deep analysis, coding, safety-oriented assistants, long-context workflows |
| ESM-3 | API + open weights (ESM3-open) | 1.4B (open); up to ~98B (largest) | Protein modeling, structure/function prediction, generative protein design |
Part V: Mathematical Foundations
Low-Rank Approximation
Large models use low-rank structures for efficiency [11].
Matrix Factorization:
W ∈ R^(m×n) ≈ A·B, where A ∈ R^(m×r), B ∈ R^(r×n), r ≪ min(m, n)
Applications:
- LoRA adaptation [11]
- Model compression
- Fast inference
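A truncated SVD gives the best rank-r factorization (by the Eckart-Young theorem). This sketch shows the factor shapes and the storage savings on a toy matrix:

```python
import numpy as np

def low_rank_approx(W, r):
    """Best rank-r approximation of W via truncated SVD: W ≈ A·B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]  # shape (m, r), singular values folded into A
    B = Vt[:r, :]         # shape (r, n)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
A, B = low_rank_approx(W, r=4)
# Storage drops from 64*32 = 2048 values to 64*4 + 4*32 = 384.
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, round(float(rel_err), 2))
```

LoRA [11] exploits the same structure in reverse: instead of approximating an existing matrix, it learns only the small A and B factors of a weight update.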
Quantization
Reduce precision for efficiency [34, 35]:
| Precision | Bits | Range | Use Case |
|---|---|---|---|
| FP32 | 32 | Full | Training |
| FP16/BF16 | 16 | Mixed | Fast training |
| INT8 | 8 | Limited | Inference |
| INT4 | 4 | Very limited | Edge deployment, QLoRA |
LLM.int8() [34] and GPTQ [35] enable accurate quantization without retraining.
Benefits:
- 4× smaller models (FP32 → INT8)
- 2–4× faster inference
- Run large models on consumer GPUs
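A minimal sketch of symmetric per-tensor INT8 quantization, the simplest scheme; production methods like LLM.int8() [34] and GPTQ [35] are considerably more sophisticated (per-channel scales, outlier handling, calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 with one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1000).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is at most half a quantization step.
print(q.dtype, round(float(np.abs(w - w_hat).max() / scale), 2))
```

The int8 tensor takes a quarter of the FP32 memory; only the single float scale must be stored alongside it.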
Optimization Theory
Key Algorithms:
| Optimizer | Mechanism | When to Use |
|---|---|---|
| SGD | Basic gradient descent | Small models, well-understood problems |
| Adam [36] | Adaptive learning rates | Default choice, most robust |
| AdamW [37] | Adam + weight decay | Large language models |
Learning Rate Schedules:
Warmup → Constant → Decay
- Warmup: Start small, gradually increase (stabilize training)
- Constant: Maintain learning rate (main training)
- Decay: Reduce toward end (fine-tune solution)
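The three phases above can be sketched as a simple schedule function; the step counts and rates below are illustrative assumptions, not standard values:

```python
def lr_schedule(step, warmup=100, constant=400, total=1000,
                peak=3e-4, final=3e-5):
    """Warmup -> constant -> linear decay (illustrative values only)."""
    if step < warmup:
        return peak * step / warmup                  # ramp up from 0
    if step < warmup + constant:
        return peak                                  # hold at the peak rate
    frac = (step - warmup - constant) / (total - warmup - constant)
    return peak + (final - peak) * min(frac, 1.0)    # decay toward final

# One value from each phase: warmup, constant, end of decay.
print(lr_schedule(50), lr_schedule(300), lr_schedule(1000))
```

In practice the same shape appears with cosine rather than linear decay; frameworks like PyTorch and Hugging Face Transformers ship such schedulers ready-made.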
Part VI: Types of Generative AI by Modality
Text Generation
Capabilities:
- Scientific writing
- Code generation
- Literature synthesis
- Hypothesis generation
Models: GPT-4o/4.5/o1/o3 [14, 15], Llama 3.1/3.2/3.3/4 [10], Gemini 2.0/3.0 [6], Claude.
Image Generation
Capabilities:
- Synthetic microscopy
- Medical imaging
- Scientific visualization
- Data augmentation
Models: Stable Diffusion 3/3.5 [38], SDXL [27], DALL-E 3
Molecular Generation
Capabilities:
- Drug discovery
- Material design
- Reaction prediction
Representations:
- SMILES strings (text-like)
- Molecular graphs [32]
- 3D conformations [25]
Models: MolGAN [32], Diffusion-based generators [25], VAEs [30], Flow matching models [21]
Protein Generation
Capabilities:
- Structure prediction (AlphaFold 3 [39], ESMFold [8])
- Sequence design
- Function prediction
- Multimodal generation (ESM-3) [9]
Models: ESM-2 [8], ESM-3 [9], ProtGPT2 [40], RFdiffusion [26]
2024–2025 Highlight: ESM-3 [9]. Published in Science (January 2025), ESM-3 is a 98B-parameter multimodal generative model that reasons over protein sequence, structure, and function simultaneously. It generated a novel green fluorescent protein (GFP) with only 58% sequence identity to known GFPs, equivalent to simulating 500 million years of evolution.
Graph Generation
Capabilities:
- Molecular graphs
- Protein-protein interaction networks
- Knowledge graphs
Models: GraphRNN, MolGAN [32], GraphVAE
Multimodal Generation
Capabilities:
- Text → Image (CLIP [41] + diffusion)
- Text → Molecule (MolT5)
- Image → Text (scientific figure captioning)
- Cross-modal retrieval
Models: CLIP [41], GPT-4o/4V [15], Gemini 2.0 [6]
Design Principles for Scientific Applications
1. Incorporate Domain Knowledge
Physics-Informed Neural Networks (PINNs) [42]: Embed differential equations in loss function:
Loss = Data_Loss + λ·Physics_Loss
Physics_Loss = ||∂u/∂t − α·∇²u||² (heat equation)
Benefits:
- Better generalization
- Physical consistency
- Data efficiency
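The physics residual can be checked numerically. This sketch evaluates the heat-equation residual of a known exact solution on a finite-difference grid; in a real PINN [42] the residual would be computed by automatic differentiation of the network output, but the loss term has the same form:

```python
import numpy as np

def heat_residual(u, dx, dt, alpha):
    """Finite-difference residual of u_t - alpha·u_xx on a grid u[t, x]."""
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt                       # forward in t
    u_xx = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2  # central in x
    return u_t - alpha * u_xx

# u(x, t) = exp(-alpha·t)·sin(x) solves the heat equation exactly.
alpha, dx, dt = 1.0, 0.01, 1e-5
x = np.arange(0, np.pi, dx)
t = np.arange(0, 0.01, dt)
u = np.exp(-alpha * t)[:, None] * np.sin(x)[None, :]

physics_loss = float(np.mean(heat_residual(u, dx, dt, alpha) ** 2))
print(physics_loss < 1e-4)  # True: near zero for a true solution
```

A candidate that violates the physics gets a large residual, so minimizing this term steers the model toward physically consistent outputs even where data is sparse.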
2. Quantify Uncertainty
Methods:
- Ensemble [17]: Train multiple models, measure spread
- Bayesian [43]: Maintain distributions over parameters
- Conformal [44]: Finite-sample coverage guarantees
Why Essential: Scientific decisions have consequences, so we must know when models are uncertain.
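A deep-ensemble sketch [17]: the ensemble mean is the prediction, and the spread across members is the uncertainty signal. The "models" here are toy stand-ins (the same linear map with perturbed weights):

```python
import numpy as np

def ensemble_predict(models, x):
    """Mean and spread across an ensemble; large spread flags low confidence."""
    preds = np.stack([m(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy "ensemble": each member has a slightly different weight, drawn at creation.
rng = np.random.default_rng(0)
models = [lambda x, w=rng.normal(2.0, 0.1): w * x for _ in range(5)]

x = np.array([1.0, 10.0])
mean, spread = ensemble_predict(models, x)
# Member disagreement grows with |x|: further from the data, less certain.
print(spread[1] > spread[0])  # True
```

With real networks the recipe is the same: train several models from different initializations and treat their disagreement as a (rough) uncertainty estimate.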
3. Ensure Reproducibility
Checklist:
- Set random seeds
- Log hyperparameters
- Version control data and code
- Document environment
- Share trained models
4. Validate Rigorously
| Validation Type | Purpose |
|---|---|
| Hold-out Test | Independent performance |
| Cross-validation | Robust estimates |
| Domain Expert | Scientific plausibility |
| Physical Constraints | Obey natural laws |
| Out-of-distribution | Generalization limits |
Practical Considerations
Model Selection Guide
| Your Goal | Recommended Architecture | Rationale |
|---|---|---|
| Generate scientific text | Transformer LLM [2, 3, 10] | Excellent at language |
| Complex reasoning/math | Reasoning models (o1/o3) [14, 15] | Multi-step verification |
| Design molecules | Diffusion [25] or Flow Matching [21] | High-quality structures |
| Predict protein structure | ESM-3 [9], AlphaFold 3 [39] | State-of-the-art specialized models |
| Fill missing data | Conditional diffusion [24] | Respects constraints |
| Synthetic data augmentation | GAN [31] or VAE [29] | Fast generation |
| Explore design space | VAE [29] | Interpretable latent space |
Computational Requirements
| Task | Typical Resources | Alternative |
|---|---|---|
| Fine-tune small LLM | Single GPU, 1-2 days | Use LoRA/QLoRA [11, 13] for efficiency |
| Train diffusion model | 4-8 GPUs, 3-7 days | Use pre-trained models [27] |
| Inference (LLM) | 1 GPU or CPU | Quantization [34, 35] for speed |
| Protein structure | 1 GPU, minutes | Use hosted APIs |
Common Pitfalls
| Pitfall | Consequence | Solution |
|---|---|---|
| Overfitting | Poor generalization | Regularization, more data |
| Underfitting | Poor performance | Larger model, more training |
| Data leakage | Inflated metrics | Careful splitting |
| Ignoring uncertainty | Overconfident predictions | Uncertainty quantification [17, 43, 44] |
| Black-box use | Unexplainable failures | Understand fundamentals |
Summary
This chapter introduced the core architectures powering generative AI:
Transformers [1] excel at sequential data through attention mechanisms, enabling scientific text generation [2, 3, 10], code synthesis, and protein language models [8, 9]. The emergence of reasoning models [14, 15] in 2024–2025 represents a paradigm shift for complex scientific problem-solving.
Diffusion models [18, 19, 22] and flow matching [20, 21] generate high-quality structured outputs by learning to reverse noise or transport samples along continuous paths, ideal for molecular design [25], protein structures [26], and climate data reconstruction [28].
VAEs [29] and GANs [31] provide complementary approaches for data generation, exploration, and augmentation.
Transfer learning [33], pre-training followed by fine-tuning [11, 12, 13], allows scientists to leverage massive general models with modest domain-specific data. Open models like Llama 3/4 [10] democratize access for scientific labs.
Mathematical foundations like low-rank approximation [11], quantization [34, 35], and optimization theory [36, 37] enable efficient training and deployment.
Understanding these fundamentals empowers you to:
- Choose appropriate models for your scientific problems
- Adapt existing models to new domains
- Diagnose and fix failures
- Collaborate effectively with AI researchers
- Push the boundaries of what's possible in your field
References
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners (GPT-2). OpenAI Blog. https://openai.com/research/better-language-models
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners (GPT-3). Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
[4] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171–4186. https://arxiv.org/abs/1810.04805
[5] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971
[6] Gemini Team, Google. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. https://arxiv.org/abs/2403.05530
[7] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Proceedings of EMNLP-IJCNLP, 3615–3620. https://arxiv.org/abs/1903.10676
[8] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM-2/ESMFold). Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
[9] Hayes, T., Rao, R., Akin, H., Sofroniew, N. J., Oktay, D., Lin, Z., et al. (2025). Simulating 500 million years of evolution with a language model (ESM-3). Science, 387(6736), 850–858. https://doi.org/10.1126/science.ads0018
[10] Meta AI. (2024). Llama 3.1, 3.2, and 3.3: The most capable openly available LLMs. Meta AI Blog. https://ai.meta.com/blog/meta-llama-3-1/
[11] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of ICLR. https://arxiv.org/abs/2106.09685
[12] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., et al. (2019). Parameter-efficient transfer learning for NLP. Proceedings of ICML, 2790–2799. https://arxiv.org/abs/1902.00751
[13] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2024). QLoRA: Efficient finetuning of quantized LLMs. Proceedings of NeurIPS, 36. https://arxiv.org/abs/2305.14314
[14] OpenAI. (2024). Introducing OpenAI o1. OpenAI Blog. https://openai.com/index/introducing-openai-o1-preview/
[15] OpenAI. (2024). GPT-4o and reasoning models. OpenAI Platform. https://platform.openai.com/docs/models
[16] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of NeurIPS, 33, 9459–9474. https://arxiv.org/abs/2005.11401
[17] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Proceedings of NeurIPS, 30. https://arxiv.org/abs/1612.01474
[18] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of ICML, 2256–2265. https://arxiv.org/abs/1503.03585
[19] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851. https://arxiv.org/abs/2006.11239
[20] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. Proceedings of ICLR. https://arxiv.org/abs/2210.02747
[21] Li, Z., Zeng, Z., Lin, X., Fang, F., Qu, Y., Xu, Z., et al. (2025). Flow matching meets biology and life science: A survey. arXiv preprint arXiv:2507.17731. https://arxiv.org/abs/2507.17731
[22] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. Proceedings of ICLR. https://arxiv.org/abs/2011.13456
[23] Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794. https://arxiv.org/abs/2105.05233
[24] Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. NeurIPS Workshop on Score-Based Methods. https://arxiv.org/abs/2207.12598
[25] Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022). Equivariant diffusion for molecule generation in 3D. Proceedings of ICML, 8867–8887. https://arxiv.org/abs/2203.17003
[26] Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., et al. (2023). De novo design of protein structure and function with RFdiffusion. Nature, 620(7976), 1089–1100. https://doi.org/10.1038/s41586-023-06415-8
[27] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models (Stable Diffusion). Proceedings of CVPR, 10684–10695. https://arxiv.org/abs/2112.10752
[28] Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., et al. (2024). Probabilistic weather forecasting with machine learning (GenCast). Nature, 636, 84–90. https://doi.org/10.1038/s41586-024-08252-9
[29] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. Proceedings of ICLR. https://arxiv.org/abs/1312.6114
[30] Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., et al. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2), 268–276. https://doi.org/10.1021/acscentsci.7b00572
[31] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial networks. Advances in Neural Information Processing Systems, 27. https://arxiv.org/abs/1406.2661
[32] De Cao, N., & Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models. https://arxiv.org/abs/1805.11973
[33] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. https://arxiv.org/abs/2108.07258
[34] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. Proceedings of NeurIPS, 35, 30318–30332. https://arxiv.org/abs/2208.07339
[35] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. Proceedings of ICLR. https://arxiv.org/abs/2210.17323
[36] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of ICLR. https://arxiv.org/abs/1412.6980
[37] Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization (AdamW). Proceedings of ICLR. https://arxiv.org/abs/1711.05101
[38] Stability AI. (2024). Stable Diffusion 3: Scaling rectified flow transformers for high-resolution image synthesis. https://stability.ai/stable-image
[39] Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w
[40] Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13(1), 4348. https://doi.org/10.1038/s41467-022-32007-7
[41] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision (CLIP). Proceedings of ICML, 8748–8763. https://arxiv.org/abs/2103.00020
[42] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045
[43] Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proceedings of ICML, 1050–1059. https://arxiv.org/abs/1506.02142
[44] Angelopoulos, A. N., & Bates, S. (2021). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511. https://arxiv.org/abs/2107.07511
Additional Resources
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. http://udlbook.github.io/udlbook/
- Tunstall, L., von Werra, L., & Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly Media.
- Foster, D. (2023). Generative Deep Learning (2nd ed.). O'Reilly Media. https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning. https://www.manning.com/books/build-a-large-language-model-from-scratch
- Holderrieth, P., & Erives, E. (2025). Introduction to Flow Matching and Diffusion Models. MIT Course 6.S184. https://diffusion.csail.mit.edu/
- Liu, J.P. (2025). How to Build and Fine-tune a Small Language Model. Leanpub. https://leanpub.com/howtobuildandfine-tuneasmalllanguagemodel
Key Review Papers
- Yang, L., et al. (2023). Diffusion models: A comprehensive survey. ACM Computing Surveys, 56(4), 1–39.
- Bommasani, R., et al. (2022). On the opportunities and risks of foundation models. https://arxiv.org/abs/2108.07258
- Zhao, W. X., et al. (2023). A survey of large language models. ACM Computing Surveys. https://arxiv.org/abs/2303.18223
- Li, Z., et al. (2025). Flow matching meets biology and life science: A survey. https://arxiv.org/abs/2507.17731
This chapter provides the foundational knowledge needed to understand and apply generative AI in scientific contexts. For the latest developments, regularly check arXiv (cs.LG, cs.CL, q-bio sections) and major conference proceedings (NeurIPS, ICML, ICLR, CVPR).
Next → Chapter 3: Scientific Data & Workflows, navigating the unique challenges of scientific datasets and integrating AI into research pipelines.