Chapter 2: Generative AI Fundamentals
Introduction: The Building Blocks of Generation
To harness generative AI for scientific discovery, we must understand how these models work, not as black boxes but as computational systems with clear mathematical foundations and architectural principles. This chapter demystifies, in plain English, the core technologies powering modern generative AI: transformers, diffusion models, flow matching, and variational approaches. We'll explore their mechanisms, understand when to use each, and see how they can be adapted for scientific applications.
While the mathematics can become complex, our focus is on building intuition. Scientists don't need to be machine learning experts to use these tools effectively, but understanding the fundamentals will help you choose the right model architecture, diagnose failures, adapt pre-trained models, collaborate with computational researchers, and evaluate limitations.
The Three Pillars of Generative AI
Modern generative AI rests on three major architectural families:
| Architecture | Core Mechanism | Best For | Example Applications |
|---|---|---|---|
| Transformers & LLMs | Self-attention over sequences | Text, code, sequences | Literature synthesis, protein sequences, code generation |
| Diffusion & Flow Models | Iterative denoising / flow matching | Structured outputs, images | Molecular structures, protein folding, climate data |
| VAEs & GANs | Latent space learning | Data generation, interpolation | Synthetic data, anomaly detection, compression |
Part I: Transformers and Large Language Models
The Attention Revolution
Traditional neural networks process sequences step-by-step. Transformers changed this with attention: let every element directly attend to every other element simultaneously [1]. The breakthrough "Attention Is All You Need" paper introduced self-attention mechanisms that model long-range dependencies in parallel, eliminating the sequential bottleneck of recurrent networks.
The Attention Mechanism
Attention computes three quantities for each sequence element [1]:
| Component | Role | Intuition |
|---|---|---|
| Query (Q) | What am I looking for? | The question each element asks |
| Key (K) | What do I contain? | How each element describes itself |
| Value (V) | What do I contribute? | The information each element provides |
Attention formula:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
Here, "softmax" turns the raw similarity scores into a probability distribution that tells the model how much attention to pay to each token.
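As a concrete illustration, the formula above can be sketched in a few lines of NumPy. This is a minimal, unbatched, single-head version; the toy shapes and random inputs are assumptions for the demo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise similarity of queries and keys
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per token
```

Each of the 4 tokens ends up with an output that mixes information from all other tokens, weighted by attention.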
Why This Works for Science:
- Protein Sequences: Connect distant amino acids that interact in 3D
- Scientific Literature: Link concepts mentioned paragraphs apart
- Chemical Reactions: Identify relationships between reactants and products
The Transformer Architecture
The original encoder-decoder architecture [1] consists of stacked attention and feed-forward layers:
Input → Embedding → Positional Encoding
        ↓
[ Multi-Head Attention → Add & Norm
        ↓
  Feed-Forward → Add & Norm ] × N layers
        ↓
Output Probabilities
Key Components:
| Component | Purpose | Scientific Benefit |
|---|---|---|
| Multi-Head Attention | Learn different patterns | Capture multiple relationship types |
| Residual Connections | Enable deep networks | Scale to billions of parameters |
| Layer Normalization | Stabilize training | Faster convergence |
| Positional Encoding | Track sequence order | Understand structure (DNA direction, time series) |
From Transformers to Large Language Models
1. Pre-Training: Learning General Patterns
Objective: Predict next token given context.
Input: "The protein binds to the"
Target: "receptor"
The GPT series demonstrated the power of autoregressive language modeling at scale [2, 3], while BERT showed bidirectional pre-training for understanding tasks [4]. More recent models like LLaMA [5], Gemini [6], and Claude push the boundaries with trillions of tokens and multimodal capabilities.
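To make the next-token objective concrete, here is a toy sketch of how a sentence becomes (context, target) training pairs scored with cross-entropy. The five-word vocabulary and the uniform "model" are stand-ins for illustration only:

```python
import numpy as np

# Hypothetical 5-token vocabulary for the demo.
vocab = ["the", "protein", "binds", "to", "receptor"]
tokens = [0, 1, 2, 3, 0, 4]  # "the protein binds to the receptor"

# Each position t yields a training pair: context tokens[:t+1] -> target tokens[t+1].
pairs = [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]

def cross_entropy(probs, target):
    # The pre-training loss is -log p(target | context).
    return -np.log(probs[target])

# Pretend the model assigns uniform probability to every token.
uniform = np.full(len(vocab), 1 / len(vocab))
loss = np.mean([cross_entropy(uniform, tgt) for _, tgt in pairs])
print(round(loss, 3))  # -log(1/5) ≈ 1.609
```

Training lowers this loss by shifting probability mass toward the actual next tokens, which is all "predict the next token" means in practice.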
Scientific Pre-Trained Models:
- SciBERT [7]: Scientific papers
- ESM-2 [8]: Protein sequences (750M sequences)
- ESM-3 [9]: Multimodal protein model (sequence, structure, function) with 98B parameters (2024/2025)
- Llama 3.1/3.2/3.3 [10]: Open foundation models with 128K context (2024)
2. Fine-Tuning: Domain Specialization
Full fine-tuning updates all model parameters, but parameter-efficient methods like LoRA [11] enable adaptation with minimal computational cost by learning low-rank updates to weight matrices.
| Method | Trainable Params | Data Needed | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | 10K-100K | Maximum adaptation |
| LoRA [11] | 0.1-1% | 100-10K | Limited compute |
| Adapters [12] | 1-5% | 1K-10K | Task-specific layers |
| QLoRA [13] | 0.1-1% | 100-10K | Quantized + LoRA (fine-tune 70B on consumer GPU) |
3. Prompting: Zero-Shot and Few-Shot
Large language models demonstrate remarkable few-shot learning capabilities [3], enabling scientific applications without task-specific training.
Zero-Shot:
Summarize this oceanography paper: [text]
Few-Shot:
Examples:
SMILES: CCO → Name: Ethanol
SMILES: CC(C)O → Name: Propan-2-ol
SMILES: CCCO → Name: ?
Reasoning Models: A New Paradigm (2024–2025)
A significant development in 2024–2025 is the emergence of reasoning models that "think before they respond" [14, 15]. Unlike standard LLMs that generate answers in a single pass, reasoning models like OpenAI's o1/o3 series and Gemini Deep Think produce an internal chain of thought before responding.
Key Characteristics:
- Extended reasoning time for complex problems
- Multi-step planning and self-verification
- Superior performance on math, coding, and scientific reasoning
- Configurable "reasoning effort" levels
Scientific Applications:
- Complex mathematical proofs
- Multi-step experimental design
- Code debugging and algorithm implementation
- Hypothesis evaluation requiring logical chains
Limitations for Science
| Challenge | Impact | Mitigation |
|---|---|---|
| Hallucination | False information | RAG [16], verification, reasoning models |
| Lack of Uncertainty | Overconfidence | Ensemble methods [17] |
| Data Cutoff | Outdated knowledge | Fine-tuning, RAG [16] |
| Context Limits | Long document handling | Extended context models (128K+ tokens) |
Part II: Diffusion Models and Flow Matching
The Denoising Paradigm
Diffusion models learn to generate by reversing a noise-addition process [18, 19]. The foundational work by Sohl-Dickstein et al. [18] established the thermodynamic interpretation, while Ho et al. [19] simplified training with the denoising objective.
Forward Process: Adding Noise
x_0 (real data) → x_1 → x_2 → ... → x_T (pure noise)
At each step: x_t = √(1−β_t)·x_{t−1} + √β_t·ε, where ε ~ N(0, I)
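The forward process can be simulated directly. This NumPy sketch (with an assumed linear noise schedule and toy 1-D "data") shows samples being driven from the data distribution toward a standard normal:

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t)·x_{t-1} + sqrt(beta_t)·eps step by step."""
    x = x0
    for beta in betas:
        eps = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, size=1000)   # "data": samples centred at 3
betas = np.linspace(1e-4, 0.2, 200)   # noise schedule (illustrative values)
xT = forward_diffusion(x0, betas, rng)
# After enough steps the data signal is gone: mean ≈ 0, std ≈ 1.
print(round(float(xT.mean()), 1), round(float(xT.std()), 1))
```

Whatever structure x_0 had, x_T is indistinguishable from pure Gaussian noise, which is exactly what makes the reverse process a valid generator.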
Reverse Process: Learning to Denoise
Train a network to predict the noise [19]:
Loss = E[||ε − ε_θ(x_t, t)||²]
Generation: Sampling
- Start with noise: x_T ~ N(0, I)
- Iteratively denoise: x_T → x_{T−1} → ... → x_0
- Result: a novel sample
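Putting the steps together, here is a sketch of the ancestral sampling loop from [19]. The zero-predicting "network" is a placeholder standing in for a trained noise-prediction model, so the output is structure, not a meaningful sample:

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """Ancestral sampling loop: start from pure noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=shape)  # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)  # predicted noise at step t
        # Posterior mean of x_{t-1} given x_t and the predicted noise [19].
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)  # stochastic term
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)  # assumed schedule for the demo
# Placeholder "network" that always predicts zero noise (demo only).
sample = ddpm_sample(lambda x, t: np.zeros_like(x), betas, (5,), rng)
print(sample.shape)  # (5,)
```

In a real model, eps_model would be a neural network trained with the loss above; the loop itself is unchanged.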
Flow Matching: A Powerful Alternative (2023–2025)
Flow Matching [20] has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in scientific applications [21]. Rather than learning to reverse a noising process, flow matching learns a velocity field that transports samples from noise to data along continuous paths.
Key Advantages:
- Straighter trajectories: Fewer sampling steps required
- Stable training: Simpler objectives, less hyperparameter tuning
- Faster inference: Often 2-4x speedup over diffusion
- Flexible conditioning: Natural incorporation of constraints
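A minimal sketch of how flow-matching training targets are built, assuming the simple linear (rectified-flow-style) paths: points are interpolated between noise and data, and the regression target for the velocity field is the constant displacement x1 − x0. The toy 2-D "data" is an assumption for the demo:

```python
import numpy as np

def flow_matching_targets(x0, x1, t):
    """Linear paths: x_t = (1-t)·x0 + t·x1, with target velocity v = x1 - x0."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 2))            # noise samples
x1 = rng.normal(loc=5.0, size=(64, 2))   # toy "data" samples
t = rng.uniform(size=64)                 # random times in [0, 1]
x_t, v_target = flow_matching_targets(x0, x1, t)

# A network v_theta(x_t, t) would be regressed onto v_target:
#   loss = mean ||v_theta(x_t, t) - v_target||^2
print(x_t.shape, v_target.shape)  # (64, 2) (64, 2)
```

Because the target paths are straight lines, the learned flow tends to need far fewer integration steps at sampling time than a diffusion reverse process.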
Flow Matching in Biology (2024–2025):
- Molecule generation (NeurIPS 2023) [21]
- Protein backbone generation with SE(3)-equivariant flows (ICLR 2024) [21]
- Antibody design with IgFlow and dyAb [21]
- Biological sequence and peptide generation (ICML 2024) [21]
Why Diffusion/Flow Works for Science
Score-based generative modeling [22] provides a continuous-time perspective, while empirical results show diffusion models produce higher-quality samples than GANs [23].
| Feature | Scientific Benefit |
|---|---|
| High Quality | Realistic structures |
| Stable Training | Easier than GANs [23] |
| Interpretable | Visualize generation |
| Conditional | Incorporate constraints [24] |
| Uncertainty | Multiple samples |
Applications:
- Molecular conformations with valency constraints [25]
- Protein structures respecting physics [26]
- Climate data filling gaps while conserving energy
- High-resolution image synthesis with latent diffusion [27]
- Probabilistic weather forecasting with GenCast [28]
Conditional Diffusion
Generate data with specific properties using classifier guidance or classifier-free guidance [24]:
p(x_{t−1} | x_t, condition)
Conditioning Methods:
| Method | Example |
|---|---|
| Classifier Guidance [23] | "Molecule binds to protein X" |
| Classifier-Free [24] | Train with/without conditions |
| Inpainting | Fill missing climate data |
Part III: VAEs and GANs
Variational Autoencoders (VAEs)
VAEs learn probabilistic latent representations through variational inference [29]:
Architecture:
Encoder: x → [μ(x), σ(x)]
Sample: z ~ N(μ, σ²)
Decoder: z → x̂
Loss:
Total = Reconstruction + KL_Divergence
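The two loss terms can be sketched directly. This toy example (with assumed encoder outputs and the reconstruction term stubbed out) shows the reparameterization trick and the closed-form KL divergence against a standard normal prior:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # z = mu + sigma·eps keeps sampling differentiable w.r.t. mu and sigma.
    return mu + sigma * rng.normal(size=mu.shape)

def kl_divergence(mu, sigma):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -0.3]])   # toy encoder outputs for one input
sigma = np.array([[0.8, 1.2]])
z = reparameterize(mu, sigma, rng)

recon_loss = 0.0               # stands in for ||x - decoder(z)||^2
total = recon_loss + kl_divergence(mu, sigma)
print(z.shape, total.shape)    # (1, 2) (1,)
```

The KL term pulls the latent distribution toward N(0, I), which is what makes the latent space smooth enough to interpolate in.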
Scientific Uses:
- Chemistry: Interpolate between molecules [30]
- Materials: Optimize in latent space
- Genomics: Compress gene expression data
Generative Adversarial Networks (GANs)
GANs introduced adversarial training between generator and discriminator [31]:
Two Networks in Competition:
| Network | Role | Goal |
|---|---|---|
| Generator | Create fakes | Fool discriminator |
| Discriminator | Judge real/fake | Detect fakes |
Challenges:
- Mode collapse (limited diversity)
- Training instability
- Difficult evaluation
Scientific Uses:
- Data augmentation
- Super-resolution
- Image-to-image translation
- Molecular graph generation [32]
Comparison
| Criterion | VAE | GAN | Diffusion | Flow Matching |
|---|---|---|---|---|
| Quality | Good | Excellent | Excellent | Excellent |
| Stability | Stable | Unstable | Stable | Very Stable |
| Diversity | Good | Poor | Excellent | Excellent |
| Speed | Fast | Fast | Slow | Moderate |
| Latent Space | Interpretable | Opaque | Implicit | Implicit |
Part IV: Pre-Training and Fine-Tuning
The Transfer Learning Pipeline
Pre-Training (general data, millions of examples)
        ↓
Fine-Tuning (domain data, thousands of examples)
        ↓
Prompting (task-specific, zero examples)
Foundation models [33] trained on massive corpora can be adapted to downstream scientific tasks with limited data.
Parameter-Efficient Fine-Tuning (PEFT)
LoRA Example [11]:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,                                 # adaptation rank
    lora_alpha=32,                        # scaling factor for the low-rank update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # typically well under 1% of parameters trainable
QLoRA [13] combines quantization with LoRA, enabling fine-tuning of 70B models on consumer GPUs, a game-changer for scientific labs with limited compute budgets.
Benefits:
- Train on single GPU
- Fast iteration
- Multiple task-specific versions
Open vs. Closed Models for Science
The 2024–2025 landscape offers scientists unprecedented choice:
| Model Family | Access | Parameters | Best For |
|---|---|---|---|
| Llama 3.1 / 3.2 / 3.3 | Open weights | 1Bβ405B (text); 11B / 90B (vision variants) | Local deployment, fine-tuning, on-prem and private customization |
| OpenAI GPT-4/4o/5.1/5.2 | API (and ChatGPT) | Not disclosed | Advanced reasoning, multimodal understanding, coding, tool-using agents |
| Gemini 2.0/3.0 (Pro, Flash) | API (Gemini API, Vertex AI) | Not disclosed | Multimodal tasks, very long context (up to ~1M tokens on supported models) |
| Claude (Claude 3.5, Sonnet/Opus 4.5) | API | Not disclosed | Deep analysis, coding, safety-oriented assistants, long-context workflows |
| ESM-3 | API + open weights (ESM3-open) | 1.4B (open); up to ~98B (largest) | Protein modeling, structure/function prediction, generative protein design |
Part V: Mathematical Foundations
Low-Rank Approximation
Large models use low-rank structures for efficiency [11].
Matrix Factorization:
W ∈ R^(m×n) ≈ A·B, where A ∈ R^(m×r), B ∈ R^(r×n), r ≪ min(m, n)
Applications:
- LoRA adaptation [11]
- Model compression
- Fast inference
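A truncated SVD gives the best rank-r factorization (by the Eckart-Young theorem). This sketch shows the factor shapes and the storage savings on a toy matrix:

```python
import numpy as np

def low_rank_approx(W, r):
    """Best rank-r approximation of W via truncated SVD: W ≈ A·B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]  # shape (m, r), singular values folded into A
    B = Vt[:r, :]         # shape (r, n)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
A, B = low_rank_approx(W, r=4)
# Storage drops from 64*32 = 2048 values to 64*4 + 4*32 = 384.
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, round(float(rel_err), 2))
```

LoRA [11] exploits the same structure in reverse: instead of approximating an existing matrix, it learns only the small A and B factors of a weight update.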
Quantization
Reduce precision for efficiency [34, 35]:
| Precision | Bits | Range | Use Case |
|---|---|---|---|
| FP32 | 32 | Full | Training |
| FP16/BF16 | 16 | Mixed | Fast training |
| INT8 | 8 | Limited | Inference |
| INT4 | 4 | Very limited | Edge deployment, QLoRA |
LLM.int8() [34] and GPTQ [35] enable accurate quantization without retraining.
Benefits:
- 4× smaller models (FP32 → INT8)
- 2–4× faster inference
- Run large models on consumer GPUs
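A minimal sketch of symmetric per-tensor INT8 quantization, the simplest scheme; production methods like LLM.int8() [34] and GPTQ [35] are considerably more sophisticated (per-channel scales, outlier handling, calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 with one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1000).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is at most half a quantization step.
print(q.dtype, round(float(np.abs(w - w_hat).max() / scale), 2))
```

The int8 tensor takes a quarter of the FP32 memory; only the single float scale must be stored alongside it.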
Optimization Theory
Key Algorithms:
| Optimizer | Mechanism | When to Use |
|---|---|---|
| SGD | Basic gradient descent | Small models, well-understood problems |
| Adam [36] | Adaptive learning rates | Default choice, most robust |
| AdamW [37] | Adam + weight decay | Large language models |
Learning Rate Schedules:
Warmup → Constant → Decay
- Warmup: Start small, gradually increase (stabilize training)
- Constant: Maintain learning rate (main training)
- Decay: Reduce toward end (fine-tune solution)
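The three phases above can be sketched as a simple schedule function; the step counts and rates below are illustrative assumptions, not standard values:

```python
def lr_schedule(step, warmup=100, constant=400, total=1000,
                peak=3e-4, final=3e-5):
    """Warmup -> constant -> linear decay (illustrative values only)."""
    if step < warmup:
        return peak * step / warmup                  # ramp up from 0
    if step < warmup + constant:
        return peak                                  # hold at the peak rate
    frac = (step - warmup - constant) / (total - warmup - constant)
    return peak + (final - peak) * min(frac, 1.0)    # decay toward final

# One value from each phase: warmup, constant, end of decay.
print(lr_schedule(50), lr_schedule(300), lr_schedule(1000))
```

In practice the same shape appears with cosine rather than linear decay; frameworks like PyTorch and Hugging Face Transformers ship such schedulers ready-made.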
Part VI: Types of Generative AI by Modality
Text Generation
Capabilities:
- Scientific writing
- Code generation
- Literature synthesis
- Hypothesis generation
Models: GPT-4o/4.5/o1/o3 [14, 15], Llama 3.1/3.2/3.3/4 [10], Gemini 2.0/3.0 [6], Claude.
Image Generation
Capabilities:
- Synthetic microscopy
- Medical imaging
- Scientific visualization
- Data augmentation
Models: Stable Diffusion 3/3.5 [38], SDXL [27], DALL-E 3
Molecular Generation
Capabilities:
- Drug discovery
- Material design
- Reaction prediction
Representations:
- SMILES strings (text-like)
- Molecular graphs [32]
- 3D conformations [25]
Models: MolGAN [32], Diffusion-based generators [25], VAEs [30], Flow matching models [21]
Protein Generation
Capabilities:
- Structure prediction (AlphaFold 3 [39], ESMFold [8])
- Sequence design
- Function prediction
- Multimodal generation (ESM-3) [9]
Models: ESM-2 [8], ESM-3 [9], ProtGPT2 [40], RFdiffusion [26]
2024–2025 Highlight: ESM-3 [9]. Published in Science (January 2025), ESM-3 is a 98B-parameter multimodal generative model that reasons over protein sequence, structure, and function simultaneously. It generated a novel green fluorescent protein (GFP) with only 58% sequence identity to known GFPs, equivalent to simulating 500 million years of evolution.
Graph Generation
Capabilities:
- Molecular graphs
- Protein-protein interaction networks
- Knowledge graphs
Models: GraphRNN, MolGAN [32], GraphVAE
Multimodal Generation
Capabilities:
- Text → Image (CLIP [41] + diffusion)
- Text → Molecule (MolT5)
- Image → Text (scientific figure captioning)
- Cross-modal retrieval
Models: CLIP [41], GPT-4o/4V [15], Gemini 2.0 [6]
Design Principles for Scientific Applications
1. Incorporate Domain Knowledge
Physics-Informed Neural Networks (PINNs) [42]: Embed differential equations in loss function:
Loss = Data_Loss + λ·Physics_Loss
Physics_Loss = ||∂u/∂t − α·∇²u||² (heat equation)
Benefits:
- Better generalization
- Physical consistency
- Data efficiency
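The physics residual can be checked numerically. This sketch evaluates the heat-equation residual of a known exact solution on a finite-difference grid; in a real PINN [42] the residual would be computed by automatic differentiation of the network output, but the loss term has the same form:

```python
import numpy as np

def heat_residual(u, dx, dt, alpha):
    """Finite-difference residual of u_t - alpha·u_xx on a grid u[t, x]."""
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt                       # forward in t
    u_xx = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2  # central in x
    return u_t - alpha * u_xx

# u(x, t) = exp(-alpha·t)·sin(x) solves the heat equation exactly.
alpha, dx, dt = 1.0, 0.01, 1e-5
x = np.arange(0, np.pi, dx)
t = np.arange(0, 0.01, dt)
u = np.exp(-alpha * t)[:, None] * np.sin(x)[None, :]

physics_loss = float(np.mean(heat_residual(u, dx, dt, alpha) ** 2))
print(physics_loss < 1e-4)  # True: near zero for a true solution
```

A candidate that violates the physics gets a large residual, so minimizing this term steers the model toward physically consistent outputs even where data is sparse.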
2. Quantify Uncertainty
Methods:
- Ensemble [17]: Train multiple models, measure spread
- Bayesian [43]: Maintain distributions over parameters
- Conformal [44]: Finite-sample coverage guarantees
Why Essential: Scientific decisions have consequences, so we must know when models are uncertain.
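A deep-ensemble sketch [17]: the ensemble mean is the prediction, and the spread across members is the uncertainty signal. The "models" here are toy stand-ins (the same linear map with perturbed weights):

```python
import numpy as np

def ensemble_predict(models, x):
    """Mean and spread across an ensemble; large spread flags low confidence."""
    preds = np.stack([m(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy "ensemble": each member has a slightly different weight, drawn at creation.
rng = np.random.default_rng(0)
models = [lambda x, w=rng.normal(2.0, 0.1): w * x for _ in range(5)]

x = np.array([1.0, 10.0])
mean, spread = ensemble_predict(models, x)
# Member disagreement grows with |x|: further from the data, less certain.
print(spread[1] > spread[0])  # True
```

With real networks the recipe is the same: train several models from different initializations and treat their disagreement as a (rough) uncertainty estimate.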
3. Ensure Reproducibility
Checklist:
- Set random seeds
- Log hyperparameters
- Version control data and code
- Document environment
- Share trained models
4. Validate Rigorously
| Validation Type | Purpose |
|---|---|
| Hold-out Test | Independent performance |
| Cross-validation | Robust estimates |
| Domain Expert | Scientific plausibility |
| Physical Constraints | Obey natural laws |
| Out-of-distribution | Generalization limits |
Practical Considerations
Model Selection Guide
| Your Goal | Recommended Architecture | Rationale |
|---|---|---|
| Generate scientific text | Transformer LLM [2, 3, 10] | Excellent at language |
| Complex reasoning/math | Reasoning models (o1/o3) [14, 15] | Multi-step verification |
| Design molecules | Diffusion [25] or Flow Matching [21] | High-quality structures |
| Predict protein structure | ESM-3 [9], AlphaFold 3 [39] | State-of-the-art specialized models |
| Fill missing data | Conditional diffusion [24] | Respects constraints |
| Synthetic data augmentation | GAN [31] or VAE [29] | Fast generation |
| Explore design space | VAE [29] | Interpretable latent space |
Computational Requirements
| Task | Typical Resources | Alternative |
|---|---|---|
| Fine-tune small LLM | Single GPU, 1-2 days | Use LoRA/QLoRA [11, 13] for efficiency |
| Train diffusion model | 4-8 GPUs, 3-7 days | Use pre-trained models [27] |
| Inference (LLM) | 1 GPU or CPU | Quantization [34, 35] for speed |
| Protein structure | 1 GPU, minutes | Use hosted APIs |
Common Pitfalls
| Pitfall | Consequence | Solution |
|---|---|---|
| Overfitting | Poor generalization | Regularization, more data |
| Underfitting | Poor performance | Larger model, more training |
| Data leakage | Inflated metrics | Careful splitting |
| Ignoring uncertainty | Overconfident predictions | Uncertainty quantification [17, 43, 44] |
| Black-box use | Unexplainable failures | Understand fundamentals |
Summary
This chapter introduced the core architectures powering generative AI:
Transformers [1] excel at sequential data through attention mechanisms, enabling scientific text generation [2, 3, 10], code synthesis, and protein language models [8, 9]. The emergence of reasoning models [14, 15] in 2024–2025 represents a paradigm shift for complex scientific problem-solving.
Diffusion models [18, 19, 22] and flow matching [20, 21] generate high-quality structured outputs by learning to reverse noise or transport samples along continuous paths, ideal for molecular design [25], protein structures [26], and climate data reconstruction [28].
VAEs [29] and GANs [31] provide complementary approaches for data generation, exploration, and augmentation.
Transfer learning [33], pre-training followed by fine-tuning [11, 12, 13], allows scientists to leverage massive general models with modest domain-specific data. Open models like Llama 3/4 [10] democratize access for scientific labs.
Mathematical foundations like low-rank approximation [11], quantization [34, 35], and optimization theory [36, 37] enable efficient training and deployment.
Understanding these fundamentals empowers you to:
- Choose appropriate models for your scientific problems
- Adapt existing models to new domains
- Diagnose and fix failures
- Collaborate effectively with AI researchers
- Push the boundaries of what's possible in your field
References
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners (GPT-2). OpenAI Blog. https://openai.com/research/better-language-models
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners (GPT-3). Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
[4] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171–4186. https://arxiv.org/abs/1810.04805
[5] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971
[6] Gemini Team, Google. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. https://arxiv.org/abs/2403.05530
[7] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. Proceedings of EMNLP-IJCNLP, 3615–3620. https://arxiv.org/abs/1903.10676
[8] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM-2/ESMFold). Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
[9] Hayes, T., Rao, R., Akin, H., Sofroniew, N. J., Oktay, D., Lin, Z., et al. (2025). Simulating 500 million years of evolution with a language model (ESM-3). Science, 387(6736), 850–858. https://doi.org/10.1126/science.ads0018
[10] Meta AI. (2024). Llama 3.1, 3.2, and 3.3: The most capable openly available LLMs. Meta AI Blog. https://ai.meta.com/blog/meta-llama-3-1/
[11] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of ICLR. https://arxiv.org/abs/2106.09685
[12] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., et al. (2019). Parameter-efficient transfer learning for NLP. Proceedings of ICML, 2790–2799. https://arxiv.org/abs/1902.00751
[13] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2024). QLoRA: Efficient finetuning of quantized LLMs. Proceedings of NeurIPS, 36. https://arxiv.org/abs/2305.14314
[14] OpenAI. (2024). Introducing OpenAI o1. OpenAI Blog. https://openai.com/index/introducing-openai-o1-preview/
[15] OpenAI. (2024). GPT-4o and reasoning models. OpenAI Platform. https://platform.openai.com/docs/models
[16] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of NeurIPS, 33, 9459–9474. https://arxiv.org/abs/2005.11401
[17] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Proceedings of NeurIPS, 30. https://arxiv.org/abs/1612.01474
[18] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of ICML, 2256–2265. https://arxiv.org/abs/1503.03585
[19] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851. https://arxiv.org/abs/2006.11239
[20] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. Proceedings of ICLR. https://arxiv.org/abs/2210.02747
[21] Li, Z., Zeng, Z., Lin, X., Fang, F., Qu, Y., Xu, Z., et al. (2025). Flow matching meets biology and life science: A survey. arXiv preprint arXiv:2507.17731. https://arxiv.org/abs/2507.17731
[22] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. Proceedings of ICLR. https://arxiv.org/abs/2011.13456
[23] Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794. https://arxiv.org/abs/2105.05233
[24] Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. NeurIPS Workshop on Score-Based Methods. https://arxiv.org/abs/2207.12598
[25] Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022). Equivariant diffusion for molecule generation in 3D. Proceedings of ICML, 8867–8887. https://arxiv.org/abs/2203.17003
[26] Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., et al. (2023). De novo design of protein structure and function with RFdiffusion. Nature, 620(7976), 1089–1100. https://doi.org/10.1038/s41586-023-06415-8
[27] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models (Stable Diffusion). Proceedings of CVPR, 10684–10695. https://arxiv.org/abs/2112.10752
[28] Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., et al. (2024). Probabilistic weather forecasting with machine learning (GenCast). Nature, 636, 84–90. https://doi.org/10.1038/s41586-024-08252-9
[29] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. Proceedings of ICLR. https://arxiv.org/abs/1312.6114
[30] Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., et al. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2), 268–276. https://doi.org/10.1021/acscentsci.7b00572
[31] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial networks. Advances in Neural Information Processing Systems, 27. https://arxiv.org/abs/1406.2661
[32] De Cao, N., & Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models. https://arxiv.org/abs/1805.11973
[33] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. https://arxiv.org/abs/2108.07258
[34] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. Proceedings of NeurIPS, 35, 30318–30332. https://arxiv.org/abs/2208.07339
[35] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. Proceedings of ICLR. https://arxiv.org/abs/2210.17323
[36] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of ICLR. https://arxiv.org/abs/1412.6980
[37] Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization (AdamW). Proceedings of ICLR. https://arxiv.org/abs/1711.05101
[38] Stability AI. (2024). Stable Diffusion 3: Scaling rectified flow transformers for high-resolution image synthesis. https://stability.ai/stable-image
[39] Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w
[40] Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13(1), 4348. https://doi.org/10.1038/s41467-022-32007-7
[41] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision (CLIP). Proceedings of ICML, 8748–8763. https://arxiv.org/abs/2103.00020
[42] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045
[43] Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proceedings of ICML, 1050–1059. https://arxiv.org/abs/1506.02142
[44] Angelopoulos, A. N., & Bates, S. (2021). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511. https://arxiv.org/abs/2107.07511
Additional Resources
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. http://udlbook.github.io/udlbook/
- Tunstall, L., von Werra, L., & Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly Media.
- Foster, D. (2023). Generative Deep Learning (2nd ed.). O'Reilly Media. https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/
- Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning. https://www.manning.com/books/build-a-large-language-model-from-scratch
- Holderrieth, P., & Erives, E. (2025). Introduction to Flow Matching and Diffusion Models. MIT Course 6.S184. https://diffusion.csail.mit.edu/
- Liu, J.P. (2025). How to Build and Fine-tune a Small Language Model. Leanpub. https://leanpub.com/howtobuildandfine-tuneasmalllanguagemodel
Key Review Papers
- Yang, L., et al. (2023). Diffusion models: A comprehensive survey. ACM Computing Surveys, 56(4), 1–39.
- Bommasani, R., et al. (2022). On the opportunities and risks of foundation models. https://arxiv.org/abs/2108.07258
- Zhao, W. X., et al. (2023). A survey of large language models. ACM Computing Surveys. https://arxiv.org/abs/2303.18223
- Li, Z., et al. (2025). Flow matching meets biology and life science: A survey. https://arxiv.org/abs/2507.17731
This chapter provides the foundational knowledge needed to understand and apply generative AI in scientific contexts. For the latest developments, regularly check arXiv (cs.LG, cs.CL, q-bio sections) and major conference proceedings (NeurIPS, ICML, ICLR, CVPR).
Next → Chapter 3: Scientific Data & Workflows, navigating the unique challenges of scientific datasets and integrating AI into research pipelines.