Chapter 1. Generative AI: A New Frontier for Scientific Discovery
The New Frontier of Scientific Discovery
We stand at an extraordinary moment in the history of science. For centuries, the pace of discovery has been limited by human capacity: our ability to read literature, design experiments, analyze data, and synthesize knowledge. But a fundamental shift is underway. Generative Artificial Intelligence (AI) is not merely automating workflows; it is expanding what is scientifically possible [1].
Just as the microscope, the computer, and genome sequencing each transformed how we perceive nature, Generative AI now transforms our ability to understand and create scientific knowledge. AI systems design novel antibiotics by exploring chemical spaces that would take humans millennia to search [2, 3]. They predict protein structures with atomic precision, work recognized with the 2024 Nobel Prize in Chemistry [4, 5]. AI co-scientists generate hypotheses from vast textual corpora in days rather than the years previously required for iterative research [6, 7], identify patterns in climate and Earth system data that are difficult to capture with traditional methods [8–10], and propose entirely new materials and reaction pathways [3, 11].
This is not science fiction; it is the reality of Generative AI today [1].
The AI Revolution in Scientific Discovery
From Analysis to Generation
Scientific computing has evolved through clear stages:
| Era | Core Capability | Tools | Outcome |
|---|---|---|---|
| 1960s–1980s | Digitization | Databases | Data preservation |
| 1990s–2010s | Statistical & ML methods | Regression, SVMs, random forests | Pattern recognition |
| 2020s– | Generative AI | Transformers, diffusion models | Creation of new data & hypotheses |
Traditional scientific discovery has often followed a cycle of observe → hypothesize → experiment → analyze → repeat [12]. Generative AI compresses and accelerates this loop [1, 7] (a minimal sketch of such a loop follows the list below):
- Literature review: Synthesizes insights from millions of papers [6, 7].
- Hypothesis generation: Identifies non-obvious links across fields, with AI co-scientists reducing hypothesis generation from weeks to days [1, 7].
- Experimental design: Proposes thousands of candidate molecules or materials for testing [2, 3, 11].
- Data analysis: Finds structure in high-dimensional scientific datasets [8β10, 12].
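To make this compressed loop concrete, the sketch below implements a generic propose–score–select cycle in plain Python. The candidate generator, scoring function, and number of rounds are hypothetical placeholders standing in for a generative model, a simulation or experiment, and a stopping criterion; this is an illustration, not a description of any specific published system.

```python
import random

def propose_candidates(n):
    # Hypothetical stand-in for a generative model proposing candidates
    # (molecules, materials, hypotheses); here, random points in 2D.
    return [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(n)]

def score(candidate):
    # Hypothetical stand-in for a simulation or experiment; higher is better.
    x, y = candidate
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)

best = None
for round_idx in range(10):                       # "hypothesize, experiment, analyze, repeat"
    candidates = propose_candidates(100)          # generate a batch of designs
    top = max(candidates, key=score)              # evaluate and keep the best of the batch
    if best is None or score(top) > score(best):
        best = top                                # retain the best design found so far
print("best candidate:", best, "score:", round(score(best), 3))
```

In a real pipeline the proposal step would be a trained generative model and the scoring step a simulator or laboratory assay; the loop structure, however, stays essentially the same.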
The Acceleration Effect
The effect is both quantitative and qualitative [1]:
- Speed: AI approaches can shorten drug-discovery timelines from roughly 10–15 years to 3–5 years in some workflows, and AI-designed drug candidates have shown Phase I success rates of 80–90%, compared with traditional rates of 40–65% [2, 13].
- Scale: Virtual screening has been used to evaluate extensive compound libraries in silico for drug discovery, supporting efficient prioritization of candidates [3].
- Efficiency: Weather forecasts that once required hours on supercomputers can be generated in minutes with AI models like GenCast [9, 14].
More profoundly, AI enables scientists to explore vast combinatorial spaces and ask "what if?" at scales that would be infeasible with manual or purely equation-based approaches [1, 12].
What Makes Generative AI Different from Traditional ML?
Traditional Machine Learning: Pattern Recognition
Traditional machine learning (ML) excels at discriminative tasks such as classification, regression, and clustering: it learns mappings from inputs to outputs and identifies patterns in existing data [12].
Generative AI: Creating the New
Generative models go further: they learn the distribution underlying the data and can sample from it to create new, realistic instances [15, 16] (a minimal example follows the table below). This enables capabilities such as:
| Capability | Example | Outcome |
|---|---|---|
| Generation | Design a new molecule with desired properties | Novel compounds [2, 3] |
| Completion | Fill missing regions in a protein structure or sequence | Plausible biological variants [4, 5] |
| Translation | Text description → molecule, code, or figure | Cross-modal synthesis [1, 6] |
| Synthesis | Combine text, code, and data | New simulations, analyses, or visualizations [6] |
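As a concrete, minimal example of "learning the distribution and sampling from it," the sketch below fits a two-component Gaussian mixture to toy two-dimensional data and then draws new synthetic points. The data and model are illustrative only; real generative models for molecules or proteins are far richer, but the underlying idea is the same.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy "measurements": two clusters standing in for a real scientific dataset.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2)),
])

# A discriminative model would map inputs to labels; a generative model
# instead learns p(x) itself and can then sample new instances from it.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
new_points, _ = gmm.sample(5)   # five synthetic "observations" from the learned distribution
print(new_points)
```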
Core Technologies Powering Generative AI
1. Transformers and Large Language Models (LLMs)
Transformers introduced an attention-based architecture that models long-range dependencies and contextual relationships in sequences [15] (a minimal sketch of the attention operation follows the list below). Large Language Models (LLMs) built on this architecture can:
- Summarize and synthesize scientific literature
- Generate runnable analysis code and scripts
- Act as domain-aware assistants that interact with text, equations, and data [1, 6]
- Serve as AI co-scientists that generate novel hypotheses through multi-agent debate and evolution [7]
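The core computation behind this architecture is scaled dot-product attention: each position in a sequence forms a weighted combination of all other positions, with weights derived from query–key similarity. The NumPy sketch below shows that single operation in isolation, without multi-head projections, masking, or training; the sequence length and embedding size are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of queries, keys, and values.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # similarity between every pair of positions
    weights = softmax(scores, axis=-1)          # each row is a probability distribution
    return weights @ V                          # weighted mix of value vectors per position

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens with 8-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)     # self-attention: the sequence attends to itself
print(out.shape)                                # (4, 8)
```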
2. Diffusion Models
Diffusion models learn to reverse a gradual noising process to generate high-fidelity samples [16–18] (a minimal sketch of the forward noising step follows the list below). They are increasingly used for:
- Molecular and materials design [3, 11]
- Protein and structural biology tasks, including AlphaFold 3's prediction of biomolecular interactions [4, 5]
- Probabilistic weather forecasting with models like GenCast [9, 14]
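One way to make the "reverse a gradual noising process" idea tangible is to look at the forward process of a denoising diffusion model: clean data are progressively mixed with Gaussian noise according to a schedule, and a network is trained to predict the noise that was added. The NumPy sketch below shows only this forward step and the target the network would learn to recover, following the formulation in [16]; the schedule, dimensions, and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)        # cumulative fraction of signal retained at each step

def noise_sample(x0, t):
    """Forward process: produce the noised sample x_t from clean data x0 at step t."""
    eps = rng.normal(size=x0.shape)                                     # Gaussian noise
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps                                                      # eps is the training target

x0 = rng.normal(size=(16,))      # toy "clean" sample standing in for a molecule, field, or image
xt, eps = noise_sample(x0, t=500)
# Training minimizes || eps_theta(x_t, t) - eps ||^2 over many (x0, t) pairs;
# generation then applies the learned denoiser step by step, starting from pure noise.
print(xt[:4], eps[:4])
```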
3. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) provide earlier but still important frameworks for generative modeling [19, 20] (a minimal VAE sketch follows the list below). They support:
- Learning low-dimensional latent representations of complex systems
- Creating synthetic datasets to augment limited experimental data
- Constructing surrogate models for expensive simulations
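The sketch below is a minimal PyTorch version of the VAE idea: encode data into a latent distribution, sample from it with the reparameterization trick, decode back, and penalize both reconstruction error and divergence from a standard-normal prior. The single-layer encoder and decoder, the dimensions, and the random "data" are placeholders; a real scientific VAE would use a domain-appropriate architecture and dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: map data to a low-dimensional latent space and back."""
    def __init__(self, data_dim=32, latent_dim=4):
        super().__init__()
        self.encoder = nn.Linear(data_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.decoder = nn.Linear(latent_dim, data_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                    # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())     # KL divergence to the prior
    return recon + kl

model = TinyVAE()
x = torch.randn(8, 32)                    # toy batch standing in for real measurements
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()                           # an optimizer step would follow in actual training
print(float(loss))
```

Generating new instances afterwards is simply a matter of drawing latent vectors from the prior and passing them through the decoder.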
The Pre-Training Revolution
Modern generative AI thrives on pre-training: models are first trained on large, generic corpora (text, code, images, or scientific data) and then fine-tuned on specific tasks or domains [6]. This strategy is particularly valuable in science, where high-quality labeled data are often scarce or costly to obtain.
A language model pre-trained on broad text can be adapted to domains such as biomedicine, materials science, or geosciences. Protein models pre-trained on millions of sequences can specialize in a narrow protein family with comparatively small task-specific datasets [4, 5].
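Mechanically, fine-tuning often amounts to freezing (or lightly updating) the pre-trained backbone and training a small task-specific head on the new domain's data. The PyTorch sketch below uses a randomly initialized `backbone` as a stand-in for a real pre-trained model, so it demonstrates only the freeze-and-adapt pattern, not an actual checkpoint or library API.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (in practice: a protein language model,
# a chemistry model, or an LLM loaded from a published checkpoint).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False                       # freeze the pre-trained weights

head = nn.Linear(256, 1)                          # small task-specific head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy fine-tuning data: a handful of labeled examples from the target domain.
x = torch.randn(64, 128)
y = torch.randn(64, 1)

for epoch in range(5):
    features = backbone(x)                        # reuse the general-purpose representation
    loss = loss_fn(head(features), y)             # fit only the head to the new task
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```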
Generative AI Across Scientific Disciplines
Physical and Chemical Sciences
In the physical and chemical sciences, generative AI is already reshaping how we search, design, and reason about complex systems:
- Materials Design: Exploring enormous chemical spaces to propose new superconductors, catalysts, or battery materials with targeted properties [3, 11].
- Reaction and Synthesis Planning: Suggesting reaction products and retrosynthetic routes, accelerating the path from concept to synthesis [3].
- Mathematical Reasoning: AlphaProof and AlphaGeometry 2 achieved silver-medal level performance at the 2024 International Mathematical Olympiad, and an advanced Gemini model achieved gold-medal standard in 2025 [21, 22].
Chemistry and Drug Discovery
In chemistry and drug discovery, generative models help to (see the screening sketch after this list):
- Propose novel antibiotics and therapeutics with desired efficacy and safety profiles [2].
- Optimize molecules for multiple objectives (binding, solubility, toxicity, synthesizability) [3, 11].
- Accelerate clinical progress: AI-assisted drug candidates achieved Phase I success rates of nearly 90%, compared with industry averages of 40–65% [13].
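Before any generated molecule reaches a laboratory, candidates are usually filtered with fast computed properties. The sketch below uses RDKit to screen a few hand-picked SMILES strings by molecular weight and LogP; the candidate list and the Lipinski-style cutoffs are arbitrary illustrations rather than a real discovery pipeline.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical candidates "proposed by a generative model" (here: well-known molecules).
candidate_smiles = [
    "CCO",                               # ethanol
    "CC(=O)OC1=CC=CC=C1C(=O)O",          # aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",      # caffeine
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",     # ibuprofen
]

def passes_filter(mol, max_weight=500.0, max_logp=5.0):
    # Simple drug-likeness cutoffs; real screens apply many more criteria.
    return Descriptors.MolWt(mol) <= max_weight and Descriptors.MolLogP(mol) <= max_logp

for smi in candidate_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None and passes_filter(mol):
        print(f"{smi}: MW={Descriptors.MolWt(mol):.1f}, LogP={Descriptors.MolLogP(mol):.2f}")
```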
Life and Environmental Sciences
- Protein Engineering: AlphaFold 3 predicts the structure and interactions of proteins, DNA, RNA, ligands, and ions with unprecedented accuracy, with reported improvements of at least 50% over existing methods for protein interactions with other molecule types [5].
- Genomics and Systems Biology: Modeling sequenceβfunction relationships and generating candidate variants for experimental follow-up [1].
- Medical AI: Supporting risk prediction, trial design, and treatment-effect modeling when combined with causal and statistical frameworks [1, 12].
Climate and Earth System Science
- Weather Forecasting: GenCast, a diffusion-based probabilistic model, outperforms the world's leading operational ensemble forecast (ECMWF ENS) on 97% of evaluated targets while delivering 15-day forecasts in about 8 minutes [9, 14].
- Earth System Foundation Models: Large foundation models that integrate atmosphere, ocean, land, and cryosphere processes, such as Microsoft's Aurora and Google DeepMind's WeatherNext [10].
- Surrogate Modeling: Physics-informed and data-driven surrogates that emulate expensive numerical models at a fraction of the computational cost [8β10, 23].
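Surrogate modeling follows a common recipe: run the expensive model a limited number of times, then train a cheap learner to approximate its input–output map. In the sketch below the "expensive simulation" is a made-up analytic function and the surrogate is a small scikit-learn neural network; both are placeholders for a real simulator and a real architecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_simulation(x):
    # Placeholder for a costly numerical model (imagine hours of compute per run).
    return np.sin(3.0 * x[:, 0]) * np.exp(-x[:, 1] ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(500, 2))      # a limited budget of simulator runs
y_train = expensive_simulation(X_train)

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_train, y_train)                       # train the cheap emulator once

X_new = rng.uniform(-1.0, 1.0, size=(5, 2))
print(surrogate.predict(X_new))                       # emulator output in milliseconds
print(expensive_simulation(X_new))                    # reference values for comparison
```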
Mathematical Foundations and Methods
Generative AI is also driving progress in the mathematics and methodology of scientific machine learning:
- Data-Driven Dynamical Systems: Learning governing dynamics and control laws directly from measurements or simulations [12].
- Physics-Informed Neural Networks (PINNs): Incorporating partial differential equations and other physical constraints directly into neural network training [24, 25] (see the sketch after this list).
- Uncertainty Quantification: Developing methods to quantify confidence and propagate uncertainty through AI-based scientific pipelines [25].
- Formal Mathematical Reasoning: AlphaProof combines reinforcement learning with formal theorem proving in Lean to generate verifiable mathematical proofs [21].
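The sketch below shows the PINN idea at its smallest scale, using PyTorch and a toy ordinary differential equation, du/dt = -u with u(0) = 1, instead of a full PDE: the network's automatic derivatives are penalized until they satisfy both the governing equation and the boundary condition. The network size, learning rate, and number of training steps are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Minimal PINN for the ODE du/dt = -u with u(0) = 1 (exact solution: u(t) = exp(-t)).
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

t_interior = torch.linspace(0.0, 2.0, 100).reshape(-1, 1).requires_grad_(True)
t0 = torch.zeros(1, 1)                                 # location of the initial condition

for step in range(2000):
    u = net(t_interior)
    # Differentiate the network output with respect to its input to form the ODE residual.
    du_dt = torch.autograd.grad(u, t_interior, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    physics_loss = ((du_dt + u) ** 2).mean()           # enforce du/dt + u = 0 in the domain
    boundary_loss = ((net(t0) - 1.0) ** 2).mean()      # enforce u(0) = 1
    loss = physics_loss + boundary_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(net(torch.tensor([[1.0]])).item())   # should be close to exp(-1) ≈ 0.368
```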

Figure 1.1. AI for Science: A New Era of Discovery. Traditional scientific discovery (left) is limited by human capacity, with drug development taking 10–15 years and early clinical trial success rates of 40–65%. Generative AI (right) is transforming this landscape by reducing drug discovery timelines to 3–5 years, improving clinical trial success rates to 80–90%, enabling novel molecule and materials design, predicting protein structures with unprecedented accuracy (AlphaFold 3), and accelerating weather forecasts from hours to minutes (GenCast). (Created by Google NotebookLM)
Cross-Cutting Capabilities
Several techniques and ideas cut across scientific domains:
- Physics-Informed Models: Embedding conservation laws, symmetries, and PDEs in neural architectures to ensure physically consistent predictions [24, 25].
- Latent-Variable and Generative Representations: Using VAEs, GANs, and diffusion models to learn compact representations of high-dimensional systems [16–20].
- Multimodal Integration: Combining text, images, time series, and spatial fields to build holistic models of complex phenomena [1, 6].
- Multi-Agent AI Systems: Frameworks like Google's AI co-scientist use specialized agents for hypothesis generation, debate, and evolution [7].
A New Scientific Partner
Generative AI is not replacing scientists; it is becoming a new kind of collaborator [1]. It can:
- Expand the hypothesis space far beyond what any individual or team could enumerate
- Accelerate cycles of modeling, simulation, and experiment
- Enable researchers with limited computational or institutional resources to use state-of-the-art tools
- Bridge previously separate fields by revealing shared patterns in their data and models
- Recapitulate, in days, discoveries that previously required years of iterative research [7]
The most impactful discoveries are likely to emerge where human insight and machine-generated ideas interact in a tight loop.
The Path Forward
The rest of this book will help you:
- Understand the core architectures (transformers, diffusion models, scientific ML) [15–18, 24, 25].
- Adapt and fine-tune models to your own scientific domain [2–5, 11].
- Evaluate and validate models with appropriate statistical and physical checks [12, 23, 25].
- Design humanβAI workflows in which AI accelerates, rather than distorts, the scientific process [1].
The revolution is already underway. In my view, scientists who understand and apply generative AI will stand at the forefront of discovery in every field [1].
References and Further Reading
1. Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., et al. (2023). Scientific discovery in the age of artificial intelligence. Nature, 620(7972), 47–60. https://doi.org/10.1038/s41586-023-06221-2
2. Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., et al. (2020). A deep learning approach to antibiotic discovery. Cell, 180(4), 688–702. https://doi.org/10.1016/j.cell.2020.01.021
3. Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., et al. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2), 268–276. https://doi.org/10.1021/acscentsci.7b00572
4. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2
5. Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w
6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://arxiv.org/abs/2005.14165
7. Gottweis, J., Weng, W.-H., Daryin, A., Liu, Y., Ceze, L., Natarajan, V., et al. (2025). Towards an AI co-scientist. arXiv preprint arXiv:2502.18864. https://arxiv.org/abs/2502.18864
8. Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., & Prabhat. (2019). Deep learning and process understanding for data-driven Earth system science. Nature, 566(7743), 195–204. https://doi.org/10.1038/s41586-019-0912-1
9. Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., et al. (2024). Probabilistic weather forecasting with machine learning. Nature, 636, 84–90. https://doi.org/10.1038/s41586-024-08252-9
10. Bodnar, C., Bruinsma, W. P., Lucic, A., Stanley, M., Allen, A., Brandstetter, J., et al. (2024). Aurora: A foundation model of the atmosphere. arXiv preprint arXiv:2405.13063. https://arxiv.org/abs/2405.13063
11. Angello, N. H., Friday, D. M., Hwang, C., Yi, S., Cheng, A. H., Torres-Flores, T. C., et al. (2024). Closed-loop transfer enables artificial intelligence to yield chemical knowledge. Nature, 633(8029), 351–358. https://doi.org/10.1038/s41586-024-07021-y
12. Brunton, S. L., & Kutz, J. N. (2022). Data-driven science and engineering: Machine learning, dynamical systems, and control (2nd ed.). Cambridge University Press. https://doi.org/10.1017/9781108380690
13. AI in Drug Discovery (2024). Industry analysis reports indicate AI-assisted drug candidates achieving Phase I success rates of nearly 90%, compared with industry averages of 40–65%. Drug Discovery News, October 2025. https://pubmed.ncbi.nlm.nih.gov/38692505
14. Google DeepMind (2024). GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy. https://deepmind.google/blog/gencast-predicts-weather-and-the-risks-of-extreme-conditions-with-sota-accuracy/
15. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
16. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851. https://arxiv.org/abs/2006.11239
17. Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794. https://arxiv.org/abs/2105.05233
18. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. Proceedings of ICLR. https://arxiv.org/abs/2011.13456
19. Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. Proceedings of ICLR. https://arxiv.org/abs/1312.6114
20. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial networks. Advances in Neural Information Processing Systems, 27. https://arxiv.org/abs/1406.2661
21. Google DeepMind (2024). AI achieves silver-medal standard solving International Mathematical Olympiad problems. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/
22. Google DeepMind (2025). Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad. https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/
23. Carleo, G., Cirac, I., Cranmer, K., Daudet, L., Schuld, M., Tishby, N., et al. (2019). Machine learning and the physical sciences. Reviews of Modern Physics, 91(4), 045002. https://doi.org/10.1103/RevModPhys.91.045002
24. Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045
25. Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., & Yang, L. (2021). Physics-informed machine learning. Nature Reviews Physics, 3(6), 422–440. https://doi.org/10.1038/s42254-021-00314-5
Next: Chapter 2 introduces the core architectures powering generative AI.