Foundation Models for Data-driven Human Cell Simulation
Title: Foundation Models for Data-driven Human Cell Simulation
DNr: Berzelius-2025-23
Project Type: LiU Berzelius
Principal Investigator: Wei Ouyang <weio@kth.se>
Affiliation: Kungliga Tekniska högskolan
Duration: 2025-02-01 – 2025-08-01
Classification: 10610
Homepage: https://aicell.io/project/human-cell-simulator/
Keywords:

Abstract

The human cell is extraordinarily intricate, governed by a complex network of molecular and biological processes that underlie both normal physiology and disease. Building a comprehensive “whole-cell model” capable of predicting cellular behavior under various conditions remains an aspirational but critically important goal for biomedical research and therapeutic development. Inspired by recent breakthroughs in AI-driven modeling, from AlphaFold’s success in protein structure prediction to the emerging paradigm of diffusion-based generative models, our project seeks to create large-scale foundation models that accurately capture and simulate key aspects of human cells.
In the first phases of this project, we focused on building the essential computational framework and validating our approach on smaller data subsets. We integrated Ray-based autoscaling within the Berzelius cluster, refined streaming data loaders for handling multi-terabyte image datasets, and developed and tested preliminary Variational Autoencoders (VAEs) for reconstruction tasks. These steps showed us how to optimize high-performance computing pipelines and identified the innovations needed for efficient, large-scale training, particularly for diffusion-based generative models.
We are now poised to scale up substantially. Our central objective is to train diffusion models on up to 100 terabytes of image data, leveraging optimizations such as latent precomputation, fully sharded data parallelism, and advanced attention mechanisms. We build on earlier large-scale diffusion studies, notably MosaicML’s demonstration of large-scale Stable Diffusion training, and aim to incorporate their best practices while adapting them to the specific challenges of biological imaging.
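As a rough illustration of why latent precomputation matters at this data volume, the sketch below compares the uncompressed size of one image with the size of its cached latent. Every number in it is an assumption chosen for illustration (a 512 x 512, 3-channel, 8-bit image; an encoder with 8x spatial downsampling and 4 latent channels stored as float16); none of these values come from the project itself, and the helper names are hypothetical.

```python
# Illustrative storage arithmetic for latent precomputation.
# All shapes and dtypes below are assumptions, not project values.


def bytes_per_sample(height, width, channels, bytes_per_value):
    """Uncompressed size in bytes of one dense array."""
    return height * width * channels * bytes_per_value


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid for an encoder with the given spatial downsampling factor."""
    return height // downsample, width // downsample, latent_channels


# Hypothetical input: a 512 x 512, 3-channel, 8-bit image.
image_bytes = bytes_per_sample(512, 512, 3, 1)

# Hypothetical latent: 8x spatial downsampling, 4 channels, float16 (2 bytes).
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_bytes = bytes_per_sample(lat_h, lat_w, lat_c, 2)

reduction = image_bytes / latent_bytes
print(f"image: {image_bytes} B, latent: {latent_bytes} B, reduction: {reduction:.0f}x")
```

Under these assumed shapes the cached latent is considerably smaller than the raw pixel array, and the encoder forward pass is paid once rather than on every epoch. Actual savings on disk would depend on how the source images are compressed and on the real encoder configuration.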
Once trained, these diffusion models could enable previously unattainable insights into cellular dynamics, offering researchers an opportunity to “experiment” virtually with cellular states in a manner that complements or even supersedes traditional lab-based methods. Achieving these aims will not only yield new methodological frameworks for generative biology but could also have significant ripple effects across systems biology, drug discovery, and personalized medicine. By showcasing how AI-driven models can handle image datasets of up to 100 terabytes and deliver biologically meaningful predictions, this project stands to further the ambition of true in-silico experimentation. In addition, our open-source model repositories and data pipelines will lower the barriers to entry for other researchers interested in large-scale, AI-assisted cell modeling, fostering a collaborative ecosystem for continued innovation.
In sum, our proposal sets out to leverage Berzelius’s top-tier GPU resources to advance the frontier of biologically driven AI. With a robust, validated infrastructure now in place and a clear scientific roadmap ahead, we believe the project is well positioned to push the boundaries of what is computationally and scientifically possible in human cell simulation.