Scaling graph neural networks for molecular property prediction
Title: Scaling graph neural networks for molecular property prediction
DNr: Berzelius-2026-201
Project Type: LiU Berzelius
Principal Investigator: Ola Spjuth <ola.spjuth@uu.se>
Affiliation: Uppsala universitet
Duration: 2026-06-26 – 2027-01-01
Classification: 10210
Keywords:

Abstract

Quantitative structure-activity relationship (QSAR) modelling, i.e. predicting biological and physicochemical properties of small molecules from their structure, underpins much of computational drug discovery. Despite considerable interest in deep learning for QSAR, classical methods such as random forests and gradient boosting on molecular fingerprints remain highly competitive on the small labelled datasets typical of the field, and often outperform graph neural networks (GNNs) trained from scratch. The most promising route to closing this gap is large-scale self-supervised pretraining on unlabelled molecular corpora, followed by transfer of the learned representations to downstream tasks with limited supervision. This strategy has reshaped natural language processing and computer vision but is far less mature in molecular machine learning (ML).
This project bridges large-scale machine learning with cellular dynamics and automated screening by providing a foundation GNN that learns robust molecular representations, directly supporting the computational modelling of complex drug combinations and synergy mechanisms across our projects in data-driven pharmaceutical bioinformatics. Of particular interest is using the learned representations to model and guide image-based experiments that study cell dynamics under chemical perturbation.
The project will pretrain graph neural networks at scale on the ZINC database (30M–300M molecules) using a self-distillation framework adapted from DINO (Caron et al., 2021), in which a student network is trained to match a teacher's representations of differently augmented views of the same molecule, with the teacher updated as an exponential moving average of the student. DINO-style pretraining has produced state-of-the-art representations in vision without labels or contrastive negatives, but its adaptation to molecular graphs remains underexplored.
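As a rough illustration of the self-distillation scheme described above, the following minimal sketch shows the three ingredients of a DINO-style update in plain Python: a cross-entropy loss between a centred, sharpened teacher distribution and the student distribution, an exponential-moving-average (EMA) update of the teacher, and an EMA update of the centre. Plain lists of floats stand in for GNN outputs and parameters, and the function names, temperatures and momentum values are illustrative assumptions, not the project's actual settings or code.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for one pair of augmented views.

    The teacher output is centred and sharpened (low temperature); the
    temperature values here are assumed for illustration only.
    """
    centred = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centred, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher parameters become an EMA of the student parameters."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_batch, momentum=0.9):
    """Centre becomes an EMA of the batch mean of teacher outputs."""
    n = len(teacher_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]


if __name__ == "__main__":
    # Hypothetical 3-dimensional outputs for two views of one molecule.
    center = [0.0, 0.0, 0.0]
    teacher_out = [2.0, 0.0, 0.0]
    student_out = [1.5, 0.2, 0.1]
    print("loss:", dino_loss(student_out, teacher_out, center))
    print("teacher:", ema_update([1.0, 0.0], [0.0, 1.0]))
    print("center:", update_center(center, [teacher_out, [0.0, 1.0, 0.0]]))
```

In a real training loop only the student would receive gradients from this loss; the teacher is changed solely through the EMA step, which is what distinguishes the scheme from contrastive methods that need negative pairs.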
We will investigate molecular-graph augmentation strategies (subgraph masking, atom and bond perturbations, alternative graph permutations), model scaling behaviour, and the empirical determinants of downstream QSAR performance. Models will be evaluated on standard benchmarks from MoleculeNet under strict scaffold-split protocols with multi-seed confidence intervals. They will be benchmarked against strong classical baselines (random forest and gradient boosting on ECFP and physicochemical descriptors) so that deep learning is held to a meaningful standard, with particular attention to low-data and out-of-distribution regimes. Preliminary experiments by a master's student in the group, conducted on UPPMAX Pelle (NVIDIA T4) with a 9M-molecule ZINC subset, have established the training pipeline and baseline architecture; Berzelius access is required to scale to the model sizes and dataset volumes needed for the empirical study. Expected outcomes are a publicly released foundation GNN distributed via Hugging Face Hub with a permanent Zenodo DOI, a peer-reviewed publication on self-distillation for molecular representation learning, and reproducible code with an evaluation harness to support future work in the Swedish ML-for-chemistry community.
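The scaffold-split, multi-seed evaluation protocol mentioned above could look roughly like the sketch below. It assumes that a scaffold string (for example a Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit) is already available for every molecule, so no chemistry library is needed here. The function names, split fractions and example scores are hypothetical placeholders, not project code or results.

```python
import math
import statistics
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold spans two subsets.

    `scaffolds` holds one precomputed scaffold string per molecule
    (an assumption of this sketch). Larger scaffold groups are placed
    first, so rare scaffolds tend to land in validation and test.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Sort by descending group size; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def mean_ci(scores, t_crit):
    """Mean and two-sided confidence interval across seeds.

    `t_crit` is the Student-t critical value for len(scores) - 1 degrees
    of freedom, e.g. 2.776 for a 95% interval with five seeds.
    """
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Toy scaffold labels standing in for real scaffold SMILES.
    scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
    train, valid, test = scaffold_split(scaffolds)
    print("train:", train, "valid:", valid, "test:", test)
    # Placeholder per-seed scores, for illustration only (not results).
    print(mean_ci([0.80, 0.82, 0.78, 0.81, 0.79], t_crit=2.776))
```

Because whole scaffold groups are assigned to a single subset, test molecules are structurally distinct from the training molecules. The same indices would be reused for the GNN and for the classical baselines so that the comparison is like for like.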