Exploring Grokking using the LEMBAS Framework
Abstract
In this project we investigate whether grokking occurs when a model is trained on synthetic biological data. Grokking is a phenomenon in model training where generalization emerges only after many iterations in which seemingly little progress is made (https://en.wikipedia.org/wiki/Grokking_(machine_learning)). The project aims to determine whether grokking can be observed in a domain far more complex than the one in which it was originally discovered, specifically synthetic biological data. Such an observation would be a novel finding for the machine learning community: it would show the need for a larger collective effort to understand the fundamentals of grokking, and it would provide valuable insight into how models applied to biological data could be trained.
The study will use the LEMBAS framework [1], a deep learning model of cellular signaling that has proven useful for predicting transcriptional responses to ligand stimulation and drugs, and for generating biologically interpretable predictions.
The synthetic data to be used were generated from a specific parametrization of the LEMBAS model, designed to reproduce distributional behavior similar to that of a biological system and to capture many, but not all, aspects of such systems.
Preliminary local experiments are promising, but much larger computational resources are needed to test the approach at full scale.
[1] Nilsson, A., Peters, J.M., Meimetis, N. et al. Artificial neural networks enable genome-scale simulations of intracellular signaling. Nat Commun 13, 3069 (2022). https://doi.org/10.1038/s41467-022-30684-y