Optimizing species tree estimation in the presence of ILS and migration using DENIM with simulated sequences
DNr: NAISS 2024/23-366
Project Type: NAISS Small Storage
Principal Investigator: Bengt Oxelman <bengt.oxelman@bioenv.gu.se>
Affiliation: Göteborgs universitet
Duration: 2024-06-01 – 2024-11-01
Classification: 10610


Note: The start of this project was delayed, but it is now up and running well, and it is anticipated that the work will continue with "medium"-size demands for another year.
This research is concerned with optimizing the computational efficiency and accuracy with which evolutionary relationships are resolved using the multispecies coalescent (MSC)-based method DENIM (Jones, 2019), as implemented in the Bayesian phylogenetic software BEAST 2 (Bouckaert et al., 2019). Like several other recently developed isolation-with-migration and MSC-with-introgression methods (IMa3: Hey, 2010; Hey et al., 2018; AIM: Muller et al., 2018; Muller et al., 2021; PhyloNet: Wen & Nakhleh, 2018), DENIM accounts for two sources of discordance among gene trees when estimating the species tree: incomplete lineage sorting (ILS) and migration. Explicitly modelling post-speciation gene flow as part of the evolutionary process minimizes the biases that affect purely MSC-based inference (Leaché et al., 2014), but requires the estimation of additional migration parameters. Reliable parameter estimation, in turn, requires genome-scale data, such as those generated by massively parallel sequencing. The net result is a more richly parameterized model applied to a large number (hundreds to thousands) of putatively unlinked loci, and analyses may take weeks or even months to reach acceptable convergence.
Our objectives in this work are to (i) serially subsample a simulated sequence data set to assess the extent to which adding loci yields appreciable returns in phylogenetic resolution, (ii) assess the extent to which restricting the timing and directionality of migration (using a flexible DENIM migration model) reduces computation time without compromising the integrity of tree estimation, and (iii) explore, by pruning accessions from the simulated molecular data set, the effects of migration between extant taxa and unsampled or extinct taxa (“ghost lineages”), which the birth-death model (Heled & Drummond, 2010) does not accommodate. Our work entails many runs of the DENIM model on molecular data sets of varying size and under varying migration parameter settings. The proposed work falls within a rapidly developing area of research in biological systematics. Insights provided by our work will inform project design in molecular phylogenetics, and will also help inform best practices for fully parameterized Bayesian tree inference with migration.
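As an illustration of the serial subsampling in objective (i), the following is a minimal sketch of one way to draw locus subsets of increasing size. It is not the project's actual pipeline: the function name `nested_locus_subsets`, the locus names, the subset sizes, and the choice to make the subsets nested (each smaller subset contained in every larger one) are all assumptions made for the example.

```python
import random


def nested_locus_subsets(loci, sizes, seed=0):
    """Return a dict mapping each requested size to a random subset of loci.

    Assumption for this sketch: subsets are nested, so any change between
    sizes reflects added loci rather than a different random draw.
    """
    rng = random.Random(seed)      # fixed seed so the draw is reproducible
    order = list(loci)
    rng.shuffle(order)             # one random ordering of all loci
    # Each subset is a prefix of that ordering; sizes above len(loci) are skipped.
    return {n: order[:n] for n in sorted(sizes) if n <= len(order)}


# Hypothetical locus names and subset sizes, for illustration only.
loci = [f"locus_{i:04d}" for i in range(1, 1001)]
subsets = nested_locus_subsets(loci, sizes=[50, 100, 250, 500, 1000])

for n, chosen in subsets.items():
    print(n, len(chosen))
```

Each subset would then be analysed in a separate run, so that resolution and run time can be compared across the numbers of loci.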