Genetic landscapes | Flegontov lab

Genetic landscapes are manifolds obscured by poor sampling and data pruning: towards an optimal protocol for visualization of population structure

presented at MCEB 2025 in Granada, Spain

Principal component analysis (PCA) and related methods such as classical multidimensional scaling are routinely used as first-line analyses in population genetic studies. A fundamental problem of PCA is that the signal is spread over s–1 or v–1 dimensions, whichever is smaller (where s and v are the counts of samples and variables, respectively), yet only two- or, rarely, three-dimensional PC spaces are visualized and interpreted in the population genetic literature. At the same time, the ambient genetic space is generated by a genealogical process acting in approximately two-dimensional geographic space and one-dimensional time. Given these considerations, it is intuitively attractive to apply manifold learning methods that allow embedding into spaces of low and explicitly specified dimensionality. However, these methods are usually non-deterministic and have very large spaces of algorithm settings that are rarely explored extensively.
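The dimensionality argument can be illustrated with a minimal sketch for the case typical of genotype data, where samples are far fewer than variables (s < v). The genotype matrix below is synthetic and random, and the function name `pca_signal_spread` is hypothetical; the sketch only shows that the centered matrix has s–1 non-trivial components and that the first two carry a limited share of the total variance.

```python
import numpy as np

def pca_signal_spread(n_samples, n_variables, seed=0):
    """Return (number of non-trivial PCs, variance share of the first two PCs)
    for a synthetic genotype matrix coded as 0/1/2 allele counts."""
    rng = np.random.default_rng(seed)
    # Assumption: random, unstructured genotypes; real data would have structure.
    genotypes = rng.integers(0, 3, size=(n_samples, n_variables)).astype(float)
    # Center each variable (column) across samples, as PCA does.
    centered = genotypes - genotypes.mean(axis=0)
    # Centering removes one degree of freedom, so the rank is s - 1 when s < v.
    n_components = int(np.linalg.matrix_rank(centered))
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    top_two_share = float(variance[:2].sum() / variance.sum())
    return n_components, top_two_share

n_components, top_two_share = pca_signal_spread(n_samples=20, n_variables=500)
print(n_components, round(top_two_share, 3))
```

With 20 samples, 19 components carry variance, so a two-dimensional plot necessarily discards whatever is encoded in the remaining 17.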

We demonstrate that various types of PCA algorithms (with different approaches to data pre-processing, with or without projection) and PCA interpretation approaches popular in population genetics are not robust to sparse sampling (random or clustered) of two-dimensional isolation-by-distance or isolation-by-resistance landscapes, to removal of rare alleles, or to linkage disequilibrium (LD) pruning. In other words, a correlation between geographic distances and distances in the low-dimensional PC space is typical for genetic landscapes, but it is easily eroded by these factors. We show that manifold learning methods such as UMAP, PHATE and others recover both global and local structure from such poorly interpretable PC spaces. However, this is achievable only if a suitable objective function is available, so that the parameters of these algorithms can be optimized thoroughly. We analyze several case studies on large archaeogenetic datasets and conclude that it is dangerous to over-interpret PC spaces, since in practice we very often deal with poorly sampled landscapes and have no access to rare genetic variation.
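The correlation between geographic distances and distances in a low-dimensional space can be computed as in the sketch below. This is only one plausible score of this kind, not necessarily the objective function used in the study; the function names are hypothetical, and the coordinates are assumed to be planar, so Euclidean distance stands in for geographic distance (latitude/longitude data would need great-circle distances instead).

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances for all unordered pairs of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    return dist[upper]

def geo_embedding_correlation(geo_coords, embedding):
    """Pearson correlation between geographic and embedding-space distances."""
    geo_dist = pairwise_distances(np.asarray(geo_coords, dtype=float))
    emb_dist = pairwise_distances(np.asarray(embedding, dtype=float))
    return float(np.corrcoef(geo_dist, emb_dist)[0, 1])

rng = np.random.default_rng(0)
geo = rng.uniform(0, 10, size=(50, 2))  # synthetic sampling locations

# An embedding that is a rotated and rescaled copy of the map preserves
# all distance ratios, so the score is 1 up to rounding error.
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
faithful = 3.0 * geo @ rotation.T

# Assigning the same points to the wrong samples destroys the relationship.
scrambled = rng.permutation(geo)

score_faithful = geo_embedding_correlation(geo, faithful)
score_scrambled = geo_embedding_correlation(geo, scrambled)
print(round(score_faithful, 3), round(score_scrambled, 3))
```

A score of this kind is invariant to rotation and uniform scaling of the embedding, which makes it usable for comparing PC spaces with UMAP or PHATE embeddings across parameter settings.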