Haplotype-based analysis methods for population genomics
This project will develop a series of computational tools that exploit the power of haplotype-based models for the analysis of population genomics data. The development of such tools is particularly important because advances in sequencing have made it routine to gather sequence data across full chromosomes. The multi-locus patterns of linkage disequilibrium present in haplotype data are informative about a range of important processes in population genetics. Leveraging the information in haplotypes is methodologically challenging, and for many specific problems the appropriate analysis tools do not yet exist. In response, our research will develop haplotype-based models in four major directions. First, we will develop haplotype-based models to infer recombination rates using genetic data from admixed individuals. The key principle is that ancestry switch points in admixed individuals can be used to infer recent recombination events. Our work will produce a software package for inference of recombination rates based on genome-wide single-nucleotide polymorphism data, and a separate simulation package for generating data with which to test the method. A key innovation will be developing and testing a version of this approach that can handle multi-way admixtures (more than two source populations). Second, we will use haplotype-informed approaches to improve the power of complex trait mapping approaches based on the evolve-and-resequence paradigm. The improvement in power will come from using haplotype information embedded in the raw read data from pooled sequencing experiments. Again, we will develop both inference software and simulations to test the inference methods. Third, we will investigate to what extent purifying selection has shaped haplotype diversity in human populations. The expectation is that segregating deleterious variants will show reduced haplotype diversity, much as adaptive variants do. This signature has been largely unexplored, and we will develop theoretical, empirical, and simulation-based approaches to establish whether this property exists and how it can be used to infer the strength of purifying selection in human population genetic data. Finally, we will derive a novel form of the conditional sampling distribution (CSD) for a haplotype. The application of CSDs in population genetics has been very fruitful, even though the approach is in its infancy. We will develop an approach that leads to a more accurate CSD. The new CSD will also open the door to extensions for computing haplotype probabilities in models with non-equilibrium demography and/or population structure. Throughout the project there will be an emphasis on software development for the broader population genomics community, and on overcoming computational and algorithmic challenges that commonly arise with haplotype-based models. These contributions are essential for moving population genetics forward into the genomic era.
Project Relevance
This project will contribute to the basic toolkit that population geneticists use to extract information from large genomic datasets and will enhance research in a number of applied areas with practical relevance. In particular, we will develop tools that empower researchers to measure recombination, map complex traits, and understand the fitness consequences of human genetic variation. These areas are relevant to disease trait mapping, genetic disease etiology, and historical demography. Finally, we expect that the algorithms developed will be useful, either directly or with minor adjustment, for closely related problems beyond those detailed in the project. As an example, our algorithms for haplotype frequency estimation in pooled sequences are closely related to the problem of estimating the abundance of pathogenic strains from sequencing of blood DNA.