Recent identity by descent in human genetic data - methods and applications
The thesis describes algorithms for detecting regions of recent identity by descent (IBD) from human genetic data and their applications in optimising resequencing studies, genomic predictions and detecting Mendelian subtypes of diseases. Firstly, we describe the algorithm ANCHAP, which scans pairs of multi-point SNP genotypes for IBD sharing of long haplotypes. A comparison with other methods shows that ANCHAP outperforms them in terms of speed or accuracy. We demonstrate the algorithm on data from population isolates (Orcades and Croatian islands) and from a population of unrelated individuals. We compare the abundance of IBD segments between cohorts, and identify genetic regions where IBD is most common. Secondly, we verify the IBD regions detected from array data against exome sequence data. We estimate that where IBD sharing between a pair of individuals is inferred, it is confirmed by exome data in 98% of cases. Correctness of IBD detection varies with the settings of ANCHAP, the length of IBD segments, and position with respect to segment endpoints. We find that with a sample of 1000 individuals from an isolated population genotyped using a dense SNP array, and with 20% of these individuals sequenced, 65% of the sequences of the un-sequenced subjects can be partially inferred. Implementing such resequencing strategies requires an IBD-based imputation algorithm, which is outlined. Thirdly, we use recent IBD to detect carriers of Mendelian subtypes of colon cancer. We show this with the example of Lynch syndrome, which accounts for about 3% of colon cancer patients. We detect IBD sharing between known and unknown carriers around DNA mismatch-repair genes. Using the IBD relationship, we build and evaluate a model that predicts the presence of Lynch syndrome mutations. Finally, we discuss whether regions of identity by descent can be used for genomic predictions.
We conclude that the utility of the inferred IBD regions depends on the accuracy of detection, the time to the most recent common ancestor, and the mutation rate since then.
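To illustrate the general idea of scanning a pair of unphased SNP genotypes for long shared segments, the sketch below uses a common simplification: two individuals cannot share a haplotype IBD at a marker where they are opposite homozygotes, so long runs without such markers are candidate IBD segments. This is not the ANCHAP algorithm, whose details are not given here. The function name `ibd_candidate_segments`, the 0/1/2 genotype coding (count of one allele), and the `min_snps` threshold are assumptions made for illustration only, and missing genotypes and genotyping error are ignored.

```python
from typing import List, Sequence, Tuple


def ibd_candidate_segments(
    g1: Sequence[int], g2: Sequence[int], min_snps: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) marker index pairs, end exclusive, of runs
    containing no opposite-homozygote marker and spanning >= min_snps SNPs.

    Genotypes are assumed to be coded 0, 1 or 2 (hypothetical coding).
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")

    segments: List[Tuple[int, int]] = []
    run_start = 0
    for i, (a, b) in enumerate(zip(g1, g2)):
        # Opposite homozygotes (0 vs 2) rule out a shared haplotype here.
        if {a, b} == {0, 2}:
            if i - run_start >= min_snps:
                segments.append((run_start, i))
            run_start = i + 1
    # Close the final run, which extends to the last marker.
    if len(g1) - run_start >= min_snps:
        segments.append((run_start, len(g1)))
    return segments


if __name__ == "__main__":
    person_a = [0, 1, 2, 2, 0, 1, 1, 2]
    person_b = [0, 0, 2, 0, 0, 1, 2, 2]
    print(ibd_candidate_segments(person_a, person_b, min_snps=3))
```

In practice a threshold would more plausibly be set on genetic length than on marker count, and some tolerance for genotyping error would be needed; both are left out to keep the sketch short.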