Inferring strength of selection in vertebrate genomes
Protein-coding sequences have long been assumed to evolve under selection, but quantification of this process at the nucleotide sequence level began only after Kimura formulated a simple null model, the neutral theory of molecular evolution. Several methods were then developed on the assumption that synonymous sites (nucleotides at third codon positions at which changes do not alter the encoded amino acid) evolve close to neutrally and can be used as local neutral standards. Most of our current knowledge of the direction and strength of selection still depends on this simple assumption. One method in particular, the non-synonymous to synonymous substitution rate ratio (dN/dS), has gained prevalence and is still widely used, in spite of the growing body of evidence that synonymous sites evolve under selection. In this thesis, I quantify the strength of selection in different sequence compartments of mammalian genomes in order to obtain estimates of their functional importance from comparative genomic analyses. I quantify the fraction of mutations that have been selectively eliminated since the divergence of the species pairs examined, the so-called genome-wide selective constraint. This in turn is used to approximate the genomic deleterious mutation rate, which is an important parameter for several evolutionary problems. As estimates of selection depend to a large extent on the chosen neutral standard, I use orthologous transposable elements, so-called ancestral repeats, as these have been found to evolve in a largely neutral fashion and to contain the fewest constrained sites in mammalian genomes. This enables me to quantify the level of selection even at synonymous sites, and the results suggest that these sites do indeed evolve under constraint, the consequences of which I discuss.
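As an illustration of the quantities described above, selective constraint is commonly expressed as the proportional reduction of the substitution rate in a sequence compartment relative to a neutral standard (here, ancestral repeats). The sketch below is a minimal example under that assumption, not the exact estimator used in the thesis. The function names, the per-site mutation rate argument, and the diploid factor of 2 in the deleterious mutation rate are illustrative assumptions.

```python
def selective_constraint(k_test, k_neutral):
    """Fraction of mutations removed by selection (assumed form: C = 1 - K_test / K_neutral).

    k_test    -- substitutions per site in the compartment of interest
    k_neutral -- substitutions per site in the neutral standard
                 (e.g. ancestral repeats)
    """
    if k_neutral <= 0:
        raise ValueError("neutral substitution rate must be positive")
    return 1.0 - k_test / k_neutral


def deleterious_mutation_rate(mu_per_site, compartments):
    """Approximate genomic deleterious mutation rate per diploid genome.

    Assumed form: U = 2 * mu * sum(C_i * n_i), where C_i is the constraint
    and n_i the number of sites in compartment i.

    mu_per_site  -- mutation rate per site per generation (hypothetical input)
    compartments -- iterable of (constraint, number_of_sites) pairs
    """
    constrained_sites = sum(c * n for c, n in compartments)
    return 2.0 * mu_per_site * constrained_sites
```

A negative value of `selective_constraint` would indicate that the compartment evolves faster than the neutral standard, which can point either to positive selection or to a poorly chosen standard.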
The selective constraint estimates enable me to test some simple hypotheses, such as Ohta's nearly neutral theory of molecular evolution, which suggests that selection is more efficient in species with larger effective population sizes. Besides the choice of neutral standard, several additional factors are known to affect selective constraint estimates. Here I also test the consequences of one of these, namely sequences that are not at compositional equilibrium (i.e. whose GC content differs from the equilibrium GC content). In this situation, sequences with different GC content are predicted to evolve at different rates. This can bias estimates of the level of selection, or can even imitate selection in sequences that evolve completely neutrally. This effect is quantified here, and a simple correction is discussed.
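The effect of compositional non-equilibrium can be illustrated with a simple two-state model, which is an assumption made here for illustration and not necessarily the model used in the thesis. Let u be the AT-to-GC substitution rate and v the GC-to-AT rate per site. The equilibrium GC content is then u / (u + v), and a neutrally evolving sequence with GC content f accumulates substitutions at rate f * v + (1 - f) * u. When u differs from v, two neutral sequences with different GC content evolve at different rates, so comparing them produces a non-zero apparent constraint.

```python
def equilibrium_gc(u, v):
    """Equilibrium GC content under a two-state model (assumed, illustrative).

    u -- AT -> GC substitution rate per site
    v -- GC -> AT substitution rate per site
    """
    return u / (u + v)


def neutral_rate(gc, u, v):
    """Expected neutral substitution rate per site at GC content `gc`."""
    return gc * v + (1.0 - gc) * u


def apparent_constraint(gc_test, gc_ref, u, v):
    """Spurious 'constraint' obtained when a neutral test sequence is compared
    against a neutral standard that has a different GC content."""
    return 1.0 - neutral_rate(gc_test, u, v) / neutral_rate(gc_ref, u, v)
```

With u = 1 and v = 2 (arbitrary units), a test sequence at the equilibrium GC content of 1/3 compared against a standard at 50% GC gives an apparent constraint of about 0.11, even though both sequences evolve neutrally. When u equals v, the rate no longer depends on GC content and the apparent constraint is zero.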