Coding-sequence determinants of gene expression in human cells
The human genome is highly heterogeneous in its GC composition. How codon usage affects translation rates has been extensively studied and exploited to increase protein expression. Although effects on virtually all other steps of gene expression have also been reported, no systematic approach has so far been taken to measure quantitatively the contribution of each step to overall protein levels in human cells. Here, I utilise a library of several hundred synonymous variants of the green fluorescent protein (GFP) to characterise the influence of codon usage on gene expression in human cells. In an initial small-scale screen, I show that protein levels are largely correlated with codon usage and particularly with GC content. Additionally, I demonstrate that these changes are already apparent at the RNA level, confirming more broadly data previously published by our lab (Kudla et al., 2006). To assess the consequences of randomised codon usage on a larger scale, I established and validated a high-throughput approach for the phenotypic profiling of reporter genes. Using a pool of cells stably expressing >200 GFP variants, I measured multiple parameters simultaneously, including protein levels, translational state, and RNA levels, stability and export. Data from these experiments confirm a strong relationship between GC content and both protein levels and RNA export, reproducibly in two cell lines. The low expression of GC-poor variants in particular could not be rescued by splicing but was accompanied by an increased nuclear-to-cytoplasmic RNA ratio, suggesting that further mechanisms are important for efficient gene expression. These effects are even more pronounced when GC content is distributed evenly along the coding sequence.
Interestingly, our data also suggest that high GC content within the first 200 nt is more predictive of efficient gene expression, in contrast to studies performed in bacteria, in which strong secondary structure near the ribosomal binding site was shown to be non-permissive for translation (Kudla et al., 2009). By relating experimentally derived parameters to sequence features known to inhibit expression, I demonstrate that cryptic splicing is a major factor leading to decreased levels of GC-poor GFP variants in particular. An attempt to quantitatively assess the relative contribution of several sequence features (e.g. tAI, GC3, CpG) using multiple regression analysis led to inconclusive results. Alternative approaches will therefore need to be explored in order to dissect the role of individual parameters and to identify novel determinants of gene expression.
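As an illustration of the simpler sequence features named above, the following minimal Python sketch computes overall GC content, GC3 (GC content at third codon positions), GC content within the first 200 nt, and the number of CpG dinucleotides for a coding sequence. It is not the analysis code used in this work: the function names, the returned dictionary keys and the toy sequence are hypothetical, and tAI is omitted because it depends on tRNA gene copy-number weights that are not given here.

```python
def gc_fraction(seq):
    """Fraction of G or C bases in a nucleotide sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_fraction(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    return gc_fraction(cds[2:usable:3])


def cpg_count(seq):
    """Number of CpG dinucleotides (a C immediately followed by a G)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def sequence_features(cds, window=200):
    """Collect the features in a dict; `window` is the 5' region length in nt."""
    return {
        "gc": gc_fraction(cds),
        "gc3": gc3_fraction(cds),
        "gc_first_window": gc_fraction(cds[:window]),
        "cpg": cpg_count(cds),
    }


if __name__ == "__main__":
    # Toy in-frame sequence for illustration only, not a real GFP variant.
    print(sequence_features("ATGCGCGAATAA"))
```

Features computed in this way for each variant could serve as the predictor columns of a multiple regression against measured protein or RNA levels.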