In this month’s episode, we’re going to take a bit of a detour out of front-line molecular testing methods and delve a bit into something more theoretical but with impacts on molecular testing. It’s likely something you haven’t encountered unless you have a specialized genetics background, and it’s also something readers of this column may have wondered about in some form and may find of interest.
Mendelian isn’t everything: QTLs
Let’s start by considering something even the non-genetics specialists have some familiarity with: simple Mendelian genetic traits. These are phenotypic (physical appearance or behavioral) traits which can be classified in discrete and mutually exclusive bins, like eye color. You might have green, blue, or brown eyes, but each of these is readily distinguishable; it’s not as if there is a continuous rainbow spectrum of natural human eye colors. Other examples of discrete Mendelian traits in humans would be hairy pinna (outer ear) or widow’s peak (hairline). There are many more possible examples of less visibly obvious traits, such as enzyme isoforms with discrete Mendelian states that show measurably different biochemical behavior.
However, for a great many phenotypic traits we actually have a continuous spectrum of outcomes rather than discrete states. Think, for example, of height, longevity, or resting blood pressure. In fact, a great many of the phenotypic behaviors we’d like to know more about express themselves in this quantitative (as opposed to quantized) form. An example which probably nobody cares about, but which serves an illustrative purpose, might be “rate of fingernail growth.”
The immediate complexity which springs to mind when we consider these types of measurements is that the end result observed in each individual is based on both genotypic and environmental factors, or “nature and nurture” as it’s sometimes called. A second level of complexity arises when we consider that these sorts of traits are likely to be influenced to a larger or smaller degree by multiple genes working in combination (a polygenic trait). A third confounding issue is epigenetic modification of genes, through mechanisms such as base methylation or histone acetylation, which influences expression levels. The fourth and final one is “variable penetrance,” something of a catch-all term applied in genetics to cases where a particular allele of a gene has a known effect, but the scale of that effect varies for reasons we don’t have a firm grasp on. What we’ll discuss in this month’s article is “QTL (Quantitative Trait Loci) mapping,” the approach taken to help identify genes and their alleles influencing quantitative traits of interest, with strategies to deal with our first and second confounding issues (and quite possibly our fourth issue). Epigenetic modification is a bigger topic and one we’ll leave out of the mix for now.
Step 1: Genetic markers and association statistics
Our first step in this puzzle is to have a relatively high density of randomly distributed genetic markers across the genome. These can take the form of any sort of identifiable “tag” which allows us to track the DNA that is closely physically associated (linked) with it. Single Nucleotide Polymorphisms (SNPs) are one of the most common types of such tags: single nucleotides at known locations in the genome which exist in more than one form in the population. We might for instance note one location which is “A” in 70 percent of genomes making up our population and “C” in the remaining 30 percent. It doesn’t matter whether this is in a coding region or not (statistically, it probably isn’t) or whether the two alleles have any actual physical significance (even more unlikely). All we care about is that we now have a differentiable physical spot in the genome. As genetic recombination is a stochastic process, DNA sections near this marker stay attached to it more frequently than DNA sections further removed. If we have enough of these markers, then we can track the movement of fairly small sections of DNA as they reassort through recombination and sexual reproduction to create individual genomes.
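To put rough numbers on the idea that nearby DNA stays attached to a marker more often than distant DNA, here is a minimal Python sketch. It assumes Haldane's mapping function, a standard textbook model that ignores crossover interference; that model choice is ours and is not something this column specifies, and the function name and example distances are purely illustrative.

```python
from math import exp


def recombination_fraction(distance_morgans):
    """Haldane's mapping function.

    Returns the expected fraction of gametes in which a marker and a locus
    this far apart (in Morgans) end up separated by recombination.
    Assumption: crossovers occur independently (no interference).
    """
    if distance_morgans < 0:
        raise ValueError("distance must be non-negative")
    return 0.5 * (1.0 - exp(-2.0 * distance_morgans))


if __name__ == "__main__":
    # Hypothetical marker-to-gene distances in centimorgans (1 cM = 0.01 Morgan).
    for cm in (0.1, 1, 10, 50, 200):
        r = recombination_fraction(cm / 100.0)
        print(f"{cm:>5} cM apart: separated in about {r:.1%} of gametes")
```

Under this model, very short distances give a fraction almost equal to the distance itself (1 cM is roughly 1 percent), while distant loci level off toward 50 percent, which is what completely unlinked loci show.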
With these densely and randomly scattered markers on hand, we can now take sample individuals from our population and essentially do nothing more than look for statistical association of particular marker(s) with our trait of interest, in this case rate of fingernail growth. We’re looking for one or more sections of DNA whose inheritance seems to track with a measure of our phenotype, such that we can make statements like “With regard to this SNP, we observe faster fingernail growth in 97 percent of ‘C:C’ genotype individuals as compared to ‘A:A’ genotype individuals.” What we are actually observing is that there is some gene near our marker, and that at some time in the past, on a chromosome which carried the allelic form of that gene which contributes to faster fingernail growth, there was a mutation at the nearby SNP transverting an A to a C residue. (The temporal reverse is also possible, such that the A and C SNP alleles came into existence and then the gene near one of these mutated. The result is the same: a functional gene allele is linked to a detectable marker.) Because these are physically closely linked, recombination between the marker and the allele of interest is rare, and the marker now moves around as a surrogate for that allelic variant. Note that ‘rare’ doesn’t mean ‘never,’ and the first statistical value we get is strength of association (here, 97 percent), which is also a surrogate measure of the actual distance between the marker and the gene. If we’re particularly lucky we may even have two or more genetic markers which demonstrate this sort of linkage association, and based on their relative strengths of association we can narrow down the general area of DNA in which the gene likely resides. Luck in this case is greatly increased as marker density increases, of course; high density marker maps make the entire QTL mapping process easier than lower density maps.
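The marker scan described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real QTL pipeline: the population is simulated, markers are drawn independently (so the single "causal" marker stands in for a marker tightly linked to the gene), a plain Pearson correlation between C-allele count and growth rate stands in for the strength of association, and every name and number here is hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no measurable association
    return sxy / sqrt(sxx * syy)


def simulate_population(n_individuals=500, n_markers=20, causal_marker=7,
                        c_allele_freq=0.3, seed=1):
    """Return (dosages, growth) for a made-up population.

    dosages[i][m] is the number of 'C' alleles (0, 1, or 2) that individual i
    carries at marker m. Only `causal_marker` influences growth.
    """
    rng = random.Random(seed)
    dosages, growth = [], []
    for _ in range(n_individuals):
        # Two chromosomes per individual, each carrying 'C' with the given frequency.
        row = [sum(rng.random() < c_allele_freq for _ in range(2))
               for _ in range(n_markers)]
        dosages.append(row)
        # Baseline growth, plus a per-allele boost, plus environmental noise.
        growth.append(3.0 + 0.4 * row[causal_marker] + rng.gauss(0.0, 0.5))
    return dosages, growth


def scan_markers(dosages, growth):
    """Correlate each marker's C-allele dosage with the phenotype."""
    n_markers = len(dosages[0])
    results = []
    for m in range(n_markers):
        column = [row[m] for row in dosages]
        results.append((m, pearson_r(column, growth)))
    # Strongest absolute association first.
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    dosages, growth = simulate_population()
    for marker, r in scan_markers(dosages, growth)[:3]:
        print(f"marker {marker}: r = {r:+.2f}")
```

In this toy population the marker that actually drives growth rises to the top of the ranking, while the rest show only the small chance correlations expected from noise.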
More statistics: effect sizes
There is an unrelated second statistic we should look for with our now-linked marker, and that is effect size. Imagine for instance when we looked for markers associated with fingernail growth rate, we found three widely separated markers with statistical relevance. All have one allele which clearly shows accelerated fingernail growth rate, but compared to some baseline reference value, Marker A shows a three percent increase in growth rate, Marker B shows a 22 percent increase, and Marker C shows a seven percent increase. (Strictly speaking the statistics would be more complicated than that, such as having a 95 percent confidence interval range of effect, but we’re ignoring these nuances here because they can—and do—fill entire textbooks. There are even multiple completely different statistical approaches to identifying our linked markers. For the sake of brevity, we’re avoiding all of that as the basic concepts as presented here remain valid across all these approaches.) These values suggest the relative scale of impact each gene has on the final phenotype.
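As a rough illustration of an effect size with its confidence interval, the sketch below fits a straight line of growth rate against the number of copies of one marker allele. The additive per-allele model, the normal-approximation interval, the function name, and the six data points are all our assumptions for illustration; none of it comes from a real study.

```python
from math import sqrt


def per_allele_effect(dosages, phenotypes):
    """Least-squares slope of phenotype on allele dosage (0, 1, or 2 copies).

    Returns the slope, an approximate 95% confidence interval (normal
    approximation; a t-distribution is more appropriate for small samples),
    and the slope as a percentage of the fitted zero-copy baseline.
    """
    n = len(dosages)
    if n < 3:
        raise ValueError("need at least three individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all individuals share one genotype")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # fitted mean for zero copies
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(dosages, phenotypes))
    std_err = sqrt(residual_ss / (n - 2) / sxx)
    margin = 1.96 * std_err
    percent = 100.0 * slope / intercept if intercept != 0 else None
    return {
        "slope": slope,
        "ci_low": slope - margin,
        "ci_high": slope + margin,
        "percent_of_baseline": percent,
    }


if __name__ == "__main__":
    # Hypothetical data: copies of the 'C' allele and growth in mm per month.
    copies = [0, 0, 1, 1, 2, 2]
    growth = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
    result = per_allele_effect(copies, growth)
    print(f"per-allele effect: {result['slope']:.2f} mm/month "
          f"(95% CI {result['ci_low']:.2f} to {result['ci_high']:.2f})")
```

Running the same calculation for each associated marker gives figures that can be compared directly, in the same spirit as the three, 22, and seven percent values above.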
And the candidate gene is…
Now armed with the knowledge of which regions have the biggest impact on fingernail growth rates, we can proceed (probably from the biggest impact markers to the lowest) to look for likely candidate genes. If we find an ORF (open reading frame, a potential protein coding region) near one of our markers, and we find that its translated amino acid sequence has, for example, a high degree of similarity to a known keratin synthase gene, that would make sense and be a good candidate. Our evidence for its involvement would get even stronger if we find this gene expressed as an RNA (or even just a fragment of it, in parlance an EST, Expressed Sequence Tag), telling us that it’s an active gene. If we could then find either amino acid variations in this candidate gene, or variances in its RNA expression level, which correlate with our phenotypic observations of fingernail growth rate, we can become increasingly certain that we’ve found our actual gene contributing to the continuous phenotype. A full proof would then best be done by targeted gene modification in whatever our model organism is for fingernail growth (the genetic equivalent of fulfilling Koch’s postulates, by making the genetic change in an otherwise controlled background and environment and observing the expected effect). If there is no well established model system, then cloning of the variant gene forms, in vitro protein expression, and enzyme kinetics studies can be nearly as helpful, by demonstrating that yes, the protein version coded for by the gene found associated with the “C” SNP has faster enzymatic behavior than the version found associated with the “A” SNP. Follow these approaches through on the other identified linked loci and candidate genes to rule them in or out, and we have now fulfilled our lab’s lifelong academic ambition of understanding multiple genes influencing fingernail growth rate.
We mentioned above that this approach would attempt to address the challenge of “nature versus nurture.” Part of this is through the effect size values discussed above; generally, if the genetic effect is big enough, “nature” is more important than “nurture” and an effect is visible even with disparate environmental factors. Other approaches, however, can include things like linking in family pedigree information, where sibling studies can be done in the hope that environments will be similar; things like dizygotic twins can be particularly useful here.
Alternatively, if particular environmental factors are suspected, these may be accounted for in metadata collected and paired with each genetic subject. The second complexity, polygenic traits, was directly dealt with above, as we were able to identify multiple loci. The fourth complexity, our fudge factor of “variable penetrance,” may to some extent be explained away once we have our data in hand, as we start to show that loci A, B, and C are all involved but we don’t see all of the impact of a particular allele at A unless we also have a particular allele at C. This is no longer variable penetrance; it’s a definable epistatic gene interaction. Finally, although we said we’d ignore the third complexity of epigenetics, it’s becoming increasingly possible from a laboratory technical perspective to capture data on things like DNA methylation during sequencing. Analogous statistical approaches to identify differential patterns of epigenetic labeling of particular gene regions correlating to phenotype are of course possible and will likely become more commonplace as the underlying data becomes more commonly available.
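The epistatic interaction between loci A and C mentioned above can be made concrete with a small Python sketch. It simply compares the effect of the allele at A in individuals with and without the allele at C; the carrier/non-carrier coding, the function names, and the eight data points are hypothetical, and a real analysis would add a formal significance test.

```python
from collections import defaultdict


def cell_means(records):
    """Mean phenotype for each (carries_a, carries_c) combination.

    Each record is a tuple: (carries_a, carries_c, phenotype_value).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for carries_a, carries_c, value in records:
        cell = totals[(bool(carries_a), bool(carries_c))]
        cell[0] += value
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def interaction_contrast(records):
    """Effect of the A allele without C, with C, and the difference.

    A difference near zero suggests the loci act independently; a large
    difference suggests the effect at A depends on the genotype at C.
    """
    means = cell_means(records)
    for key in ((False, False), (True, False), (False, True), (True, True)):
        if key not in means:
            raise ValueError(f"no individuals with genotype combination {key}")
    effect_without_c = means[(True, False)] - means[(False, False)]
    effect_with_c = means[(True, True)] - means[(False, True)]
    return effect_without_c, effect_with_c, effect_with_c - effect_without_c


if __name__ == "__main__":
    # Hypothetical growth rates (mm/month) for eight individuals.
    records = [
        (False, False, 3.0), (False, False, 3.2),
        (True, False, 3.1), (True, False, 3.3),
        (False, True, 3.0), (False, True, 3.2),
        (True, True, 3.8), (True, True, 4.0),
    ]
    without_c, with_c, interaction = interaction_contrast(records)
    print(f"effect of A without C: {without_c:+.2f}")
    print(f"effect of A with C:    {with_c:+.2f}")
    print(f"interaction:           {interaction:+.2f}")
```

In this made-up data set the allele at A barely matters on its own but has a clear effect when the allele at C is also present, which is the pattern that would otherwise be written off as variable penetrance.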
Obtaining complete genomes of organisms is orders of magnitude cheaper and easier than it was only a few years ago. Making sense of all of that information, by understanding how these genes contribute not just to discrete Mendelian traits but to all the complex, polygenic, continuously variable metrics we can imagine, is the next step in making good use of the data.
Hopefully, the foregoing has demystified the process for those of you not already familiar with it. Should it merely have whetted your appetite for the subject, there are numerous good, up-to-date texts available.