RAPD for strain identification

Sept. 25, 2018

Imagine you’re in a public health lab with a large catchment area. Among the various things you test for are a number of bacterial and viral pathogens which on the whole aren’t that rare; for any one of these, you expect to see at least a handful of positive cases per year. Some of them are likely to be seasonal (“Remember folks, don’t make the chicken salad in the unwashed bowl you just thawed that expired, on-sale chicken in!”), and on top of that, there’s always going to be stochastic variation in the number of cases reported. For the sake of simplicity let’s pick (pick on?) a particular example pathogen here, say Salmonella, in keeping with the spirit of our maligned chicken salad above; but everything we’re about to discuss applies equally well to any other organism. With this scapegoat in hand, let’s say you get two positive cases one month and 12 another, but now you’ve just had seven cases in three days. The question that occurs to you is, “Is this an outbreak (that is, something from a single common source) or is it just an unlucky coincidence of unrelated cases?”

If all seven of your cases just happened to attend the same family reunion picnic immediately before presenting, it’s not a big stretch of the imagination that they’re related; but what if they occur in folks scattered around the city without an immediately apparent common link? In trying to determine if these have a common source, a more direct question we can answer with the materials on hand is, “Are the bacteria isolated from the different patients all the same strain?” The underlying logic here is that while Salmonella may not be incredibly rare in the environment of an area, usually there will be multiple slightly different strains in circulation, and if you just have an unlucky statistical fluke, your multiple cases will be a random assortment of these strains. If on the other hand they all appear isogenic, you are much more likely to have a single source to identify and resolve. The question thus is, “What can I do to get a genetic fingerprint on these samples to see if they all match?”

Genetic fingerprints track organism identities

Note the colloquial term “genetic fingerprint” here. It’s commonly used by analogy, and probably because it’s a pretty good analogy. A fingerprint in its literal context is a biometric marker which can uniquely identify an individual; however, it doesn’t tell you anything about that individual’s phenotype. That is, you don’t know if they have blue eyes or blonde hair or perfect pitch; but if you can get their fingerprint and match it, you know you have the right person. A genetic fingerprint is much the same in that it’s a unique biomarker: since our Salmonella replicates clonally, each strain is uniquely identifiable. As spontaneous mutations occur and propagate, new strains establish and are distinguishable, but the “genetic fingerprint” won’t directly tell us if there are any differences in growth characteristics or virulence behaviour; it will just tell us they’re genetically distinct strains.

One elegantly simple method for generating genetic fingerprints from your patient-derived Salmonella isolates in these sorts of scenarios is known as Random Amplification of Polymorphic DNA (RAPD). The term “Random” is a bit of a misnomer here; the key to laboratory science (and this method) is that it’s reproducible, which by nature is the opposite of random. The name comes from imagining you have the Salmonella genome spread out in front of you, and picking a single random point on one strand of the genome. What’s the probability that the nucleotide you pick is an adenylate (A)? Well, disregarding any %GC compositional bias, the answer is 25 percent; there’s an equally good chance it could have been a T, C, or G. On the scale of biological numbers (millions of bases in a genome), one in four is a pretty good probability. OK, so now move to the next 3’ nucleotide, and ask yourself, what’s the probability it’s an (insert your favourite nucleotide here)? Again, it’s one in four. In fact, you realize that for an oligonucleotide primer of length N, the probability of finding that exact sequence starting at any given position in a random DNA genome is 1 in 4^N. At a 10-base primer length, 4^10 = 1,048,576, meaning that on average any given 10-base primer sequence occurs about once per million bases of genome. Of course, average here means that sometimes it may occur twice within 150 base pairs, and then not at all for 3 million; and this begins to be the key idea behind RAPD.
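As a quick sanity check on that arithmetic, here is a minimal Python sketch. The function name `expected_sites` and the roughly 4.8 million base genome length used for Salmonella are illustrative assumptions, and the model treats the genome as a uniformly random sequence, ignoring %GC bias just as above.

```python
def expected_sites(primer_length: int, genome_length: int,
                   both_strands: bool = True) -> float:
    """Expected number of exact matches of one fixed primer sequence
    in a uniformly random genome (an idealised model, not real data)."""
    per_position = 0.25 ** primer_length          # 1 in 4^N at any start position
    positions = genome_length - primer_length + 1  # possible start positions
    strands = 2 if both_strands else 1             # a primer can sit on either strand
    return per_position * positions * strands


if __name__ == "__main__":
    print(4 ** 10)  # 1048576
    # Assumed genome size of about 4.8 Mb, for illustration only.
    print(expected_sites(10, 4_800_000))
```

Counting both strands doubles the expectation, since a single-stranded primer can find its match on either one.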

Where’s the “Random” come from?

RAPD is done by picking one or more short primers of “random” sequence. The quotation marks are there because the choice is not quite random; homopolymer runs are a bad idea, and some particular sequences may be known to occur repetitively throughout a genome, so you’d ideally avoid both. In general, though, something that looks like a jumble of all four nucleotides is good: AGTTACAGGA or GTACAGGTCG, for instance. If you don’t like those, make up your own; any such jumble is a fair candidate.

To use them, you take extracted DNA from your sample of interest, and set up a “normal” PCR reaction; but in the simplest form of RAPD, you only put in one PCR primer, and it’s one of your randomly chosen 10-mers. Now, thermocycle away as you normally would (bear in mind the reaction annealing temperature will be quite low, often in the 35-40 °C range, to compensate for the low melting point of a 10-mer primer). Let the PCR go to completion, and run the product(s) out on an agarose gel. What do you see?

What you see will depend on the genome you ran the reaction on. Recall that for PCR to work, you need two primers on opposing strands (opposite polarity, that is, with growing 3’ ends pointed towards each other) and that they should be anywhere from about 50-3000 base pairs apart; any shorter and the product isn’t readily visible, and any longer and the individual thermocycles aren’t long enough to allow the polymerase to reach from one priming site to the next. With this in mind, you’ll see that any place in the template DNA material where two copies of our “random” priming site occur, meeting these criteria, will give rise to a PCR product visible as a band of discrete size.
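The geometry described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: exact primer matches only (real low-temperature annealing is less strict), every suitably oriented pair of sites is assumed to give a band, and the function names and the 50-3000 base pair window defaults are illustrative choices taken from the figures in the text.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(template: str, motif: str) -> list[int]:
    """Start indices of all exact (possibly overlapping) matches."""
    return [m.start() for m in re.finditer(f"(?={motif})", template)]


def rapd_band_sizes(template: str, primers: list[str],
                    min_size: int = 50, max_size: int = 3000) -> list[int]:
    """Predicted band sizes for a simplified, exact-match RAPD reaction."""
    template = template.upper()
    left_ends, right_ends = [], []
    for primer in (p.upper() for p in primers):
        # Primer sequence found in the top strand: it extends rightwards.
        left_ends += find_sites(template, primer)
        # Reverse complement found in the top strand: the primer sits on
        # the top strand there and extends leftwards.
        right_ends += [j + len(primer)
                       for j in find_sites(template, reverse_complement(primer))]
    # Products of equal length co-migrate on a gel, so keep distinct sizes.
    sizes = {end - start for start in left_ends for end in right_ends
             if min_size <= end - start <= max_size}
    return sorted(sizes)


if __name__ == "__main__":
    demo_primer = "AGTTACAGGA"
    # Hypothetical template: two facing sites separated by 100 bases of filler.
    demo_template = ("C" * 20 + demo_primer + "C" * 100
                     + reverse_complement(demo_primer) + "C" * 20)
    print(rapd_band_sizes(demo_template, [demo_primer]))  # [120]
```

Passing more than one primer in the list lets any primer pair up with any other, which is the same behaviour you would expect from a mixed-primer reaction.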

With a big enough genome, there are often multiple such occurrences, giving rise to a characteristic pattern of bands for a certain RAPD primer on a certain clonal DNA template. Change the RAPD primer, and a different pattern is generated. The crux of the matter, though, is that the genetic changes giving rise to different strains (be they insertions, deletions, inversions, or even point mutations, if they occur in a primer site) can change the appearance of these patterns by adding or removing bands, or changing the size of bands. Our “fingerprint,” therefore, is the distinctive pattern of bands produced by a particular RAPD primer during a PCR reaction under fixed conditions on a specific genotype template.

Of course, some of you have probably considered already that if a random sequence 10-mer (realistically, about the shortest oligonucleotide which acts effectively as a primer) only occurs once per ~1 million base pairs, then for small bacterial genomes, we’re not likely to get a lot of good band patterns from a single primer, and in fact a fair number of 10-base primer candidates wouldn’t yield any product bands at all, as they just won’t happen to have two suitable priming sites. Fear not, we have a simple solution.

No bands? Throw in a second, different 10-mer and see if the pair can together create any bands. In fact nothing stops you from going to three or four primers in the mix, until you reach enough combination diversity to start generating a set of bands from the available template. What’s critical to realize is that in any case the banding pattern will still be reproducible, with a specific product band pattern occurring with a single strain of material as long as the reaction conditions and primer(s) employed stay fixed.
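To see why adding primers helps so quickly, here is a rough back-of-envelope estimate in Python. It is a sketch under idealised assumptions: a uniformly random genome, exact-match priming only, independent sites, and genome ends ignored. The function name `expected_bands` and the 5 million base genome used in the loop are hypothetical.

```python
def expected_bands(genome_length: int, primer_length: int, n_primers: int,
                   min_size: int = 50, max_size: int = 3000) -> float:
    """Rough expected count of productive, facing site pairs."""
    # Chance that any given position on one strand starts a site
    # for at least roughly one of the primers in the mix.
    site_density = n_primers / 4 ** primer_length
    # Expected number of rightward-pointing sites in the genome.
    forward_sites = genome_length * site_density
    # Number of downstream positions where a facing site gives a usable product.
    window = max_size - min_size + 1
    return forward_sites * window * site_density


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, expected_bands(5_000_000, 10, k))
```

Because the site density appears twice in the product, the expected count grows with the square of the number of primers: two primers give about four times as many candidate pairs as one, and four primers give about sixteen times as many.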

Other information we can extract

Can RAPD results tell us not just if two samples are different, but how different they are? Absolutely. This sort of information is best gathered if several bands are present, and/or if multiple separate RAPD primer (or primer set) reactions are run, each generating its own characteristic result pattern. In this case, if you imagine a parental strain and a divergent offspring strain with a small amount of genetic change, you can imagine that one or maybe a few of the bands are changed, with most of the band pattern remaining shared between the two strains.

As further divergence from the original strain occurs, there are more and more accumulated changes in the banding pattern. By scoring the bands as heritable genetic markers, it’s relatively straightforward to mathematically score degree of relatedness between samples by RAPD, and even to infer phylogenies (that is, patterns of strain origin; which were the original strains and which are the descendant strains in turn).
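One simple way to put a number on relatedness is to treat each fingerprint as a set of band sizes and compute a band-sharing score. The sketch below uses the Dice coefficient, which is a choice made here for illustration and not something the method itself prescribes; the band sizes are made-up values, and the code assumes bands have already been called and binned to common sizes.

```python
def dice_similarity(bands_a, bands_b) -> float:
    """Dice band-sharing coefficient: 2 * shared / (total bands in both lanes)."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # two empty lanes are trivially identical
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical band sizes in base pairs, for illustration only.
    parent = [350, 600, 900, 1500, 2200]
    offspring = [350, 600, 900, 1500, 2000]   # one band shifted in size
    unrelated = [400, 750, 1500, 2600]

    print(dice_similarity(parent, offspring))  # 0.8
    print(dice_similarity(parent, unrelated))
```

A matrix of such pairwise scores across all isolates is the usual starting point for clustering samples into a tree, though the choice of coefficient and clustering method is a separate decision.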

Once grasped, it’s hard not to be impressed at the elegance of RAPD. It requires absolutely no prior knowledge of the organism to be tested, can be applied to any genome, and can be done with the absolute bare minimum of lab equipment. All of this comes with some practical costs though. For one, the template material must be purely from one species and strain, not a mix of DNA from host, pathogen, and “other things”; any sort of mixing of “other DNA” into the template material will lead to the appearance of extra bands with no way of knowing they originate from off-target material. If we’re examining our Salmonella samples, we’d want to do it off individual plate colonies. It’s also not readily amenable to automation (much of the band scoring is subjective), and it’s not readily adaptable to very high throughputs. For small numbers of occasional samples run in a reference lab setting, however, it is a cheap and handy alternative to pulsed-field gel electrophoresis (PFGE) methods for strain typing and answering our opening query, “Were these cases linked?”