Metrics of assay accuracy

March 21, 2017

In this month’s installment of The Primer, we’re going to cover something that applies equally well to molecular and non-molecular diagnostic assays: the use and meaning of some common metrics for assay accuracy. While this is likely a review of something all readers will have encountered at some time in the past, it’s a crucial enough concept to warrant attention for any of us who perhaps don’t recall quite what these metrics mean and how they differ from each other. Such an understanding can prove useful in the minority of cases where a laboratory result seems at odds with other indications for a particular diagnosis.

Sensitivity and specificity

The first two metrics we’ll consider are sensitivity and specificity. These are perhaps the simplest of the metrics, as they are direct measurements of assay performance attributes which should in theory remain constant across all users (assuming identical instrumentation, reagents, and protocols). To address these terms, we’ll first need to embrace the concept of “true positives” and “true negatives”. These apply in the context of a specific question with a Boolean (yes/no) answer, such as “Is Pathogen X present in Sample Y?” True positives are those cases where the answer is “yes,” and true negatives are those cases where the answer is “no,” from the perspective of some “omniscient” observer.

Readers of a more critical bent may immediately wish to point out that there is no such omniscient data source we can query in this situation. Such an argument is correct, and in reality we shall have to settle for use of some “gold standard,” or accepted best-accuracy reference method, as our surrogate for absolute yes/no truth in such matters. (The implications of this are something we’ll come back to later.) Other readers may also note that there is no room for “indeterminate” answers in our concept here, which is another divergence between theory and reality, at least in the case of those assays with defined “grey zones” that are not to be interpreted as evidence of either positivity or negativity. Pragmatists that we are, we will proceed regardless of these issues.

Sensitivity of an assay is then defined as the fraction (or percentage, if you prefer) of true positives which are detected as positive by the assay under a defined performance and interpretation method. On the surface, this is a readily grasped concept: if, for instance, we test 50 true positive samples and our assay calls 49 of these positive, we would have a sensitivity of 98 percent. One might even be forgiven for thinking that this is all one needs to know to establish one’s degree of trust in an assay result; however, as we’ll demonstrate later, this is not the case. Nor is a claim of 100 percent sensitivity always reassuring.

Specificity is the inverse concept; it is defined as the fraction (or percentage) of true negatives which are detected as negative by the assay—again, under a defined set of performance and interpretation conditions. If we consider an example similar to the above, and run our assay on 50 true negative samples with the assay calling 48 of the 50 negative, we have a specificity of 96 percent. Again, this seems on the surface to be a simple concept.

One important point that needs to be made is that we must be simultaneously aware of both the sensitivity and specificity of an assay, with both values determined via the same protocol and interpretation, for the values to be meaningful. Either value alone can be wildly misleading, as will be demonstrated below. (Note that there are also the terms False Positive Rate, which is 1-[Specificity], and False Negative Rate, which is 1-[Sensitivity]; strictly speaking, then, the FPR [false positive rate] could stand in for specificity, and the FNR [false negative rate] for sensitivity, in our ability to fully appreciate the performance reliability of an assay).
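To make the arithmetic concrete, here is a minimal sketch in Python showing how sensitivity, specificity, FNR, and FPR all fall out of the same four confusion-matrix counts. The counts are hypothetical, drawn from the worked examples above (49 of 50 true positives called positive; 48 of 50 true negatives called negative).

# Minimal sketch: the four basic metrics from confusion-matrix counts.
def basic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, FNR, and FPR as fractions."""
    sensitivity = tp / (tp + fn)   # fraction of true positives called positive
    specificity = tn / (tn + fp)   # fraction of true negatives called negative
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "FNR": 1 - sensitivity,    # false negative rate
        "FPR": 1 - specificity,    # false positive rate
    }

print(basic_metrics(tp=49, fn=1, tn=48, fp=2))
# sensitivity 0.98, specificity 0.96, FNR ~0.02, FPR ~0.04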

PPV and NPV

Other commonly encountered but more complex metrics for assay performance are the Positive Predictive Value (PPV) and Negative Predictive Value (NPV). The PPV can be thought of as the likelihood that a positive test result means that the sample examined is a true positive, or the fraction of true positives over assay positive calls in a set. Similarly, the NPV is the likelihood that a negative test result means that the sample examined is a true negative, or the fraction of true negatives over assay negative calls. A complexity of PPV and NPV is that unlike sensitivity and specificity, these are not hard-and-fast values that one can assume to be invariant between assay sites with identical assays and processes; in fact, they can’t even be assumed invariant over time at a single site with absolute uniformity in its assay method. That is because each naturally incorporates the target prevalence rate at the time the assay was performed.

In other words, and at extreme cases for simplicity of consideration, in cases of low prevalence the PPV will trend low, because the (fixed) probability of a false positive result becomes large relative to the actual probability of a sample being positive. NPV shows the mirror-image relationship: at high prevalence, NPV trends low as false negatives come to outnumber the relatively few true negatives. Thus, while PPV and NPV values are highly valuable snapshots of a time and place, evaluating their applicability requires an appreciation of whether the intrinsic target prevalence when the values were generated is similar to the prevalence now.
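As an illustration of this prevalence dependence, the sketch below computes PPV and NPV from fixed sensitivity and specificity (the 98 and 96 percent figures from the earlier examples) across a range of assumed prevalences; the prevalence values are illustrative assumptions only.

# Sketch: how PPV and NPV shift with prevalence, sensitivity and specificity held fixed.
def ppv_npv(sensitivity, specificity, prevalence):
    """Predictive values for a given target prevalence, via Bayes' rule."""
    tp = sensitivity * prevalence               # true positives, as a fraction of all samples
    fp = (1 - specificity) * (1 - prevalence)   # false positives
    tn = specificity * (1 - prevalence)         # true negatives
    fn = (1 - sensitivity) * prevalence         # false negatives
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.01, 0.10, 0.50, 0.90):
    ppv, npv = ppv_npv(0.98, 0.96, prev)
    print(f"prevalence {prev:4.0%}: PPV {ppv:6.1%}, NPV {npv:6.1%}")
# At 1 percent prevalence the PPV is only about 20 percent despite the seemingly
# strong assay; at 90 percent prevalence it is the NPV that suffers instead.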

The Magic Box

A useful exercise for putting all of these concepts together, in a way that shows how they can mislead if they are not understood, is consideration of the Magic Box assay. This amazing assay—not available from any manufacturer, but within the “homebrew” capabilities of any lab manager—has a remarkable set of properties. It costs less than fifty cents in equipment and infrastructure; it requires no consumables; it requires only a few moments of operator training for expert operation; it has a turnaround time of less than 30 seconds; it is non-destructive of specimen material; although a simplex (single target) assay, it can be instantly reconfigured to test for any target desired; and it’s 100 percent sensitive.

Before you decide this is clearly the next must-have assay for your lab, let’s consider how the Magic Box assay works, and how all of the above statements are absolutely true. To build the assay device, simply take a small cardboard box of your choice, remove the lid, and use a marker to draw a large “+” on the outside box bottom. Using the assay is also easy: place your specimen to test (still in its container) on the lab bench; place the Magic Box upside down over the sample; consider what target you want to assay for; and then look down at the box. If you see a “+”, the sample is positive, and you’re done and ready to go on to the next sample (or test for another target in the same sample).

Now let’s fill in some more details. Let’s say, for the sake of argument, that the actual prevalence of the target you “tested” for, in the sample stream under examination, is 70 percent. This is not an unreasonable value, either because the target is currently undergoing an outbreak in your patient population, or because the target has a highly characteristic presentation such that the requesting physician already has a strong suspicion of positivity and is merely submitting the sample for confirmatory testing. (Of course, both of these can occur simultaneously, and during known outbreaks of a pathogen with a characteristic presentation, the prevalence in the submitted sample stream can be very high.)

How does our Magic Box assay perform under these real-life conditions? Well, recall first that by the definition of sensitivity—that true positives are detected as positive—the advertising hype of “100 percent sensitive” is true; every time you put a positive sample under the box and looked down, you saw “+”! The PPV would be a little less reassuring, at only 70 percent, and should raise the first red flags indicating that results are likely to be wrong some 30 percent of the time. This is hardly the sort of confidence one would want out of a clinical diagnostic. Consideration of specificity and NPV is truly disturbing: specificity would be zero percent, since no true negative is ever called negative, and NPV cannot even be calculated; the Magic Box, after all, never reports a sample as negative.
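Plugging the Magic Box into the same arithmetic makes the point starkly. The sketch below assumes a hypothetical run of 100 samples at the 70 percent prevalence described above.

# The Magic Box reduced to counts: every sample is called positive.
true_positives, true_negatives = 70, 30       # actual status of 100 samples at 70% prevalence

tp, fn = true_positives, 0                    # every true positive is called positive
fp, tn = true_negatives, 0                    # every true negative is also called positive

sensitivity = tp / (tp + fn)                  # 1.00 -- "100 percent sensitive!"
specificity = tn / (tn + fp)                  # 0.00 -- no negative is ever reported
ppv = tp / (tp + fp)                          # 0.70 -- simply the prevalence
npv = tn / (tn + fn) if (tn + fn) else None   # undefined: there are no negative calls

print(sensitivity, specificity, ppv, npv)     # 1.0 0.0 0.7 None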

There’s no real magic

In many ways, the above discussion is another reflection on a concept generally well understood by laboratorians: that sensitivity and specificity are something of a trade-off, with improvements to one causing losses in the other. Selection of assay cutoff criteria to find an acceptable “maximum utility” compromise between these metrics is frequently done through application of methods such as analysis of Receiver Operating Characteristic (ROC) curves. While these are a fascinating topic in their own right (and one which delightfully demonstrates how disparate branches of science can benefit one another; in this case, how rules for use of primitive Chain Home radar installations from the Second World War led to better medical tests), we are constrained by space from further consideration of the topic for now.

The take-home message from this month’s topic is thus that laboratorians should never accept a sensitivity value on its own as providing any real insight into the true utility of a test. The same of course goes for specificity; one could, after all, make a Magic Box with a minus sign on it, which could now in all truth claim 100 percent specificity, since all true negatives would be reported as negative. If this message is one some readers think too obvious to need discussion, the author suggests they peruse the poster boards, or listen to the conversations around them, at their next conference for evidence that not all people understand the crucial requirement to report all relevant assay performance metrics, rather than just a select few. Finally, recall our earlier reference to “absolute” positives and negatives; in reality, while gold standard assays can be highly accurate, it’s unlikely that any of them are truly 100 percent correct, and our values for sensitivity, specificity, PPV, and NPV are thus all probably not exactly correct. This variance from exactitude is, however, likely of such small magnitude as to be of no practical concern, and is thus generally ignored.

John Brunstein, PhD, is a member of the MLO Editorial Advisory Board. He serves as President and Chief Science Officer for British Columbia-based PathoID, Inc., which provides consulting for development and validation of molecular assays.