In-depth coverage: some useful NGS terms

Nov. 20, 2014

In the October 2014 issue of MLO, “The Primer” looked at some of the basics of next generation sequencing (“Next generation sequencing and library types”). In this installment of The Primer, we’re going to pause from purely technical issues and cover the meanings of two basic but critical and somewhat overlapping terms that clinical NGS applications refer to: depth and coverage. As with any specialized technology, NGS has its own custom language. Understanding some of these key terms can help the clinician communicate more effectively with translational research scientists in setting up an NGS protocol that provides useful information for the desired application.

Sequencing depth (also known as read depth) describes the number of times a given nucleotide in the genome has been read in an experiment. Recall that in most NGS protocols, the genome (either the whole genome or a targeted “panel,” as covered last month) is fragmented into short sections of a few hundred base pairs. These are individually read and then bioinformatically overlapped, or “tiled,” to generate the longer contiguous sequences that make up the meaningful end data. At first impression, you might think it’s only necessary to read each nucleotide position once to do this; however, for tiling to work, there must be several individual reads with significant overlaps in order to line them up with any confidence. These overlap regions therefore necessarily have each nucleotide read more than once (Figure 1).

In our simplified example in Figure 1, we have an area with a read depth of 4 (the last two nucleotides of Read 1 occur in all four reads); then, going both left and right from this, areas with a read depth of 3; flanking those, areas with a depth of 2; and finally the ends, which have a depth of only 1. Notice that because these ends have a read depth of only 1, we would have no way to “attach” (tile) them into a larger genomic context.

Figure 1

In this example, we could either describe the read depth of a single nucleotide, or average the read depth across every nucleotide position, which here works out to an average read depth of 2.28. For an NGS experiment, it’s often meaningful to report both the average read depth, as a measure of the general completeness of the data set, and the specific depth at a single point of interest (such as a diagnostically useful single nucleotide polymorphism, or SNP).
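For readers who like to see the arithmetic, here is a minimal sketch in Python of how per-position depth and average depth can be computed from read positions. The read coordinates are assumed purely for illustration (four equal-length reads, each offset by two bases, following the stepped 1-2-3-4 depth pattern described above); the exact layout in Figure 1 may differ.

```python
from collections import Counter

# Hypothetical reads as (start, end) half-open intervals on a reference;
# coordinates are assumed for illustration, not taken from Figure 1 itself.
reads = [(0, 8), (2, 10), (4, 12), (6, 14)]

depth = Counter()
for start, end in reads:
    for pos in range(start, end):
        depth[pos] += 1          # each read adds 1 to every position it covers

positions = sorted(depth)
print([depth[p] for p in positions])   # [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]
print(round(sum(depth.values()) / len(depth), 2))
# ~2.29 here -- essentially the ~2.28 average described in the text
# (the exact value depends on the read layout chosen)
```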

Since the library fragmentation process generating the reads is random in nature, we actually have to generate a large number of fragments to be confident that all areas will be represented by overlaps and can be tiled onto their flanking regions; thus, average read depths need to be quite high to accurately reassemble long contiguous sequences. (For the sake of brevity, we’re ignoring both telomeric regions and long, highly repetitive regions such as those around chromosome centromeres; these are obviously special cases and in fact pose particular challenges for NGS protocols. Conveniently for the clinical NGS user, though, these regions are rarely of interest.)
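One common back-of-the-envelope way to see why average depths must be fairly high is the classic Lander-Waterman (Poisson) approximation, in which reads are assumed to land uniformly at random on the target. The short sketch below estimates what fraction of positions would receive no reads at all at a given average depth; real genomes, with their repeats and coverage biases, behave somewhat worse than this idealization.

```python
import math

def fraction_below_depth(mean_depth, min_depth=1):
    """Expected fraction of positions covered fewer than min_depth times,
    assuming reads land uniformly at random (Poisson-distributed depth)."""
    return sum(math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
               for k in range(min_depth))

for c in (1, 5, 10, 20):
    print(f"average depth {c:2d}: ~{fraction_below_depth(c):.1e} of positions left completely unread")
```

Even at an average depth of 1, roughly a third of the target would go unread under this model, which is why averages well above 1 are needed before contiguous assembly becomes reliable.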

A second important reason for relatively high average read depths is to ensure the accuracy of the final sequence. During each of the massively parallel short sequencing reactions, errors at individual base positions are possible and in fact occur at a finite rate. If you look closely at Figure 1, note that one position (in lowercase) in Read 2 doesn’t match Reads 1 and 3. The tiling process uses the best alignment of individual reads, even allowing for a small frequency of mismatches, and the most common or “consensus” sequence of the tiled reads is taken to be the correct sequence. The higher the read depth, the greater the statistical strength of this consensus sequence, as correct reads at any individual position increasingly outnumber individual read errors.
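The “outvoting” of read errors can be illustrated with a toy majority-vote consensus, sketched below. The aligned reads and the single erroneous base in Read 2 are hypothetical stand-ins for Figure 1; production pipelines weight base qualities and use statistical callers rather than a bare vote.

```python
from collections import Counter

aligned_reads = [             # hypothetical reads, padded with '-' where they do not cover
    "ACGTACGT------",
    "--GTCCGTAC----",         # the 'C' at column 4 is a read error (reference has 'A')
    "----ACGTACGT--",
    "------GTACGTAC",
]

consensus = []
for column in zip(*aligned_reads):            # walk the alignment column by column
    calls = Counter(base for base in column if base != "-")
    consensus.append(calls.most_common(1)[0][0])  # take the most frequent base call

print("".join(consensus))   # ACGTACGTACGTAC -- the error is outvoted 2-to-1 at column 4
```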

What sort of read depth is needed for an NGS experiment? This depends on the purpose of the experiment and the type of sample used, but as a very rough generalization, an average read depth of about 20 is considered adequate for human genomes. Note that this is the average read depth, and that at this depth some regions of the genome will be underrepresented while others will be read in greater depth. Part of the bioinformatics data associated with the tiled consensus sequence is a quality score at each nucleotide position, which gives the end user a sense of the statistical certainty of each base call.
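As an aside, this per-position certainty is most often expressed on the Phred scale, where the quality score Q is minus ten times the base-10 logarithm of the estimated probability that the call is wrong. A two-line converter makes the relationship concrete:

```python
import math

def phred(p_error):
    """Phred-scaled quality score from an error probability."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

print(round(phred(0.001)))     # Q30: an estimated 1-in-1,000 chance the call is wrong
print(error_probability(20))   # Q20 corresponds to a 0.01 (1-in-100) error probability
```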

The above has considered the case in which the input material is genetically uniform. Let’s also examine the case in which the sample of interest has some regions of genetic heterogeneity, such as a mixture of normal and tumor cells. In this case, simplistically speaking, the reads will contain two common nucleotide calls at the heterogeneous position, in a ratio closely approximating the ratio of the two cell types (normal and cancerous) in the sample used to generate the sequencing library. That is, if the sample was 80% normal cells and 20% tumor, about 80% of the reads would show the non-mutant nucleotide and about 20% the mutant. I write “about” because each of these could still contain random read errors as well.

The bioinformatics process is able to detect and flag genetic positions like this, where a high enough frequency of a second base call at a position suggests an actual mixture of genotypes rather than just random sequencing errors. In addition to single nucleotide polymorphisms, this process can detect larger insertions and deletions of genomic segments (copy number variations, or CNVs), essentially by observing that some regions generate many fewer or many more individual reads than the bulk of the rest of the genome. In fact, for RNA-based starting libraries (RNA-Seq applications), the number of reads that can be mapped back to individual transcripts can be calculated and compared as a way to measure differential expression of the underlying genes, and even of their different isoforms; this is one of the most powerful aspects of this particular NGS tool.
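To make the “high enough frequency” idea concrete, the sketch below asks how likely it would be to see a given number of alternate base calls purely from random read errors, using a simple binomial model and an assumed 1% per-base error rate. The numbers are illustrative only; real variant callers use considerably more sophisticated error models.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more events in n trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

per_base_error = 0.01   # assumed per-read error rate at this position (illustrative)
depth = 100             # reads covering the position
alt_reads = 20          # reads showing the alternate base (a 20% mixture, as above)

p_by_chance = prob_at_least(alt_reads, depth, per_base_error)
print(f"P(>= {alt_reads} alternate calls from error alone) ~ {p_by_chance:.1e}")
# A vanishingly small probability argues for a genuine mixture rather than noise.
```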

Taking the depth idea still further leads to the concept (and term) of “deep sequencing,” in which a sample (or a genetic region of a sample) is intentionally read to much higher read depths than the 20 or so mentioned earlier. By doing this, it is possible to detect quite rare mutations in a nearly homogeneous population and still differentiate these low-abundance sequence variations from random read errors. Deep (and “ultra-deep”) sequencing can reach depths of well over 100 reads per nucleotide, although the boundary between these terms and “normal” read depths is somewhat ambiguous, and they are sometimes applied even below a threshold of 30 reads per nucleotide.

Our second term is coverage, a concept closely connected with depth. In fact, it’s sometimes used in exactly the same way I have described depth above, as a short form for “depth of coverage.” It is, however, also used to mean “breadth of coverage”: a measure of what proportion of the total intended genome is represented in the data set to at least some depth. In effect, this is the length of the tiled output consensus sequence divided by the expected size (length) of the starting library, or more formally, “assembly size”/“target size.” Used in this way, depth and coverage are, in effect, competing aspects of how the total amount of data in an NGS experiment is distributed. For a fixed number of total reads, there can either be more reads over the same areas (leading to more depth) or fewer reads per area distributed across a wider range of the input material (leading to more coverage). When designing an NGS experiment to answer a clinical question or questions, understanding this helps in tailoring the design and bioinformatics tools to strike the most meaningful balance between the two.
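A quick arithmetic sketch makes the trade-off tangible: for a fixed total amount of sequence data, the achievable average depth is roughly the total bases sequenced divided by the size of the targeted region. The run output and target sizes below are assumed, round numbers chosen purely for illustration, not specifications of any particular platform or panel.

```python
# Assumed, round illustration numbers -- not specifications of any platform.
total_bases = 120e9      # e.g., roughly 120 Gb of sequence from one run
targets = {
    "whole human genome (~3.2 Gb)":   3.2e9,
    "exome-scale panel (~50 Mb)":     50e6,
    "small hotspot panel (~0.5 Mb)":  0.5e6,
}

for name, size in targets.items():
    print(f"{name}: ~{total_bases / size:,.0f}x average depth")
```

The same sequencing budget spread over a whole genome yields only modest depth, while concentrating it on a small targeted panel yields the very high depths needed for detecting rare variants.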

Hopefully, the foregoing has given the reader a basic appreciation for these two key terms, and for the complexity of the underlying bioinformatics involved in tailoring an NGS application to a particular use. This complexity makes it crucially important to have good bioinformatics support at the application design stage, where decisions on platforms, number of reads, the desired ability to detect heterogeneities at a given frequency, and the statistical strength required of the end results all play a role in developing an effective test strategy. With the above terminology in hand, the reader will be better equipped to communicate effectively with NGS specialist scientists in applying this powerful tool for research and for personalized, genomics-based medicine.

John Brunstein, PhD, a member of the MLO Editorial Advisory Board, is President and CSO of British Columbia-based PathoID, Inc.