Sanger sequencing has been a mainstay in clinical labs for years, where it is used to detect genetic variants for a wide variety of diseases and other phenotypes. This high-quality technology has been essential for accurate diagnoses of patients.
Today, though, the use of Sanger sequencing as a clinical standard is increasingly hard to justify. Technology development efforts have focused on other sequencing tools, with the result that Sanger instruments are often far more expensive and lower-throughput than newer options. However, they continue to be the primary sequencers in many labs because of their accuracy, read lengths, and familiarity. In some labs, next-generation sequencing (NGS) platforms have been adopted as an alternative, but even in those cases Sanger remains an important tool for validating clinically relevant findings.
Another approach is now gaining traction in clinical labs: the use of long-read sequencing, such as single molecule, real-time (SMRT) sequencing, to replace Sanger technology for applications such as amplicon sequencing and validation of variants detected with NGS tools. Like Sanger, SMRT sequencing has very high consensus accuracy rates. Unlike other platforms, though, this kind of sequencing produces extremely long reads. Average read lengths are about 10,000 bases, which can easily span amplicon lengths used in traditional Sanger sequencing panels.
By contrast, short-read NGS technologies typically produce reads in the 200 bp to 300 bp range, necessitating algorithmic processes that stitch shorter amplicon sequences together to generate clinically useful information. Unfortunately, those processes can introduce errors, cause mismapping, and miss important genomic elements such as haplotype phase [1]. Long reads allow scientists to sequence full-length gene alleles and capture important elements such as promoter regions, pseudogenes, and more.
In the clinical lab setting, long-read sequencing has the potential to be used on its own to analyze important regions such as the HLA locus or the CYP2D6 gene; it can also be used as a complement to NGS-based testing, serving as an orthogonal tool for validation of medically actionable findings. For either workflow, long-read sequencing can deliver lower project costs and higher throughput than Sanger sequencing. The technology also affords new opportunities to deliver clinically useful information for repeat expansion disorders and other diseases marked by significant amounts of structural variation.
Amplicon sequencing
Many clinical and translational research labs have developed protocols for amplicon analysis based on long-read sequencing. With this approach, scientists capture the genomic region of interest to generate sequence data that outperforms Sanger results in quality and accuracy [2].
The CYP2D6 gene, which encodes an enzyme responsible for metabolizing 25 percent of commonly used drugs, was an early target for scientists interested in amplicon-based, long-read sequencing. Variants within the gene correspond to how a person will tolerate or metabolize many therapeutics, so analyzing CYP2D6 is a common task in clinical labs. The gene is highly polymorphic, with more than 100 known allelic variations, and has a nearby pseudogene with very high sequence homology. Together, these features have made sequencing CYP2D6 a challenge, while genotyping-based tools test only for the most common alleles.
Scientists at the Icahn School of Medicine at Mount Sinai turned to SMRT sequencing as an alternative, finding that they could produce amplicons covering the gene and its copies as well as the pseudogene [3]. A pilot project and follow-up studies showed that the long-read technology could not only sequence through the whole amplicon repeatedly, but also allowed for allele phasing to provide clinically meaningful information. Long-read sequencing was able to resolve discrepancies in samples that had produced inconclusive results with other sequencing or genotyping techniques. The scientists also discovered novel CYP2D6 alleles and structural variants during this evaluation study, including in samples where other analysis methods had previously missed those features. They found that long-read sequencing data led to the revision of allele assignment for about 20 percent of samples analyzed, indicating that accepted protocols in clinical laboratories may be causing inaccurate calls for a significant proportion of patients.
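The idea behind allele phasing with long reads can be illustrated with a toy sketch: when a single read covers two variant sites, the alleles it carries at both sites must sit on the same chromosome copy. The code below is only an illustration under simplifying assumptions, not the method used in the study. It assumes reads are given as hypothetical `(start, sequence)` pairs aligned to the reference without gaps, and the function name `phase_two_sites` and the toy haplotypes are invented for this example.

```python
from collections import Counter

def phase_two_sites(reads, pos_a, pos_b):
    """Count which allele pairs are observed together on single reads.

    reads: iterable of (start, sequence) tuples, assumed to be ungapped
           alignments in reference coordinates (a simplifying assumption).
    pos_a, pos_b: 0-based reference positions of two variant sites.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        # Only reads that span BOTH sites carry phase information.
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs

# Toy data: two haplotypes that differ at positions 2 and 9.
hap1 = "ACATACGTATGT"  # A at position 2, T at position 9
hap2 = "ACGTACGTACGT"  # G at position 2, C at position 9

reads = [
    (0, hap1), (0, hap2), (1, hap1[1:]), (0, hap2),  # long reads span both sites
    (0, hap1[:5]),                                   # short read: site 2 only
    (7, hap2[7:]),                                   # short read: site 9 only
]

pairs = phase_two_sites(reads, 2, 9)
for (allele_a, allele_b), count in sorted(pairs.items()):
    print(f"{allele_a} at site 2 travels with {allele_b} at site 9 on {count} read(s)")
```

In this toy data, only the reads long enough to span both sites contribute; the two short reads each see one site and say nothing about which alleles travel together.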
Long-read sequencing has also been evaluated for use in HLA typing labs. Like CYP2D6, the HLA genes are complex and extensive; there are more than 13,000 alleles catalogued for the six HLA genes. At the Anthony Nolan Research Institute, scientists conducted a feasibility study to determine whether long-read sequencing could more successfully represent this group of genes, which are essential for matching donors and recipients for successful organ transplantation. In the project, they pooled full-length HLA class I genes from seven samples, sequencing each to at least 150-fold coverage, producing a mean quality value of 70 or better, and fully phasing alleles [4]. Results were concordant with previous analyses of the samples, but the SMRT sequencing data also identified novel alleles. The process took three working days, less time than existing HLA typing methods.
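To put a quality value of 70 in perspective, the short sketch below converts between quality values and per-base error probabilities. It assumes the reported value is Phred-scaled (QV = -10 log10 of the error probability), which is the usual convention for sequencing quality scores, although the article does not state the scale explicitly. The function names are illustrative.

```python
import math

def phred_to_error_probability(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)

def error_probability_to_phred(p: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(p)

for qv in (20, 30, 70):
    p = phred_to_error_probability(qv)
    print(f"QV{qv}: about 1 expected error in {1 / p:,.0f} bases")
```

Under that assumption, QV 70 corresponds to roughly one expected error in ten million consensus bases, compared with one in a thousand at QV 30.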
Structural variation
Long-read sequencing has been shown to detect far more structural variation than NGS tools because its reads can span even large genomic elements. That makes it a good fit for analyzing repeat expansion disorders such as fragile X syndrome, Huntington’s disease, and many ataxias.
At the University of California, Davis, School of Medicine, scientists have demonstrated that long-read sequencing can be used to sequence through the FMR1 gene, which harbors the repeat expansion responsible for fragile X syndrome. Prior to this work, no sequencing technology had been able to completely characterize this region, which is marked by repeating CGG sequences, with as many as 750 repeats in individuals with the disorder. Scientists not only produced the first full sequence of this region, but they have also been able to show that an accurate count of repeats is important for diagnosing an individual with fragile X or with other disorders that have fewer CGG repeats, typically between 55 and 200 copies [5]. Research has shown that being able to identify two AGG interruptions among the CGG repeats is important for determining a woman’s likelihood of having children with fragile X syndrome. With long-read sequencing, it is now possible to fully sequence this region with accuracy high enough to pinpoint the two A changes in a sea of CGGs, generating information with tremendous clinical value.
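Once a read spans the whole repeat, counting CGG units and locating AGG interruptions becomes a simple walk through the sequence. The sketch below illustrates the idea on a made-up tract; it is not the analysis pipeline from the study. It assumes the input has already been trimmed to the repeat tract, is in frame, and is read in the CGG orientation, and the function name `summarize_cgg_tract` is invented for this example.

```python
def summarize_cgg_tract(seq: str) -> dict:
    """Count CGG units and locate AGG interruptions in a repeat tract.

    Assumes seq is already trimmed to the tract and starts in frame,
    so it can be read three bases at a time.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial triplet
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]

    cgg_units = sum(1 for t in triplets if t == "CGG")
    # 1-based triplet positions of AGG interruptions
    agg_positions = [i + 1 for i, t in enumerate(triplets) if t == "AGG"]
    other_units = len(triplets) - cgg_units - len(agg_positions)

    return {
        "total_units": len(triplets),
        "cgg_units": cgg_units,
        "agg_positions": agg_positions,
        "other_units": other_units,  # anything that is neither CGG nor AGG
    }

# Made-up tract: 30 units with AGG interruptions at units 10 and 20.
tract = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10
summary = summarize_cgg_tract(tract)
print(summary)
```

A real analysis would also need to find the tract boundaries within each read and reconcile counts across reads, which this sketch leaves out.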
Looking ahead
These studies suggest that long-read sequencing technology will serve as an effective, affordable replacement for Sanger sequencing in NGS variant validation, amplicon sequencing, structural variant detection, and more. The findings also indicate that SMRT sequencing will reveal more information about clinically important genes, genetic elements, and other regions than can be detected with existing lab tools, improving the accuracy and comprehensiveness of data reported back to physicians.
REFERENCES
1. Ashley EA. Towards precision medicine. Nature Reviews Genetics. 2016;17(9):507-522. DOI:10.1038/nrg.2016.86.
2. Cavelier L, Ameur A, Haggqvist S, et al. Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing. BMC Cancer. 2015;15:45. DOI:10.1186/s12885-015-1046-y.
3. Qiao W, Yang Y, Sebra R, et al. Long-read single-molecule real-time (SMRT) full gene sequencing of cytochrome P450-2D6 (CYP2D6). Human Mutation. 2016;37(3):315-323.
4. Mayor N, Robinson J, Alasdair JM, et al. HLA typing for the next generation. PLoS One. 2015. DOI:10.1371/journal.pone.0127153.
5. Loomis E, Eid J, Peluso P, et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Research. 2012. DOI:10.1101/gr.141705.112.
Kathryn Keho is a senior director at PacBio, provider of the Sequel System for long-read single molecule sequencing.