“Big Data” and the CBC-diff

What is “Big Data,” and how is it affecting the scientific method? In recent years, “Big Data” has become a widely recognized term; however, many people do not fully understand what it means or the impact it is having on science.

Big Data and the scientific method

Fernando Chaves, MD, serves as Director, Global Scientific Affairs, for Beckman Coulter Diagnostics

Human knowledge has evolved through the scientific method, which is a structured way to observe reality and draw conclusions from it. Researchers make an observation, create a hypothesis based on this observation, and design studies to make further observations in a structured way, generating data that can be analyzed statistically to prove or disprove that hypothesis. This method has worked well up until now, but it has some key limitations. It worked because hypotheses were generated from previous observations, which increased the likelihood that any given hypothesis would be proved true in the end. Given the significant amount of time and effort needed to prove or disprove a hypothesis, scientists couldn't simply go “shooting in the dark.” Even if they wanted to do so, they were often restricted because the number of variables that could be included in a study was limited: data collection was often manual, as were the statistical methods available to test each variable.

In recent years, technological evolution has led to a profound change: the amount of data which can be seamlessly collected has increased exponentially, while at the same time new software makes it possible to analyze the data in a structured way at unbelievable speeds. Because of this change, the amount of data that can be collected is no longer a limitation, and the time and effort to analyze such data has significantly decreased. In simpler terms, scientists can now afford to collect vast amounts of data and go searching for patterns and associations they never thought of before. In effect, they can “shoot in the dark,” and instead of using data to test a previously created hypothesis, the process is reversed: it is the data analysis which is actually driving the creation of the hypothesis. This reversal of the scientific method is what defines Big Data.

Big Data is already being used in many industries: retailers use it to stock stores; airlines use it to give travelers a smoother ride and save on fuel; social media companies use it to place advertisements strategically; investment bankers use it to buy and sell stocks. And now Big Data is starting to make its debut in medicine. In this scenario, pathologists and laboratory technologists should be asking the question: why not use Big Data in the laboratory? A vast amount of cellular data is collected every time a complete blood count and differential (CBC-diff) is performed. Applying the Big Data approach to the data collected by hematology analyzers could help identify patterns associated with certain disease states, and thus could enhance the diagnostic value of this inexpensive and readily available test.

Enhancing the diagnostic value of the CBC-diff

There are many reasons why Big Data could significantly enhance the diagnostic value of the CBC-diff. First and foremost, it should work from a clinical and biological perspective. Cellular changes occur in several disease states and should be detectable at the CBC-diff level. Clinicians have been using these changes for several decades. However, their sensitivity and specificity are limited because the same change can occur in more than one disease process, while interfering factors may prevent an expected cellular change from occurring. Today clinicians are responsible for analyzing the CBC-diff results and interpreting them in view of the broader clinical context for each individual patient. Again, this system has worked well, but for a rather limited number of clinical conditions. With the Big Data approach, the sensitivity and specificity of the CBC-diff could increase for the conditions where this test is already valuable, and the CBC-diff could also help point to a number of additional diseases for which it has limited diagnostic value today.

From the workflow perspective, this approach is also ideal. The CBC-diff is the most widely used test in medicine, and is ordered in almost all clinical workups. Therefore, CBC-diff data is available for many patients for whom the correct clinical suspicion has not yet arisen, and for whom specific confirmatory tests have therefore not yet been ordered. This means the CBC-diff is ideally positioned to serve as an initial indicator of the potential diagnosis. In other words, Big Data may give us the opportunity to move the needle, transforming what today are cell counters into potential diagnostic aids that could alert an as yet unsuspecting clinician to the possibility of a certain diagnosis.

Finally, from the cost-effectiveness perspective, the CBC-diff is a widely used and inexpensive test. Big Data algorithms that make better clinical use of CBC-diff data would therefore add very little expense to healthcare systems, which makes them a cost-effective option of the kind the current healthcare economy needs.

Some cautionary words about Big Data

It must be noted, however, that, as is true for any new technology, the use of Big Data in the hematology laboratory is not a silver bullet that can eliminate the need for well-trained professionals. The main value of this approach is that, in theory, it could eventually be implemented as an automated, very low-cost initial screen for the disease states of interest. Having said that, this screen could add diagnostic value only if it were complemented by subsequent confirmatory tests, including microscopic review of the peripheral blood smear. At this point, the expertise of pathologists and laboratorians will continue to play a critical role in confirming the cellular changes suspected by the Big Data approach, and in ordering appropriate confirmatory testing.

Another key limitation of any potential Big Data automated screen is the need to consider the prevalence of the disease to which this approach is applied. For example, researchers at the University of Pennsylvania (Raess et al.) published a study describing a random forest classifier model using multiple CBC-diff parameters for the automated screening for myelodysplasia (MDS) in their laboratory (1).

This model achieved an excellent area under the receiver operating characteristic (ROC) curve (AUC = 0.94), but given the very low prevalence of MDS in the general population, the positive predictive value (PPV) of the model was only 7.3%, meaning that most of the samples it flagged did not come from patients with MDS. Therefore, the costs associated with following up cases identified by the Big Data approach will also play a role in its applicability for any given disease state. Raess et al. calculated the number of additional blood smears that would be required with the implementation of their model and determined that the impact would be minimal. For other disease states, where confirmatory testing would be more expensive and cumbersome, the cost-effectiveness of using an automated Big Data screen could tip in the opposite direction.
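
The link between prevalence and PPV follows directly from Bayes' theorem, and a short calculation may make it more concrete. The sketch below is illustrative only: the sensitivity, specificity, and prevalence values are hypothetical round numbers chosen for the example, not figures reported by Raess et al., and the function names are our own.

```python
"""Illustrative sketch: how disease prevalence drives the PPV of a screen."""


def _check_probability(name: str, value: float) -> None:
    """Reject inputs that are not valid probabilities."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")


def flagged_fraction(sensitivity: float, specificity: float,
                     prevalence: float) -> float:
    """Fraction of all screened samples that the screen flags as positive."""
    _check_probability("sensitivity", sensitivity)
    _check_probability("specificity", specificity)
    _check_probability("prevalence", prevalence)
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives + false_positives


def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """P(disease | flagged), computed with Bayes' theorem."""
    flagged = flagged_fraction(sensitivity, specificity, prevalence)
    if flagged == 0.0:
        raise ValueError("no samples would be flagged; PPV is undefined")
    return sensitivity * prevalence / flagged


if __name__ == "__main__":
    # Hypothetical screen performance (assumed values, not from the study).
    SENSITIVITY = 0.90
    SPECIFICITY = 0.90
    for prevalence in (0.001, 0.01, 0.10):
        ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
        flagged = flagged_fraction(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}, "
              f"about {flagged * 1000:.0f} of 1000 samples flagged")
```

With these assumed inputs and a prevalence of 1%, only about one flagged sample in twelve (roughly 8%) would be a true positive, even though the screen is right 90% of the time in both diseased and healthy patients. The flagged fraction gives a rough sense of the follow-up workload, such as additional smear reviews, that a laboratory would need to absorb.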

Big Data in laboratory medicine: more studies

The Big Data approach has already proved extremely valuable in laboratory medicine. In 2002, Rosenwald et al. (2) used gene expression profiling in lymphoma tissue to describe how a common type of lymphoma (diffuse large B-cell lymphoma) could be divided into two sub-groups with distinct chances of survival after chemotherapy. They used a typical Big Data approach, analyzing the expression of messenger RNA in the tumors using more than 12,000 clones of complementary DNA. Thanks to their work, the biological origin of these tumors was better understood (in Big Data, the data analysis drives the generation of a hypothesis rather than only testing it), and now patients newly diagnosed with diffuse large B-cell lymphoma can be classified in the correct sub-group and their therapies adjusted accordingly to optimize survival and minimize side effects.

The Big Data approach is now also being proposed in clinical medicine for early identification of various syndromes based on ongoing surveillance of patient data from electronic medical records, including clinical data, laboratory test results, hemodynamic information, and patient demographic information, among other variables. Kashiouris et al. performed a systematic review of the literature on the diagnostic value of this approach, and identified 33 studies that met their criteria (3).


The field of laboratory medicine has evolved significantly in recent decades, and today clinicians and patients enjoy a vast arsenal of tests and parameters that can be deployed to help them reach a diagnosis. Given these achievements, most laboratory tests today already have very good diagnostic performance, and our ability to improve this performance by developing new tests is therefore somewhat limited.

For this reason, our biggest opportunity for improving diagnostic outcomes lies not in the search for new tests, parameters, or biomarkers, but in better utilization of data that already exists. This approach also makes perfect sense in terms of the current health-economic needs of healthcare systems, since it allows for better utilization of available resources and information.

The CBC-diff reports more than two dozen parameters for every sample at a low cost, and it is the most widely used test in medicine. It is therefore very well positioned to become the source of many new diagnostic applications developed (with the proper regulatory approvals) through a Big Data approach to the use of these parameters.


  1. Raess PW, van de Geijn GJ, Njo TL, et al. Automated screening for myelodysplastic syndromes through analysis of complete blood count and cell population data parameters. Am J Hematol. 2014;89(4):369-374. doi:10.1002/ajh.23643. Epub 2014 Mar 13.
  2. Rosenwald A, Wright G, Chan WC, et al; Lymphoma/Leukemia Molecular Profiling Project. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346(25):1937-1947.
  3. Kashiouris M, O'Horo JC, Pickering BW, Herasevich V. Diagnostic performance of electronic syndromic surveillance systems in acute care: a systematic review. Appl Clin Inform. 2013;4(2):212-224. doi:10.4338/ACI-2012-12-RA-0053. Print 2013.