AI tool extracts cancer data from pathology reports

Feb. 20, 2020

To better leverage cancer data for research, scientists at the Department of Energy’s Oak Ridge National Laboratory (ORNL) are developing an artificial intelligence-based natural language processing tool to improve information extraction from textual pathology reports.

The project is part of a collaboration between the Department of Energy (DOE) and the National Cancer Institute (NCI) known as the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) that is accelerating research by merging cancer data with advanced data analysis and high-performance computing.

As the second-leading cause of death in the United States, cancer is a public health crisis that afflicts nearly one in two people during their lifetime. Cancer is also an oppressively complex disease. Hundreds of cancer types affecting more than 70 organs have been recorded in the nation’s cancer registries—databases of information about individual cancer cases that provide vital statistics to doctors, researchers and policymakers.

Through digital cancer registries, scientists can identify trends in cancer diagnoses and treatment responses, which in turn can help guide research dollars and public resources. However, like the disease they track, cancer pathology reports are complex. Variations in notation and language must be interpreted by human cancer registrars trained to analyze the reports.

Through its Surveillance, Epidemiology, and End Results (SEER) Program, NCI receives data from cancer registries, such as the Louisiana Tumor Registry, which includes diagnosis and pathology information for individual cases of cancerous tumors.

In a first for cancer pathology reports, the team developed a multitask convolutional neural network, or CNN—a deep learning model that learns to perform tasks, such as identifying keywords in a body of text, by processing language as a two-dimensional numerical dataset.

Words that have a semantic relationship, or that together convey meaning, are positioned close to each other in this space as vectors (values that have magnitude and direction). This textual data is fed into the neural network and filtered through the network's layers according to parameters that find connections within the data. These parameters are then progressively refined as more and more data is processed.
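The idea of treating text as a two-dimensional numerical dataset can be illustrated with a minimal sketch: each word maps to an embedding vector, the report becomes a matrix with one row per word, and a convolutional filter slides over consecutive words to produce features. The vocabulary, embedding size, and filter width below are illustrative assumptions, not details of the ORNL model.

```python
import numpy as np

# Toy vocabulary and 4-dimensional embeddings (random illustrative
# values, not learned vectors from the actual model).
rng = np.random.default_rng(0)
vocab = {"tumor": 0, "left": 1, "lung": 2, "grade": 3, "iii": 4}
embeddings = rng.normal(size=(len(vocab), 4))

def embed(tokens):
    """Map a token list to a 2-D array: one embedding row per word."""
    return np.stack([embeddings[vocab[t]] for t in tokens])

def conv1d(text_matrix, kernel):
    """Slide a filter spanning k consecutive words over the text,
    producing one feature value per window (no padding)."""
    k = kernel.shape[0]
    n = text_matrix.shape[0] - k + 1
    return np.array([np.sum(text_matrix[i:i + k] * kernel)
                     for i in range(n)])

report = ["tumor", "left", "lung", "grade", "iii"]
x = embed(report)                  # shape (5, 4): 5 words, 4-dim vectors
kernel = rng.normal(size=(2, 4))   # filter covering 2-word windows
features = conv1d(x, kernel)       # shape (4,): one value per window
```

In a trained CNN, many such filters run in parallel and their weights are the parameters that get refined as more reports are processed.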

Although some single-task CNN models are already being used to comb through pathology reports, each model can extract only one characteristic from the range of information in the reports. For example, a single-task CNN may be trained to extract just the primary cancer site, outputting the organ where the cancer was detected, such as the lung, prostate or bladder. But extracting information on the histological grade, or growth of cancer cells, would require training a separate deep learning model.

The ORNL research team, by contrast, developed a network that can complete multiple tasks in roughly the same amount of time as a single-task CNN. The team's neural network simultaneously extracts information for five characteristics: primary site (the body organ), laterality (right or left organ, if applicable), behavior, histological type (cell type), and histological grade (how quickly the cancer cells are growing or spreading).
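This "one network, five outputs" design can be sketched as a shared feature extractor feeding five small task-specific output layers, so the expensive shared layers run once per report. The layer sizes and class counts below are hypothetical placeholders, and the dense layers stand in for the model's convolutional encoder; this is a sketch of the shared-plus-heads pattern, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

TASKS = ["site", "laterality", "behavior", "histological_type", "grade"]
N_CLASSES = {"site": 70, "laterality": 2, "behavior": 3,
             "histological_type": 25, "grade": 4}  # illustrative sizes

INPUT_DIM = 300    # stand-in size for an encoded report vector
FEATURE_DIM = 128  # size of the shared document representation

# Hard parameter sharing: one shared weight set is common to all
# tasks, while each task keeps its own small output layer ("head").
shared_W = rng.normal(size=(INPUT_DIM, FEATURE_DIM)) * 0.01
heads = {t: rng.normal(size=(FEATURE_DIM, n)) * 0.01
         for t, n in N_CLASSES.items()}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(doc_vector):
    """One forward pass: the shared layer runs once, then each of the
    five heads produces its own class distribution."""
    h = np.maximum(0, doc_vector @ shared_W)  # shared ReLU features
    return {t: softmax(h @ W) for t, W in heads.items()}

doc = rng.normal(size=INPUT_DIM)  # stand-in for an encoded report
preds = predict(doc)              # five distributions, one per task
```

Because the shared computation dominates the cost, adding a head per characteristic is far cheaper than training and running five separate single-task networks.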

To train and test the multitask CNNs with real health data, the team used ORNL’s secure data environment and over 95,000 pathology reports from the Louisiana Tumor Registry. They compared their CNNs to three other established AI models, including a single-task CNN.

During testing, they found that the hard parameter sharing multitask model outperformed the other four models, including the cross-stitch multitask variant, and increased efficiency by reducing computing time and energy consumption. Compared with the single-task CNN and conventional AI models, the hard parameter sharing multitask CNN completed the challenge in a fraction of the time and most accurately classified each of the five cancer characteristics.
