The effectiveness of cancer treatment is most commonly determined using radiology reports that incorporate bioimaging results from computed tomography (CT), X-ray imaging or magnetic resonance imaging (MRI) scans. For decades, these techniques have not only shed light on the effectiveness of treatment but also guided clinical decision-making for patient care and management.
The volume of radiological scans and their accompanying reports has grown steadily, offering considerable potential value and insight for medical research. However, much of this clinical data remains in free text, limiting the speed at which real-world insights can be discovered to improve patient care and advance research.
Now, advances in AI technology are providing new avenues to streamline image analysis and results reporting, promising not only to ease the workload of oncologists but also to deliver more accurate and objective analysis for patients.
Automated analysis of bioimaging results has been pursued in recent years. Early efforts in this area utilised convolutional neural networks and natural language processing (NLP) models to assess disease response from radiology reports and to infer progression-free survival, a clinically relevant endpoint: the length of time, during and after treatment, that a patient survives without the disease progressing further.
Working to advance the field further and to evaluate the relative performance of newer large language models against previous cancer disease response classification methods, a team of researchers led by Professor Ng Hwee Tou assessed multiple models, using model accuracy as the performance metric. Their work was published in the Journal of the American Medical Informatics Association in July 2023.
The methodology: evaluating large language models against earlier cancer disease response classification methods, validating the GatorTron transformer model, and improving model performance with data augmentation.
In this work under the National Medical Research Council Industry Alignment Fund Pre-Positioning Fund (IAF-PP) CaLiBRe project, more than 10,000 deidentified radiology reports were retrieved for patients seen at the National Cancer Centre Singapore (NCCS) with colorectal, lung, breast, or gynaecological cancers of all stages. The team randomly divided the radiology reports into training (80%), development (10%), and test (10%) sets while preserving the representation of all four tumour types. The conclusion section of each radiology report was used as the input, and the disease response as the target output, as sketched below.
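As an illustration, a stratified 80/10/10 split like the one described can be produced with scikit-learn. The field names below (conclusion text, response label, tumour type) are hypothetical placeholders, not the study's actual data schema.

```python
# A minimal sketch of an 80/10/10 split that preserves tumour-type
# proportions, assuming parallel lists of conclusions, response labels,
# and tumour types. Field names here are hypothetical.
from sklearn.model_selection import train_test_split

def split_reports(conclusions, responses, tumour_types, seed=42):
    """Split reports 80/10/10, stratified by tumour type."""
    # First carve off the 80% training set.
    X_train, X_rest, y_train, y_rest, t_train, t_rest = train_test_split(
        conclusions, responses, tumour_types,
        train_size=0.8, stratify=tumour_types, random_state=seed)
    # Split the remaining 20% in half: 10% development, 10% test.
    X_dev, X_test, y_dev, y_test, _, _ = train_test_split(
        X_rest, y_rest, t_rest,
        train_size=0.5, stratify=t_rest, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```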
Out of the several models tested, the team found that the GatorTron transformer achieved the highest accuracy.
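GatorTron is a transformer encoder pre-trained on clinical text. As a rough illustration (not the authors' released code), the sketch below fine-tunes a publicly available GatorTron checkpoint for sequence classification with Hugging Face Transformers; the checkpoint name, the four-class label set, and the toy data are all assumptions.

```python
# A minimal sketch of fine-tuning a transformer encoder to classify
# disease response from report conclusions. Checkpoint, labels, and
# data are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

CKPT = "UFNLP/gatortron-base"  # an assumed public GatorTron checkpoint
LABELS = ["complete response", "partial response",
          "stable disease", "progressive disease"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(
    CKPT, num_labels=len(LABELS))

# Toy stand-ins for the report conclusions and their labels.
train_ds = Dataset.from_dict({
    "conclusion": ["Target lesions have decreased in size.",
                   "New hepatic metastases are present."],
    "label": [1, 3],
})

def encode(batch):
    # The conclusion section of each report is the model input.
    return tokenizer(batch["conclusion"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds.map(encode, batched=True),
)
trainer.train()
```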
Response Evaluation Criteria in Solid Tumors (RECIST) is a set of standardised guidelines commonly used to assess tumour response in cancer patients undergoing treatment, particularly in clinical trials and research settings. The team further tested the GatorTron model on a set of RECIST-based datasets for validation, where it achieved an accuracy of 89.19%. This suggests that the model performed well in classifying disease response according to RECIST criteria and points to its potential in clinical trials.

The team also explored the effects of data augmentation techniques on model performance. By permuting the sentences of existing reports, they generated synthetic radiology reports that enlarged the training set, further improving the accuracy of the GatorTron model to 89.76%. Data augmentation was also observed to improve the accuracy of all the models tested.
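To make the augmentation step concrete, here is a minimal sketch of sentence-permutation augmentation. It assumes sentences can be split on full stops, which is naive; the paper's exact procedure may differ. Each permuted copy keeps the disease response label of its source report.

```python
# A minimal sketch of sentence-permutation data augmentation: each
# synthetic report reorders the sentences of a real conclusion while
# keeping its disease response label. Sentence splitting here is
# naive (period-based).
import random

def permute_sentences(conclusion, n_copies=2, seed=0):
    rng = random.Random(seed)
    sentences = [s.strip() for s in conclusion.split(".") if s.strip()]
    synthetic = []
    for _ in range(n_copies):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        synthetic.append(". ".join(shuffled) + ".")
    return synthetic

report = ("No new lesions identified. The liver metastasis has "
          "decreased in size. Findings are consistent with partial response.")
for aug in permute_sentences(report):
    print(aug)  # each permutation keeps the original label
```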
Finally, the team explored prompt-based fine-tuning, a technique used in NLP to tailor a pre-trained language model such as GatorTron to a specific task or domain by providing task-specific prompts during fine-tuning. In these experiments, prompt-based fine-tuning with GatorTron achieved a relatively good accuracy of 86.09% after training on only 500 reports, whereas GatorTron without prompt-based fine-tuning reached an accuracy of only 54.64% after training on the same 500 reports.
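One common way to realise prompt-based fine-tuning with an encoder model such as GatorTron is to recast classification as a cloze task: append a template containing a [MASK] slot and compare only the logits of a few "verbalizer" words. The sketch below shows this recasting at inference time; fine-tuning would minimise cross-entropy over the same verbalizer logits. The template wording and verbalizer words are illustrative assumptions, not the paper's prompts.

```python
# A minimal sketch of cloze-style prompt-based classification with a
# masked language model. Checkpoint, template, and verbalizer words
# are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "UFNLP/gatortron-base"  # an assumed public GatorTron checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForMaskedLM.from_pretrained(CKPT)

# Verbalizer: one word per class, assumed to be single tokens in the
# model's vocabulary.
VERBALIZER = ["response", "stable", "progression"]
label_ids = tokenizer.convert_tokens_to_ids(VERBALIZER)

def classify(conclusion):
    # Recast classification as filling a [MASK] slot, so the
    # pre-trained masked-language-model head is reused directly.
    prompt = f"{conclusion} Overall disease status: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()
    logits = model(**inputs).logits[0, mask_pos.item()]
    # Compare only the verbalizer words' logits.
    probs = torch.softmax(logits[label_ids], dim=-1)
    return VERBALIZER[int(probs.argmax())]

print(classify("Previously noted lung nodules have increased in size."))
```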
In conclusion, Prof Ng's research demonstrated the feasibility of using deep learning-based NLP models to infer cancer disease response from radiology reports at scale. Transformer-based large language models consistently outperformed other methods, with GatorTron the best-performing model for this NLP task. Data augmentation improved performance, while prompt-based fine-tuning was shown to significantly reduce the amount of labelled training data required.

Medical researchers currently use these models to derive time to cancer disease progression in analyses of large cancer datasets, speeding up the building of cancer registries that can uncover new observations, such as the effectiveness of cancer drugs in various real-world patient subgroups, and generate novel hypotheses for subsequent research. With further research, it is hoped that such models may one day also serve as a clinical decision support tool for clinicians, providing an automated second opinion on disease response and flagging reports showing disease progression to ordering physicians for earlier attention and review.
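As a simple illustration of how per-report predictions could be rolled up into a time-to-progression estimate for one patient, consider the sketch below; the record layout and label strings are hypothetical.

```python
# A minimal sketch of deriving time to progression from a series of
# model-classified reports for one patient. Record layout and label
# strings are hypothetical.
from datetime import date

def time_to_progression(reports, start):
    """reports: list of (scan_date, predicted_response) pairs.
    Returns days from treatment start to the first report classified
    as progressive disease, or None if no progression was observed
    (a censored patient)."""
    for scan_date, response in sorted(reports):
        if response == "progressive disease":
            return (scan_date - start).days
    return None

reports = [(date(2022, 3, 1), "partial response"),
           (date(2022, 6, 2), "stable disease"),
           (date(2022, 9, 5), "progressive disease")]
print(time_to_progression(reports, date(2022, 1, 10)))  # 238
```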
References
Tan, R. S. Y. C., Lin, Q., Low, G. H., Lin, R., Goh, T. C., Chang, C. C. E., ... & Ng, H. T. (2023). Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. Journal of the American Medical Informatics Association, 30(10), 1657-1664.