Background
Despite well-defined criteria for the radiographic diagnosis of atypical femur fractures (AFFs) [1], misdiagnosis is common. AFF diagnostic software could enable timely detection of AFFs, improving their management and helping to prevent progression of incomplete or contralateral AFFs.
Objective
To develop a semi-supervised artificial intelligence (AI)-based application that uses deep learning models (DLMs) to diagnose AFFs from femur X-rays.
Methods
Pre-operative anteroposterior X-ray images of complete AFFs (cAFF), incomplete AFFs (iAFF), typical femoral shaft fractures (TFF), and non-fractured femurs (NFF) were used. AFFs were defined as per the 2014 ASBMR case definition [1]. Fractures were labelled using bounding boxes in Conda. All images were used to train and test the model with 5-fold cross-validation. Convolutional neural networks (CNNs) were trained to identify AFF diagnostic features. The DLMs were built on a ResNet backbone pretrained on the ImageNet dataset, combined with the proposed Box Attention Guide (BAG) module. The model’s attention beta was visualised. Precision (the proportion of predictions for a category that were correct), recall (the proportion of images in a category that were correctly identified), and F1 score (the harmonic mean of precision and recall, summarising overall prediction performance) were measured.
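The three metrics can be illustrated with a minimal sketch that computes per-category precision, recall, and F1 score from true and predicted labels. The function name `per_class_metrics` and the toy labels below are hypothetical and purely illustrative; they are not the study's code or data, and the sketch assumes each image receives exactly one predicted category.

```python
from collections import Counter

CLASSES = ["cAFF", "iAFF", "TFF", "NFF"]


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return {category: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this category
        else:
            fp[pred] += 1    # predicted category was wrong
            fn[truth] += 1   # true category was missed
    metrics = {}
    for c in classes:
        predicted = tp[c] + fp[c]   # images assigned to category c
        actual = tp[c] + fn[c]      # images that truly belong to category c
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
        metrics[c] = (precision, recall, f1)
    return metrics


# Toy labels (hypothetical, not study data).
y_true = ["cAFF", "cAFF", "iAFF", "iAFF", "TFF", "NFF", "NFF", "NFF"]
y_pred = ["cAFF", "iAFF", "iAFF", "iAFF", "TFF", "NFF", "NFF", "TFF"]

metrics = per_class_metrics(y_true, y_pred)
for name, (p, r, f) in metrics.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy example one cAFF image is predicted as iAFF, which lowers cAFF recall and iAFF precision while leaving cAFF precision and iAFF recall untouched.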
Results
The dataset included 2015 radiographs from 1014 patients. The numbers of cAFF, iAFF, TFF, and NFF radiograph labels were 213, 49, 394, and 1359, respectively. The model achieved high precision, recall, and F1 score for classifying cAFF X-rays (96%, 94%, and 95%, respectively), while iAFFs were detected with 86% precision, 82% recall, and an F1 score of 83%. High precision, recall, and F1 score were also achieved for classifying TFFs (96%, 97%, and 97%, respectively) and NFFs (99%, 99%, and 99%, respectively).
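To give a sense of how the reported label counts spread across 5 folds, the sketch below divides each category as evenly as possible. It assumes the folds were stratified by category, which the abstract does not state; `stratified_fold_sizes` is a hypothetical helper, not part of the study's pipeline.

```python
# Label counts as reported in the Results.
COUNTS = {"cAFF": 213, "iAFF": 49, "TFF": 394, "NFF": 1359}


def stratified_fold_sizes(counts, k=5):
    """Split each category's count as evenly as possible across k test folds."""
    sizes = {}
    for label, n in counts.items():
        base, extra = divmod(n, k)
        # The first `extra` folds receive one additional image.
        sizes[label] = [base + 1 if i < extra else base for i in range(k)]
    return sizes


fold_sizes = stratified_fold_sizes(COUNTS)
for label, per_fold in fold_sizes.items():
    print(label, per_fold)
# iAFF -> [10, 10, 10, 10, 9]
```

Under this assumption each test fold would hold only about 10 iAFF images, so a single misclassified iAFF shifts that fold's iAFF recall by roughly 10 percentage points.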
Conclusion
A DLM trained on femoral X-rays classified cAFF, TFF, and NFF X-rays with excellent precision and recall. Accurate AI-based AFF diagnostic software has the potential to improve AFF diagnosis, reduce radiologist error, and allow urgent intervention, thus improving patient outcomes. Further research to validate this model in a larger, well-phenotyped dataset is underway.