Multicenter, Head-to-Head, Real-World Validation Study of Seven Automated Artificial Intelligence Diabetic Retinopathy Screening Systems
Research Design and Methods: This was a multicenter, non-interventional device validation study evaluating a total of 311,604 retinal images from 23,724 veterans who presented for teleretinal DR screening at the Veterans Affairs (VA) Puget Sound Health Care System (HCS) or the Atlanta VA HCS from 2006 to 2018. Five companies provided seven algorithms, including one with FDA approval, each of which independently analyzed all scans regardless of image quality. The sensitivity and specificity of each algorithm in classifying images as referable DR or non-referable were compared against the original VA teleretinal grades and a regraded, arbitrated dataset. Value per encounter was estimated.
Results: Although high negative predictive values (82.72%-93.69%) were observed, sensitivities varied widely (50.98%-85.90%). Most algorithms performed no better than humans against the arbitrated dataset, but two achieved higher sensitivities and one yielded comparable sensitivity (80.47%, p = 0.441) and specificity (81.28%, p = 0.195). Notably, one had lower sensitivity (74.42%) for proliferative DR (p = 9.77 × 10^-4) than the VA teleretinal graders. Value per encounter ranged from $15.14 to $18.06 for ophthalmologists and from $7.74 to $9.24 for optometrists.
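The sensitivity, specificity, and negative predictive values reported above follow from standard binary confusion-matrix definitions. A minimal sketch of those calculations; the counts below are illustrative placeholders, not data from this study:

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard binary screening metrics from confusion-matrix counts.

    tp: referable DR cases the algorithm flagged (true positives)
    fp: non-referable cases the algorithm flagged (false positives)
    tn: non-referable cases correctly passed (true negatives)
    fn: referable DR cases the algorithm missed (false negatives)
    """
    return {
        "sensitivity": tp / (tp + fn),  # proportion of referable DR detected
        "specificity": tn / (tn + fp),  # proportion of non-referable correctly passed
        "npv": tn / (tn + fn),          # negative predictive value
        "ppv": tp / (tp + fp),          # positive predictive value
    }

# Hypothetical counts for illustration only
metrics = screening_metrics(tp=805, fp=374, tn=1626, fn=195)
print({k: round(v, 4) for k, v in metrics.items()})
```

Note that NPV and PPV, unlike sensitivity and specificity, depend on disease prevalence in the screened population, which is one reason real-world validation on representative cohorts matters.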
Conclusions: The DR screening algorithms differed significantly in performance. These results argue for rigorous testing of all such algorithms on real-world data before clinical implementation.