posted on 2025-10-30, 18:26 authored by Yaguang Zheng, Yulin Song, Eduardo Iturrate, Bei Wu, Susan Zweig, Stephen B. Johnson
<p dir="ltr">Objective: Continuous glucose monitoring (CGM) is essential in diabetes care and research; however, extracting key data (e.g., time above, in, or below range) from CGM reports is manual, time-consuming, and inefficient. Natural language processing (NLP) can extract data from unstructured sources (e.g., images), but its application to CGM reports remains unexplored. We aimed to evaluate the accuracy of extracting CGM data using NLP.</p><p dir="ltr">Research Design and Methods: We analyzed CGM reports stored as PDF files in the electronic health record at New York University Langone Health. Our algorithm pipeline consisted of four steps: 1) performing optical character recognition (OCR) to obtain glucose matrix data from CGM reports; 2) determining the type of CGM document based on keywords in the OCR results; 3) extracting glucose variables based on CGM document type; and 4) storing the extracted glucose data in a structured database. Two experts with experience in CGM research and clinical practice independently conducted a manual review of 1% of the documents (n=226). We calculated accuracy (correct extraction of CGM data) by comparing the algorithm’s results with the manual review.</p><p dir="ltr">Results: Of the documents analyzed, 36.8% were Freestyle Libre and 63.2% were Dexcom. For information extraction, the agreement between the two experts in evaluating Libre results was 99.93%. Compared with manual review, the algorithm's accuracy was 99.87% for Libre and 100.00% for Dexcom.</p><p dir="ltr">Conclusion: Using an NLP approach to extract valuable glucose data from CGM PDF files is feasible and accurate, and it can benefit clinical practice and diabetes research.</p>
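Steps 2 to 4 of the pipeline described in the abstract, plus the accuracy comparison, could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the OCR text from step 1 is already available as a string, and the keyword lists, regular expressions, field names, table schema, and all function names are hypothetical. Real Libre and Dexcom layouts differ, so a working version would likely need separate patterns per document type; a single shared set is used here for brevity.

```python
import re
import sqlite3

# Hypothetical keyword lists: the abstract does not say which keywords were used.
DOC_TYPE_KEYWORDS = {
    "libre": ("freestyle", "libre"),
    "dexcom": ("dexcom", "clarity"),
}

# Hypothetical patterns for percent time above, in, and below range.
# One shared set is an assumption; vendor layouts would likely need their own.
FIELD_PATTERNS = {
    "time_above_range": re.compile(r"above\s+range\D*?(\d+(?:\.\d+)?)\s*%", re.I),
    "time_in_range": re.compile(r"\bin\s+range\D*?(\d+(?:\.\d+)?)\s*%", re.I),
    "time_below_range": re.compile(r"below\s+range\D*?(\d+(?:\.\d+)?)\s*%", re.I),
}


def classify_document(ocr_text):
    """Step 2: pick a document type from keywords found in the OCR text."""
    lowered = ocr_text.lower()
    for doc_type, keywords in DOC_TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return "unknown"


def extract_fields(ocr_text):
    """Step 3: pull each percentage out of the text, or None if it is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = float(match.group(1)) if match else None
    return fields


def store_fields(conn, doc_id, doc_type, fields):
    """Step 4: write one row per document into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cgm_report ("
        "doc_id TEXT PRIMARY KEY, doc_type TEXT, "
        "time_above_range REAL, time_in_range REAL, time_below_range REAL)"
    )
    conn.execute(
        "INSERT INTO cgm_report VALUES (?, ?, ?, ?, ?)",
        (doc_id, doc_type, fields["time_above_range"],
         fields["time_in_range"], fields["time_below_range"]),
    )
    conn.commit()


def field_accuracy(extracted, reference):
    """Share of manually reviewed fields whose extracted value matches."""
    correct = sum(1 for name in reference if extracted.get(name) == reference[name])
    return correct / len(reference)


if __name__ == "__main__":
    # Made-up OCR text for illustration only; not taken from a real report.
    sample = ("FreeStyle Libre Report\nTime Above Range: 25%\n"
              "Time In Range: 70%\nTime Below Range: 5%")
    conn = sqlite3.connect(":memory:")
    doc_type = classify_document(sample)
    fields = extract_fields(sample)
    store_fields(conn, "doc-1", doc_type, fields)
    print(doc_type, fields)
```

Keeping classification separate from extraction mirrors the order of the steps in the abstract and would let vendor-specific patterns be swapped in without touching the storage step. The accuracy helper compares field by field against a manually reviewed reference, which is one plausible reading of "correct extraction of CGM data"; the abstract does not specify the unit of comparison.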
Funding
This study was supported by the New York Regional Center for Diabetes Translation Research (NY-CDTR) Pilot and Feasibility (P&F) Program Funding (P30DK111022-08).