AI can decode medical records with near-human accuracy
Artificial intelligence is edging closer to a new role in medicine, acting as an intelligent interpreter of patient records. A new international study led by Jonathan Shapiro and colleagues has demonstrated that ChatGPT-4o, OpenAI's latest large language model (LLM), can reliably extract psoriasis treatment information from unstructured electronic medical records (EMRs) with near-human precision.
The paper, titled "One Step Closer to Conversational Medical Records: ChatGPT Parses Psoriasis Treatments from EMRs" and published in the Journal of Clinical Medicine, showcases how general-purpose AI can manage complex medical data without additional fine-tuning or specialized training.
The study's findings mark a significant advancement in healthcare automation and medical informatics. They suggest that LLMs could soon transform how physicians, hospitals, and researchers process, access, and analyze patient information.
How the study tested ChatGPT's medical competence
The research team analyzed 94 electronic medical records from psoriasis patients treated at Israel's Sheba Medical Center, one of the country's leading healthcare institutions. Each record consisted of free-text notes written in both English and Hebrew, reflecting the linguistic and clinical diversity of real-world EMR data.
ChatGPT-4o was tasked with identifying psoriasis-related treatments while ignoring unrelated medications. This required the model to understand clinical context, differentiate between current and historical prescriptions, and distinguish psoriasis-specific therapies from other drugs that patients may have been prescribed for coexisting conditions.
The researchers assessed ChatGPT's accuracy using standard performance metrics: recall, precision, F1-score, specificity, accuracy, Cohen's Kappa, and area under the curve (AUC), comparing its results to those of trained dermatologists who manually annotated the same EMRs.
The results were striking. ChatGPT achieved a recall of 0.91, precision of 0.96, F1-score of 0.94, specificity and accuracy of 0.99, Cohen's Kappa of 0.93, and AUC of 0.98. These figures show that the model nearly matched human experts in identifying and classifying psoriasis treatments.
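For readers unfamiliar with these metrics, the sketch below shows how they might be computed for a single treatment class, assuming binary per-record labels (1 = treatment present). The toy arrays are illustrative only and are not the study's data.

```python
# Minimal sketch: computing the reported metrics for one treatment class,
# assuming binary labels per record. Toy arrays below are illustrative.
from sklearn.metrics import (
    recall_score, precision_score, f1_score,
    accuracy_score, cohen_kappa_score, roc_auc_score,
    confusion_matrix,
)

human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # dermatologist annotations (ground truth)
model = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # model extractions

tn, fp, fn, tp = confusion_matrix(human, model).ravel()
print("recall:     ", recall_score(human, model))
print("precision:  ", precision_score(human, model))
print("F1-score:   ", f1_score(human, model))
print("specificity:", tn / (tn + fp))             # not built into sklearn directly
print("accuracy:   ", accuracy_score(human, model))
print("Cohen kappa:", cohen_kappa_score(human, model))
print("AUC:        ", roc_auc_score(human, model))  # with hard labels; scores would be richer
```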
At the treatment group level, ChatGPT performed best in recognizing biologics such as Adalimumab, conventional systemic agents such as Methotrexate, and procedures like phototherapy, all achieving perfect F1-scores of 1.00. These results confirm that ChatGPT can correctly interpret both structured and nuanced terminology associated with modern dermatological care.
However, the model performed less consistently in identifying categories with ambiguous or non-specific documentation, such as systemic corticosteroids and antihistamines, where notes often lacked clarity or standard phrasing.
Why this matters for clinical data and AI in medicine
Electronic medical records hold a vast amount of unstructured data: notes written by physicians, nurses, and specialists. These records contain critical information about diagnoses, treatments, and outcomes but remain difficult to analyze systematically due to inconsistent terminology and varying documentation styles.
By successfully parsing psoriasis treatment data, the study's authors show that ChatGPT can help bridge the gap between unstructured clinical language and structured, analyzable data. The model's success hints at how healthcare systems might soon use AI to transform written medical information into dynamic, searchable insights.
This advancement has several practical implications:
- Reducing Administrative Burden: Clinicians spend an increasing share of their time on documentation. Automating data extraction could free valuable hours for patient care.
- Improving Research Efficiency: Structured datasets enable faster, larger-scale medical studies by allowing automatic aggregation of real-world evidence.
- Enhancing Decision Support: With structured information readily available, decision-support systems could provide clinicians with faster access to a patient's treatment history.
- Building Conversational Medical Records: The authors propose that this is a crucial step toward "conversational EMRs," where AI systems can interactively summarize patient data, respond to queries, and generate clinically relevant insights in real time.
Notably, the study highlights that ChatGPT-4o achieved this level of accuracy without fine-tuning on medical datasets, demonstrating the model's zero-shot capability: the ability to perform domain-specific tasks from general training alone, which reduces the need for costly customization.
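To make the zero-shot setup concrete, here is a minimal sketch of what such an extraction call might look like using the OpenAI Python client. The prompt wording, the toy note, and the output schema are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch of a zero-shot extraction call (illustrative assumptions;
# the prompt and the toy note below are not the study's actual protocol).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

note = (
    "Pt with plaque psoriasis, started Humira 40mg EOW last month. "
    "Also on metformin for T2DM. Plans to begin phototherapy if no response."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "Extract psoriasis-related treatments from the clinical note. "
            "Ignore drugs given for other conditions. Return JSON: "
            '{"current": [...], "past": [...], "planned": [...]}'
        )},
        {"role": "user", "content": note},
    ],
    response_format={"type": "json_object"},  # request parseable JSON output
)

treatments = json.loads(response.choices[0].message.content)
print(treatments)  # e.g. {"current": ["Humira"], "past": [], "planned": ["phototherapy"]}
```

Note how the instruction asks the model to separate current, past, and planned treatments; as the study found, this temporal distinction is exactly where such prompts remain most fragile.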
Challenges and limitations: Understanding what AI still misses
Despite its success, the research also exposed limitations that must be addressed before large language models can be safely integrated into clinical workflows.
First, the sample size of 94 records was relatively small, meaning results may not generalize across all specialties or institutions. Larger, more diverse datasets will be needed to validate consistency across medical contexts.
Second, ChatGPT sometimes confused temporal references, mistaking planned treatments for ongoing ones or interpreting past treatments as current. This limitation reflects the difficulty AI systems still face in understanding time-dependent relationships within clinical narratives.
Third, inconsistencies in EMR documentation affected the model's performance. Terms like "topical treatment," "cream," or "ointment" often lacked specific drug names, leading to missed detections. Similarly, differences between brand and generic names (for instance, "Humira" versus "Adalimumab") occasionally caused mismatches.
The authors suggest integrating domain-specific fine-tuning, drug dictionaries, and temporal tagging systems to improve performance in future iterations. They also recommend developing frameworks that allow LLMs to justify their reasoning, which would improve trust and interpretability among clinicians.
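As a rough illustration of the drug-dictionary idea, the sketch below normalizes brand names to generics before comparing extractions against annotations. The mapping entries are illustrative; a production system would draw on a curated vocabulary such as RxNorm.

```python
# Minimal sketch of brand-to-generic normalization via a drug dictionary.
# Entries are illustrative; a real system would use a curated vocabulary
# such as RxNorm rather than a hand-written mapping.
BRAND_TO_GENERIC = {
    "humira": "adalimumab",
    "stelara": "ustekinumab",
    "cosentyx": "secukinumab",
}

def normalize_drug(name: str) -> str:
    """Map a brand name to its generic equivalent, if known."""
    key = name.strip().lower()
    return BRAND_TO_GENERIC.get(key, key)

# "Humira" and "Adalimumab" now resolve to the same canonical form.
assert normalize_drug("Humira") == normalize_drug("Adalimumab") == "adalimumab"
```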
The future: From data extraction to intelligent medical conversations
Besides technical performance, the study points toward a broader transformation in medical documentation. As AI systems like ChatGPT evolve, they could form the foundation of interactive clinical records: databases that not only store information but also engage with clinicians through natural language.
Such systems could, for instance, summarize a patient's treatment history on demand, alert physicians to possible drug interactions, or highlight deviations from treatment guidelines. Over time, this capability could create "living records" that continuously learn from new data while improving accessibility and safety across healthcare ecosystems.
The study also hints at the potential for multilingual EMR processing, as ChatGPT handled English and Hebrew text with minimal errors. This capability could make it easier for international research collaborations and healthcare providers in multilingual environments to analyze and share data seamlessly.
While the findings apply specifically to psoriasis, the researchers emphasize that the same approach could extend to other chronic conditions such as diabetes, rheumatoid arthritis, or cardiovascular disease, and indeed to any field that relies on long-term treatment monitoring and documentation.
First published in: Devdiscourse