Can AI handle cancer care? New research tests limits of LLMs
Can artificial intelligence (AI) tools reliably handle the complexity and nuance of cancer care? A new systematic evaluation from Shanghai places that question under rigorous clinical scrutiny, comparing two prominent Chinese large language models in the context of ovarian cancer diagnosis and treatment.
In the study titled "Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness," published in Diagnostics, researchers present one of the most structured head-to-head clinical assessments of generative AI systems in gynecologic oncology. Their findings reveal sharp performance differences between DeepSeek-R1 and Doubao-1.5-pro, while also exposing the broader limits of AI in complex medical decision-making.
The research team, based at the Shanghai First Maternity and Infant Hospital and affiliated institutions, designed a methodologically controlled benchmark to evaluate whether these models could align with established international clinical guidelines. Ovarian cancer remains one of the most aggressive gynecologic malignancies worldwide, with high mortality and intricate treatment pathways that demand precision across surgery, chemotherapy, genetic testing, maintenance therapy, and long-term follow-up.
To ensure objectivity, the researchers developed a 20-question evaluation framework grounded in NCCN, FIGO, and ESMO guidelines. The questions were divided into four equally weighted domains: Risk Factors and Prevention, Surgical Management, Medical Treatment, and Surveillance. Each model answered the same 20 standardized questions independently, with every response submitted in a new session to minimize interaction bias.
Five senior gynecologic oncology chief physicians then evaluated each answer on a 10-point scale based on accuracy and completeness. Scores above seven were categorized as clinically "Excellent." In total, 200 expert ratings were collected, 100 for each model.
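The scoring scheme reduces to simple arithmetic: each of 20 answers gets five expert ratings on a 10-point scale, a rating above seven counts as "Excellent," and a question clears the threshold when its average rating exceeds seven. A minimal sketch of that tally (the ratings below are synthetic, not the study's data):

```python
# Sketch of the study's scoring scheme. EXCELLENT_THRESHOLD and the
# synthetic ratings are illustrative assumptions, not the real dataset.
EXCELLENT_THRESHOLD = 7

def summarize(ratings):
    """ratings: list of 20 questions, each a list of 5 expert scores (1-10).

    Returns (number of individual ratings above the threshold,
             number of questions whose average rating is above it)."""
    flat = [score for question in ratings for score in question]
    excellent_ratings = sum(1 for score in flat if score > EXCELLENT_THRESHOLD)
    excellent_questions = sum(
        1 for question in ratings
        if sum(question) / len(question) > EXCELLENT_THRESHOLD
    )
    return excellent_ratings, excellent_questions

# Synthetic demo: every rater awards a 9, so all 100 ratings and
# all 20 question averages clear the threshold.
demo = [[9, 9, 9, 9, 9] for _ in range(20)]
print(summarize(demo))  # (100, 20)
```

Under this scheme, 5 raters × 20 questions yields exactly the 100 ratings per model reported above.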
DeepSeek-R1 delivers stronger clinical alignment
The results show a decisive advantage for DeepSeek-R1. Of its 100 individual expert ratings, 98 were classified as Excellent. All 20 of its responses achieved average scores above the seven-point threshold. In contrast, Doubao-1.5-pro received only 41 Excellent ratings out of 100, and only nine of its 20 answers achieved an average score above seven.
The distribution of performance across domains further illustrates the gap. DeepSeek-R1 achieved near-universal excellence in Medical Treatment and Surveillance, two categories requiring detailed knowledge of chemotherapy protocols, targeted therapies such as PARP inhibitors, immunotherapy considerations, recurrence definitions, and management of adverse effects. In these clinically intensive domains, it maintained consistent high-level scoring.
Doubao-1.5-pro performed comparatively better in Risk Factors and Prevention, where questions centered on BRCA mutation testing, family history assessment, and prevention strategies. However, its performance declined sharply in the Medical Treatment and Surveillance categories. In Medical Treatment, only 12 percent of its ratings reached the Excellent threshold, indicating major deficiencies in addressing complex therapeutic decision-making.
Statistical testing reinforced these observations. DeepSeek-R1 demonstrated no significant variation in median scores across the four domains, reflecting consistent performance. Doubao-1.5-pro, however, showed statistically significant differences between domains, suggesting uneven knowledge depth. Direct comparisons between the two models across all four domains revealed statistically significant differences, with the largest performance gap in the Medical Treatment category.
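This summary does not name the exact statistical test, but comparisons of median scores between two groups are commonly done with a Mann-Whitney U test. As an illustration only, here is a bare-bones version (normal approximation, no tie correction) run on synthetic scores; the variable names and data are assumptions, not the study's:

```python
import math

def mann_whitney_u(xs, ys):
    """Simplified two-sided Mann-Whitney U test using a normal
    approximation without tie correction - a sketch, not the exact
    procedure the paper used."""
    # U counts, over all (x, y) pairs, how often x beats y (ties count 0.5).
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    n1, n2 = len(xs), len(ys)
    mu = n1 * n2 / 2                                   # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # std dev under H0
    z = (u - mu) / sigma
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

# Synthetic domain scores: one model consistently high, one consistently low.
strong = [9, 9, 8, 9, 10, 9, 8, 9, 9, 8]
weak = [6, 5, 6, 7, 5, 6, 6, 5, 7, 6]
u, p = mann_whitney_u(strong, weak)
print(u, p)  # U = 100.0; p well below 0.05
```

With clearly separated score distributions like these, the test flags a significant difference, mirroring the between-model gaps the authors report; identical distributions yield p = 1.0.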
The radar-based performance comparison across the 20 individual questions showed that DeepSeek-R1 outperformed Doubao-1.5-pro in 19 out of 20 cases. Only one surgical protocol question saw Doubao achieve a marginally higher average score.
Strengths and structural weaknesses of AI in oncology
While the models' responses were generally detailed, logically structured, and closely aligned with guideline-based practice, the researchers identified specific inaccuracies and omissions in each.
In certain surgical eligibility scenarios, DeepSeek-R1 simplified indications without fully differentiating based on histological subtype. In another case, it applied staging language more narrowly than current guidelines require. For secondary cytoreductive surgery, it omitted a key eligibility condition. In its discussion of hyperthermic intraperitoneal chemotherapy, it constrained indications more strictly than guideline language specifies.
These errors were categorized as minor inaccuracies rather than dangerous misrecommendations. Nonetheless, the study underscores that even high-performing LLMs can embed outdated or oversimplified interpretations due to static training data. The authors attribute some discrepancies to insufficiently updated datasets, a recurring limitation in medical AI systems.
Doubao-1.5-pro displayed broader weaknesses. Many of its answers remained at the level of general medical explanation rather than professional clinical guidance. In high-risk areas such as maintenance therapy eligibility, platinum resistance definitions, and immunotherapy indications, responses lacked essential decision-making criteria. Although the model did not generate explicit harmful recommendations, omission of critical eligibility details in surgical and pharmacologic contexts was considered a serious limitation.
The study also highlights stylistic differences. DeepSeek-R1 often produced longer, more complex responses with structured subsections and evidence-based reasoning. While this enhanced completeness, it also introduced overly technical language that may burden non-specialist users. Doubao's responses were shorter and simpler but frequently lacked necessary clinical nuance.
Implications for AI in clinical practice
DeepSeek-R1 demonstrates meaningful potential as a supplementary educational tool and assistive clinical support system. Its strengths in risk assessment, surgical principles, treatment planning, and follow-up management suggest that high-performing LLMs can enhance information synthesis and guideline referencing.
However, the study firmly rejects the idea that LLMs are ready for independent clinical deployment. Human clinicians must retain ultimate responsibility for diagnostic and therapeutic decisions. Even minor inaccuracies in staging interpretation or treatment eligibility can carry significant consequences in oncology.
The research further calls for continuous model updating, integration of guideline-grounded retrieval systems, and multidimensional safety assessments. Developers must address hallucination risks, outdated references, and excessive verbosity. Moreover, models should be optimized not only for technical accuracy but also for clarity and human-centered communication.
The authors outline future research directions, including repeated-response testing to assess output stability, variation in question phrasing to evaluate adaptability, inclusion of leading international models such as GPT-4o and Claude for broader benchmarking, and comparative analysis with practicing oncologists to measure real-world performance gaps.
They also acknowledge limitations in their own study design. The evaluation included only 20 questions and five expert raters from a single institution. The seven-point excellence threshold reflects expert consensus but remains inherently subjective. The analysis captured performance at a fixed point in time and did not assess integration into live clinical workflows.
First published in: Devdiscourse