The claim that an AI system has surpassed human expert performance in a medical imaging task demands careful scrutiny. When the results come from a well-designed, pre-registered study conducted across multiple independent academic medical centers, with a blinded comparison against board-certified radiologists, they deserve to be taken seriously. That is exactly what a new study published in partnership with Anthropic and three major US hospital systems finds for Claude 4 Opus on chest imaging interpretation.
The study evaluated Claude 4 Opus on 2,400 de-identified chest X-rays and 1,800 chest CT scans sourced from routine clinical workflows across Johns Hopkins, UCSF Medical Center, and Mass General Brigham. The cases covered a representative mix of findings including pneumonia, pulmonary embolism, pleural effusion, pneumothorax, and lung nodules. Each case was also evaluated by two attending radiologists working independently, and by a panel of subspecialty thoracic radiologists whose consensus served as the ground truth reference standard.
The Benchmark
The primary endpoint was sensitivity and specificity for detecting clinically actionable findings: conditions that would change patient management if identified. Claude 4 Opus achieved 91.4% sensitivity and 94.2% specificity on this endpoint. The attending radiologist panel achieved 87.6% sensitivity and 91.8% specificity on average. Against the subspecialist consensus, Claude's AUC (area under the ROC curve) was 0.962, compared to 0.941 for attending radiologists.
Study Results Summary
- Claude 4 Opus sensitivity: 91.4%
- Claude 4 Opus specificity: 94.2%
- Attending radiologist sensitivity (avg.): 87.6%
- Attending radiologist specificity (avg.): 91.8%
- Claude AUC vs. subspecialist consensus: 0.962
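To make the reported metrics concrete, here is a minimal, self-contained sketch of how sensitivity, specificity, and AUC are computed from per-case labels and scores. The data below are invented toy values for illustration; this is not the study's evaluation code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case scores higher than a negative one
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = actionable finding present per the reference standard.
truth  = [1, 1, 1, 0, 0, 0, 0, 1]
preds  = [1, 1, 0, 0, 0, 1, 0, 1]   # binary calls at a fixed threshold
scores = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7]  # continuous scores

sens, spec = sensitivity_specificity(truth, preds)
```

Sensitivity and specificity depend on the chosen decision threshold, while AUC summarizes ranking quality across all thresholds, which is why the study reports both.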
"These results should not be read as a prediction that AI will replace radiologists. They should be read as evidence that AI-assisted radiology can substantially reduce diagnostic errors and improve throughput in under-resourced settings where specialist availability is the bottleneck." — Lead Investigator, Johns Hopkins Department of Radiology
Medical Imaging Results
The performance gap was not uniform across finding types. Claude 4 Opus showed its largest advantage over attending radiologists on subtle interstitial lung disease patterns, conditions that are notoriously difficult to detect on plain film and that often require subspecialist review in practice. On pneumothorax detection, one of the most time-critical diagnoses in emergency settings, Claude's sensitivity was 97.1% versus 91.4% for attending radiologists. On pleural effusion quantification, Claude's volumetric estimates correlated more closely with CT-confirmed volumes than radiologist estimates did.
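The volumetric comparison described above amounts to correlating two sets of estimates against a reference. Below is a hedged sketch using a plain Pearson correlation; the volumes are invented for illustration, and the study may well have used a different agreement statistic.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example volumes (mL); not data from the study.
ct_volumes_ml    = [120, 340, 560, 800, 1050]   # CT-confirmed reference
model_estimates  = [130, 320, 590, 780, 1100]   # hypothetical model output
reader_estimates = [200, 250, 700, 650, 900]    # hypothetical reader output

r_model = pearson_r(ct_volumes_ml, model_estimates)
r_reader = pearson_r(ct_volumes_ml, reader_estimates)
```

"Correlated more closely" here means the model's estimates track the reference volumes with a higher coefficient than the reader's estimates do.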
Attending radiologists outperformed Claude primarily in cases requiring integration of clinical context, for instance recognizing that an opacity is more likely post-surgical than infectious because its position and laterality are typical of a prior procedure that would not itself be visible on the current image. This highlights the distinction between perceptual pattern recognition, where AI excels, and clinical reasoning that integrates imaging findings with patient history and presentation, where human judgment remains essential.
Claude 4 Opus's vision capabilities are not limited to medical imaging. In separate evaluations, the model demonstrated strong performance on satellite imagery analysis, industrial inspection tasks, chart and figure interpretation in scientific papers, and detailed product photography analysis for e-commerce. The underlying capability is a general high-resolution visual understanding, with the medical imaging results serving as perhaps the most striking demonstration of its accuracy ceiling.
Anthropic is clear that Claude 4 Opus is not cleared as a medical device and is not intended to replace clinical radiologist judgment. However, the study results are informing ongoing conversations with the FDA about potential pathways for AI-assisted radiology workflows, and several of the participating hospital systems have indicated interest in piloting Claude as a second-read tool in overnight coverage settings where subspecialist availability is limited.