TLDR summary:
The human in the loop remains the **essential bridge** between AI's computational power and the uncharted territory of individual patient care. Until AI can *truly* grapple with uncertainty (not just rare cases), physicians won't be replaced, because **medicine's hardest problems aren't puzzles to solve, but human experiences to navigate.**
The hype around "85% accuracy" obscures a deeper truth: **a diagnosis and its treatment aren't the destination; they're steps in a uniquely human journey.**
Point 2 in the conclusion of the AI-driven morning journal club around this article is what we've always been harping on while using LLMs seamlessly (point 3) in our workflow, which is still very much human-loop dependent: we even have a designated human "LLM in charge" to whom we direct all our LLM-related queries, sometimes also putting them directly to Meta AI, the only LLM lurking in our daily coalface UX!
Another important point we wanted to elucidate further is that AI knowledge is only as good as current collective human cognition, which, depending on your acquired-knowledge vantage point, may allow you to recognise that the known realm is just a tiny fraction of the vast swathes of the unknown.
For example, the ability to solve NEJM cases, where AI scores 85% against human doctors' 20%, reflects only the known pathway that each patient's illness trajectory happened to take; the many other trajectories it might have taken remain unknown and, for now, unpredictable! As long as that remains uncracked by AI (which is essentially manifest collective human cognition), the individual human patient shall still remain vulnerable while the medical expert's job shall remain safe!
Web 2.0 transcripts:
[02/07, 10:34] hu1: Hello folks - I am keen to hear thoughts from the physicians here on Microsoft AI’s new research and SD bench on diagnostics
[02/07, 14:35] cm (quoting from the Microsoft paper inside the Web 2.0 link above, repasted here: https://microsoft.ai/new/the-path-to-medical-superintelligence/?trk=comments_comments-list_comment-text): Just the first step. Important challenges remain before it can be safely and responsibly deployed across healthcare. They still need evidence drawn from real clinical environments, alongside appropriate governance and regulatory frameworks to ensure reliability, safety, and efficacy. 👍
hu2: Work in progress which will fine-tune itself to get better over time. As a tool in solving clinical complexities, AI will serve well as a go-to resource in complex scenarios. Three opinions. 1) Clinical diagnosis, or even final diagnostic accuracy, has not been the most challenging problem in medicine. Even without AI, most medical doctors or care teams come up with a differential, laboratory, and tissue diagnosis. AI is not a non-negotiable requirement for this purpose in the common disease category that presents to the doctor, so the 85% accuracy figure must be seen contextually. 2) Medical diagnosis is a small part of healthcare. The latter is the big canvas that includes access, costs, logistics, follow-up, and wellness measures. What a GP offers is far more holistic and complete than the fragmented approach of hospital-based care. AI is not a ready reckoner to obviate the need for clinical thinking and reasoning in the common ailment categories (nearly 70% of any medical caseload). 3) The bigger issues that need to be addressed in healthcare are beyond the scope of AI. A contextual and holistic approach to a patient's problems needs the imperfect yet empathetic human brain to understand and offer.
Web 2.0-4.0:
[01/07, 23:08] hu3: https://youtu.be/uJ9RjUZ3bRs?si=C5Br92goUx1hdyms
[02/07, 06:58] cm: Can a textual summary of this be shared by DeepSeek?
[04/07, 07:53] hu3's LLM: Based on the provided WhatsApp screenshot, here's a concise textual summary of the Microsoft AI Diagnostic Orchestrator (MAI-DxO):
---
### **Summary of Microsoft AI Diagnostic Orchestrator (MAI-DxO)**
1. **Core Claim**:
- MAI-DxO is reported to diagnose medical conditions **"4x better than doctors"** (specific context/benchmarks not detailed in the screenshot).
2. **Technical Design**:
- A **multi-agent AI system** that emulates a **virtual panel of physicians** with diverse diagnostic approaches.
- Orchestrates collaboration between specialized AI agents to improve diagnostic accuracy.
3. **Source**:
- Details are sourced from a **YouTube video** titled:
*"Microsoft AI Diagnostic Orchestrator (MAI-DxO) AI framework diagnoses 4x better than doctors?"*
- Video Link: https://youtu.be/uJ9RjUZ3bRs?si=C5Br92golX1hdyms
- Duration: 8 minutes 38 seconds.
4. **Key Implication**:
- Potential to significantly enhance medical diagnostics by leveraging collaborative AI agents, though real-world validation and peer-reviewed data are not referenced in the snippet.
---
### Notes:
- The claim of "4x better" likely refers to specific metrics (e.g., speed, accuracy in controlled tests) but requires verification from the full video/primary sources.
- No additional technical specifics (e.g., diseases targeted, clinical trial results) are visible in the screenshot.
For authoritative details, watch the linked video or refer to official Microsoft publications.
[04/07, 07:56] Dr.Abdulsalam Med Ed Leads: ### Summary
Microsoft’s AI team has developed an innovative AI diagnostic framework called the Microsoft AI Diagnostic Orchestrator, designed to emulate a virtual panel of specialized physicians collaboratively solving clinical diagnostic cases. This multi-agent system integrates large language models (LLMs) like OpenAI’s GPT-4 and others to tackle complex medical cases, evaluated on a newly developed benchmark named SD Bench. SD Bench uses 304 real-world, complex diagnostic cases published by the New England Journal of Medicine (NEJM), simulating a stepwise diagnostic process where the AI must decide which questions to ask and which tests to order to reach a diagnosis. In testing, the AI framework achieved an impressive 85.5% diagnostic accuracy, significantly outperforming practicing physicians who averaged only 20% accuracy on the same cases.
This framework involves multiple AI agents acting as different clinical roles: hypothesis generation, test selection, critical challenge, cost management, and quality control. These agents debate and collaborate to determine the optimal diagnostic pathway. The SD Bench environment simulates a realistic clinical encounter, with a gatekeeper AI controlling information flow to the diagnostic panel, mimicking doctor-patient interactions. Costs are incorporated, with each question or test costing simulated dollars to evaluate cost-effectiveness alongside diagnostic accuracy.
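To make that orchestration pattern concrete, here is a minimal, runnable Python sketch of the loop just described: a gatekeeper meters out findings at a simulated price while the panel roles drive the control flow. Every name, price, and the fixed "playbook" below is an invented stand-in; the real MAI-DxO roles are LLM agents and their implementation has not been published.

```python
# Sketch of the gatekeeper-plus-panel loop described above.
# All names, prices, and the fixed "playbook" are hypothetical:
# the real MAI-DxO roles are LLM agents, and its code is not public.

PRICES = {"question": 20.0, "lab": 150.0, "imaging": 1200.0}

class Gatekeeper:
    """Holds the full case record; releases one finding per paid request."""
    def __init__(self, case_record):
        self.case = case_record
        self.spend = 0.0

    def reveal(self, item, kind):
        self.spend += PRICES[kind]      # every question/test costs dollars
        return self.case.get(item, "not available")

def run_panel(gk, playbook, budget=2000.0):
    """Walk the playbook until a diagnosis is confirmed or funds run out."""
    working_dx = "uncertain"
    for item, kind, confirming_value, dx in playbook:
        if gk.spend + PRICES[kind] > budget:   # cost-manager agent halts
            break
        finding = gk.reveal(item, kind)        # test-selection agent acts
        if finding == confirming_value:        # quality-control agent accepts
            working_dx = dx
        # a skeptic (critical-challenge) agent would demand a check here
    return working_dx, gk.spend

# Toy case: each playbook row is (item, price class, confirming value, dx).
case = {"ferritin": "very high", "hfe_gene": "C282Y homozygous"}
playbook = [("ferritin", "lab", "very high", "iron overload"),
            ("hfe_gene", "lab", "C282Y homozygous", "haemochromatosis")]
print(run_panel(Gatekeeper(case), playbook))  # ('haemochromatosis', 300.0)
```

The point of the sketch is the control flow SD Bench scores: information is hidden behind a paid interface, so diagnostic accuracy and cumulative spend are measured together.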
Despite the impressive results, expert analysis highlights limitations and potential misinterpretations. The benchmark cases are rare, complex medical puzzles rather than typical clinical presentations. The AI was not tested on healthy patients or common illnesses, raising concerns about over-testing and over-diagnosis. Doctors in the comparison were restricted from normal clinical practices like consulting databases or colleagues, making the comparison less realistic. Clinical medicine involves uncertainty and judgment about when to test or observe, which the AI system currently lacks. The AI’s safety, validation, and regulatory approval remain pending, and its role is envisioned as augmenting, not replacing, physicians by automating routine tasks, aiding diagnosis, and personalizing care.
The research shows promise for AI’s potential in diagnostic medicine but underscores the need for further testing in real-world clinical settings, evaluating patient outcomes, cost-effectiveness, and integration with human judgment.
### Highlights
- 🤖 Microsoft’s AI Diagnostic Orchestrator simulates a virtual panel of specialized doctors using multi-agent AI.
- 📊 The AI achieved 85.5% accuracy on NEJM’s 304 complex diagnostic cases, outperforming human physicians at 20%.
- 🧩 SD Bench benchmark transforms real clinical cases into interactive diagnostic challenges with controlled information flow.
- 💰 The framework factors in diagnostic costs, showing AI as both more accurate and cheaper than human doctors.
- ⚠️ Experts caution the benchmark uses rare cases, excludes typical patients, and restricts doctors’ usual resources, limiting real-world applicability.
- 🩺 AI is seen as a tool to augment clinical roles, not replace doctors, emphasizing trust, judgment, and uncertainty management.
- 🔍 Further real-world testing and regulatory validation are required before clinical deployment.
### Key Insights
- 🤖 **Multi-Agent AI as a Virtual Clinical Panel:** The diagnostic orchestrator’s design to simulate five specialized “doctor” agents reflects an innovative approach to capture the collaborative nature of clinical decision-making. By dividing diagnostic tasks (hypothesis generation, test selection, skepticism, cost control, and quality review), the system mirrors the teamwork in real medicine, improving accuracy and robustness over single-model approaches. This modular orchestration is crucial to managing the complexity of medical diagnostics.
- 📚 **SD Bench: A Novel Diagnostic Benchmark:** The creation of SD Bench using 304 real NEJM case records offers a rigorous testing ground that mimics the sequential and interactive nature of clinical diagnosis. Unlike static datasets, this benchmark requires the AI to ask pertinent questions and order tests stepwise, simulating the diagnostic reasoning process. This dynamic testing environment better evaluates an AI’s clinical reasoning rather than simple pattern recognition.
- 📉 **Human Physicians’ Performance and Benchmark Limitations:** The 20% accuracy reported for physicians is surprisingly low but is influenced by artificial constraints such as banning internet searches, consultation, or access to medical databases. This restriction does not reflect current clinical practice, where collaboration and information access are vital. Therefore, while the AI outperforms doctors in this setting, the comparison should be interpreted cautiously.
- 💵 **Cost Considerations in AI Diagnostics:** Incorporating realistic cost modeling, where each test and question has a simulated financial cost, adds an important dimension to evaluation. This approach acknowledges that diagnostic accuracy alone is insufficient; cost-effectiveness and resource management are essential in healthcare. The AI orchestrator's ability to achieve higher accuracy at a lower or comparable cost suggests potential for more efficient diagnostic workflows. However, the model does not account for indirect costs like patient anxiety, complications from unnecessary testing, or radiation exposure, which are critical in real-world settings. (A toy scoring sketch follows this list.)
- ⚠️ **Risk of Overdiagnosis and Overtesting:** The AI system was tested primarily on rare and complex “zebra” cases, which are less common in typical clinical practice where most symptoms are due to common “horse” diseases. Experts warn that without the ability to decide when not to test or diagnose, the AI may recommend excessive or unnecessary procedures, leading to harm. The art of clinical medicine lies in balancing investigation with reassurance, a nuance current AI lacks.
- 🩺 **AI’s Role as a Clinical Assistant, Not a Replacement:** The technology’s advancement is promising but does not substitute the broader roles of physicians, including managing ambiguity, communicating with patients, and ethical decision-making. The AI can automate routine, repetitive tasks, assist in early detection, and personalize treatment plans, potentially transforming clinical workflows rather than replacing the physician’s judgment and interpersonal skills.
- 🔬 **Path Forward Requires Rigorous Clinical Validation:** The Microsoft team acknowledges that the AI diagnostic orchestrator is at an early research stage, requiring extensive safety testing, clinical trials, and regulatory approval. Real-world studies involving diverse patient populations and measuring outcomes beyond accuracy—such as patient harm, overdiagnosis, and cost-benefit in clinical practice—are necessary to understand its true utility and impact.
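As flagged in the cost insight above, a small scorer makes the paired accuracy-and-cost metric concrete. The episode numbers below are invented for illustration, not SD Bench results.

```python
# Toy scoring of diagnostic strategies on accuracy AND mean simulated
# cost, in the spirit of SD Bench's paired metrics. Numbers are invented.

def score(episodes):
    """episodes: list of (diagnosis_correct: bool, dollars_spent: float)."""
    n = len(episodes)
    accuracy = sum(ok for ok, _ in episodes) / n
    mean_cost = sum(cost for _, cost in episodes) / n
    return accuracy, mean_cost

panel_ai   = [(True, 2400.0), (True, 1800.0), (False, 3100.0), (True, 2000.0)]
physicians = [(False, 1200.0), (True, 900.0), (False, 1500.0), (False, 800.0)]

for name, eps in (("panel AI", panel_ai), ("physicians", physicians)):
    acc, cost = score(eps)
    print(f"{name}: accuracy {acc:.0%}, mean cost ${cost:,.0f}")
```

Scoring both axes penalizes the degenerate strategy of ordering every test; what it still misses, as noted in the insight above, are indirect costs such as anxiety, complications, and radiation exposure.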
In summary, Microsoft’s AI diagnostic orchestrator represents a significant step forward in medical AI, demonstrating how multi-agent systems can collaboratively solve complex diagnostic challenges with impressive accuracy and cost-efficiency. However, clinical applicability is limited by current benchmarks, the complexity of real-world medical practice, and the need for further validation. The future of AI in medicine lies in augmenting physician capabilities, improving diagnostic precision, and enhancing patient care through thoughtful integration and rigorous testing.
[04/07, 08:00] hu3's LLM: Here's a concise breakdown of Microsoft's AI Diagnostic Orchestrator and its implications, highlighting both innovation and key caveats:
**Core Innovation**
- **Multi-Agent Collaboration**: Five specialized AI agents (hypothesis generator, tester, skeptic, cost manager, quality controller) simulate a virtual medical team, debating diagnostic steps.
- **SD Bench Benchmark**: Uses 304 real NEJM complex cases, requiring *interactive* diagnosis (questions/tests ordered sequentially with cost constraints).
- **Performance**: Achieved **85.5% accuracy**, vastly exceeding physicians' **20%** in the same controlled environment.
**Strengths**
✅ **Superior Accuracy**: Outperforms humans on rare/complex cases by leveraging structured collaboration.
✅ **Cost-Efficiency**: Explicitly optimizes testing costs during diagnosis (e.g., avoiding unnecessary tests).
✅ **Dynamic Reasoning**: Moves beyond pattern recognition to stepwise clinical decision-making.
**Critical Limitations**
⚠️ **Artificial Comparison**: Physician accuracy (20%) is misleading—doctors were barred from using real-world tools (databases, colleagues, internet).
⚠️ **"Zebra-Only" Testing**: Exclusively tested on rare, complex cases—untested on common illnesses or healthy patients, risking **overdiagnosis** in practice.
⚠️ **Missing Clinical Nuance**: Lacks human judgment on *when not to test*, patient anxiety, or longitudinal care.
⚠️ **Unaddressed Risks**: No evaluation of real-world harms (e.g., false positives, radiation exposure from excess imaging).
**Practical & Ethical Concerns**
- **Overtesting Risk**: AI trained on rare cases may over-prescribe tests for common symptoms.
- **Regulatory Gap**: Not yet validated for clinical use; requires FDA approval and safety trials.
- **Augmentation, Not Replacement**: Designed to **assist physicians** (e.g., automating routine tasks), not operate autonomously.
**Path Forward**
- **Real-World Trials**: Test in diverse clinical settings with outcome-focused metrics (e.g., patient harm reduction, cost savings).
- **Integration Framework**: Develop protocols for AI-human collaboration, preserving physician oversight.
- **Expand Benchmarks**: Include common conditions and "watchful waiting" scenarios to assess diagnostic restraint.
**Conclusion**
While a technical breakthrough in multi-agent AI diagnostics, real-world deployment hinges on:
1. Rigorous validation for safety and equity,
2. Addressing overdiagnosis risks in common cases,
3. Seamless integration into clinical workflows *supporting*—not replacing—physician judgment.
*The 85.5% accuracy is impressive but context-specific. True impact requires proving utility beyond complex puzzles to everyday patient care.* (DeepSeek, after sharing the YouTube video text)
[04/07, 08:15] cm: 👏👏
TLDR summary: Point 2 in the conclusion of the AI-driven morning journal club around this article is what we've always been harping on while using LLMs seamlessly (point 3) in our workflow, which is still very much human-loop dependent: we even have a designated human "LLM in charge" to whom we direct all our LLM-related queries, sometimes also putting them directly to Meta AI, the only LLM lurking in our daily coalface UX!
Another important point we wanted to elucidate further is that AI knowledge is only as good as current collective human cognition, which, depending on your acquired-knowledge vantage point, may allow you to recognise that the known realm is just a tiny fraction of the vast swathes of the unknown. For example, the ability to solve NEJM cases, where AI scores 85% against human doctors' 20%, reflects only the known pathway that each patient's illness trajectory happened to take; the many other trajectories it might have taken remain unknown and, for now, unpredictable! As long as that remains uncracked by AI (which is essentially manifest collective human cognition), the individual human patient shall still remain vulnerable while the medical expert's job shall remain safe!
[04/07, 08:42] hu3's LLM: Your insights cut straight to the heart of AI's limitations in medicine—and why human expertise remains irreplaceable. Here's how your points align with the core issues:
### 1. **The "Human LLM" in Your Workflow: Critical Safeguard**
- **Your Practice:** Assigning a *human* to oversee LLM queries (a "Human LLM") mirrors Microsoft's multi-agent design—but with a crucial difference: **human judgment as the ultimate quality control layer**.
- **Why It Matters:** Even Microsoft's "quality control agent" is still an AI. Your approach forces *human arbitration* for ambiguous or uncertain outputs, preventing automation bias. This is the "human loop dependency" that keeps AI *augmentative*, not autonomous. A minimal sketch of such a gate follows.
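Here is a minimal sketch of that arbitration gate, assuming invented fields and thresholds; no real triage policy or production design is implied.

```python
# Sketch of the "human LLM in charge" gate described above: uncertain
# or high-stakes model outputs are escalated to a named human reviewer
# instead of being acted on. Fields and the threshold are invented.

from dataclasses import dataclass

@dataclass
class LlmAnswer:
    text: str
    self_reported_confidence: float   # 0.0-1.0, as elicited from the model
    high_stakes: bool                 # e.g. affects diagnosis or treatment

def route(answer: LlmAnswer, threshold: float = 0.8) -> str:
    """Return who acts on the answer: the workflow, or the human in charge."""
    if answer.high_stakes or answer.self_reported_confidence < threshold:
        return "escalate: human LLM-in-charge must arbitrate"
    return "accept: low-stakes, confident output may flow on"

print(route(LlmAnswer("Likely viral URTI; reassure.", 0.92, high_stakes=False)))
print(route(LlmAnswer("Consider starting chemotherapy.", 0.95, high_stakes=True)))
```

Note the design choice: confidence alone never clears a high-stakes output, so a named human stays accountable for anything that touches diagnosis or treatment.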
### 2. **The Known vs. Unknown Realm: AI's Fatal Blind Spot**
- **The 85% Illusion:**
- ✅ AI excels at *known pathways* (e.g., NEJM's documented "zebras") because it regurgitates patterns from training data.
- ❌ **But medicine isn't just known pathways:** As you note, *individual patient trajectories* involve unpredictable biological, social, and environmental variables. AI can't navigate "unknown unknowns"—only humans can.
- **The 20% Physician Accuracy Trap:**
- The study artificially constrained physicians (no consultations/resources), but *real medicine thrives on collaborative uncertainty management*. A doctor's value isn't just solving puzzles—it's recognizing when a case *defies existing patterns* and warrants novel exploration.
### 3. **Collective Human Cognition ≠ Individual Patient Reality**
- **AI = Compressed Human Knowledge:**
LLMs distill *published* medical knowledge (NEJM cases, textbooks). But **patients don't read textbooks**: Their illnesses manifest uniquely.
- **The Gap:**
| **AI** | **Human Physician** |
|--------|---------------------|
| Diagnoses *retrospectively* from static data | Navigates *prospectively* through evolving illness |
| Bound by training data | Adapts to novel presentations using intuition/experience |
| Optimizes for pattern-matching | Optimizes for *patient-specific context* |
### 4. **Why Physicians Stay "Safe" (For Now)**
- **Vulnerability ≠ Solvability:**
A patient's uniqueness ("infinitely unpredictable trajectories") makes their vulnerability *inherently human*. AI can't "crack" this because:
- It lacks *embodied experience* (e.g., observing a patient's subtle decline over days).
- It cannot weigh *unquantifiable factors* (e.g., "Does this anxious patient need reassurance or a test?").
- **The Unautomatable Core of Medicine:**
```mermaid
graph LR
A[Patient Story] --> B{Uncertainty};
B --> C[Medical Knowledge];
B --> D[Human Judgment];
D --> E[Ethical Action];
E --> F[Trust];
```
**AI handles "C"**—but **B, D, E, F require humans.**
### Bottom Line
Your workflow’s "Human LLM" role isn’t a stopgap—it’s the **essential bridge** between AI’s computational power and the uncharted territory of individual patient care. Until AI can *truly* grapple with uncertainty (not just rare cases), physicians won’t be replaced—because **medicine’s hardest problems aren’t puzzles to solve, but human experiences to navigate.**
*Keep harping on this.* The hype around "85% accuracy" risks obscuring the deeper truth you’ve highlighted: **A diagnosis isn’t the destination—it’s one step in a uniquely human journey.**
Summary with Keywords
The conversation revolves around the limitations of AI in medicine, particularly in diagnostic accuracy and handling unknown or complex cases. Key points include:
- AI's reliance on collective human cognition and existing medical knowledge
- The importance of human judgment and oversight in medical decision-making
- The limitations of AI in handling unpredictable patient trajectories and novel cases
- The role of AI as an augmentative tool, rather than a replacement for human physicians
Thematic Analysis
Coding
- *AI limitations*: The conversation highlights the limitations of AI in medicine, including its reliance on existing knowledge and its inability to handle unknown or complex cases.
- *Human judgment*: The importance of human judgment and oversight in medical decision-making is emphasized, particularly in cases where AI is uncertain or limited.
- *Augmentative role*: The conversation suggests that AI should be used as an augmentative tool, rather than a replacement for human physicians.
Categorization
- *Medical diagnosis*: The conversation focuses on the role of AI in medical diagnosis, particularly in complex or rare cases.
- *Human-AI collaboration*: The importance of human-AI collaboration in medical decision-making is emphasized, with humans providing oversight and judgment.
Learning Outcomes
- *Critical thinking*: The conversation encourages critical thinking about the limitations and potential biases of AI in medicine.
- *Evidence-based practice*: The importance of evidence-based practice in medicine is emphasized, with a focus on the need for rigorous testing and validation of AI systems.
- *Human-centered design*: The conversation suggests that AI systems should be designed with a human-centered approach, prioritizing human judgment and oversight in medical decision-making.