Generative AI models have gained traction in healthcare settings, with the promise of increased efficiency and the ability to uncover valuable insights. However, critics warn that these models carry inherent flaws and biases that could lead to worse health outcomes. To help assess how useful, and how harmful, these models might be for tasks such as patient record summarization or health-related question answering, the AI startup Hugging Face has introduced a benchmark called Open Medical-LLM, which aims to standardize the evaluation of generative AI models across a range of medical tasks.
Open Medical-LLM is not a new creation but rather a stitching-together of existing test sets, including MedQA, PubMedQA, and MedMCQA, designed to probe models' knowledge of anatomy, pharmacology, genetics, and clinical practice. The benchmark consists of multiple-choice and open-ended questions that require medical reasoning and understanding, drawn from sources such as U.S. and Indian medical licensing exams and biology test question banks.
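To make the mechanics concrete, below is a minimal sketch of the log-likelihood scoring commonly used to grade language models on MedQA/MedMCQA-style multiple-choice questions: the model scores each candidate answer conditioned on the question, and the highest-scoring option counts as its pick. The model name, prompt format, and sample question here are illustrative placeholders, not the leaderboard's actual harness or data.

```python
# Sketch of multiple-choice scoring for MedQA/MedMCQA-style questions:
# compare the model's log-likelihood of each answer option given the question.
# "gpt2", the prompt template, and the question below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a real evaluation would use the candidate LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = (
    "A patient presents with fatigue and microcytic anemia. "
    "Which deficiency is the most likely cause?"
)
options = ["Iron", "Vitamin B12", "Folate", "Vitamin D"]

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option`, conditioned on the question."""
    prompt_ids = tokenizer.encode(f"Question: {question}\nAnswer:", return_tensors="pt")
    full_ids = tokenizer.encode(
        f"Question: {question}\nAnswer: {option}", return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted logits predicts token i+1 of the sequence.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict the answer-option tokens.
    answer_start = prompt_ids.shape[1] - 1
    return token_logprobs[0, answer_start:].sum().item()

scores = {opt: option_logprob(question, opt) for opt in options}
prediction = max(scores, key=scores.get)
print(scores)
print("Model picks:", prediction)
```

Run over thousands of exam-style questions, accuracy computed this way is roughly the kind of number a leaderboard entry summarizes.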
Hugging Face emphasizes that Open Medical-LLM provides a robust assessment of healthcare-bound generative AI models. However, some medical experts caution against relying too heavily on this benchmark, as it may lead to ill-informed deployments. They argue that the gap between the controlled environment of medical question-answering and real clinical practice is significant.
A resident physician in neurology at the University of Alberta points out that while the benchmark enables useful comparisons between models, it does not capture the nuances and idiosyncrasies of actual clinical practice, and he warns of the risks of relying solely on metrics that miss those real-world complexities.
Hugging Face research scientist Clémentine Fourrier acknowledges these concerns and advises that the benchmark be used only as a first pass when exploring candidate generative AI models. She stresses that a deeper round of testing under real-world conditions is needed to understand a model's limits and relevance, and argues that medical models should not be used directly by patients but should instead be trained as support tools for healthcare professionals.
The debate recalls Google's experience with an AI screening tool for diabetic retinopathy that it trialed in Thailand. Despite high theoretical accuracy at scanning eye images for signs of retinopathy, the tool proved impractical in real-world use: it yielded inconsistent results and did not align with the practices and expectations of the nurses and patients involved.
The challenges facing generative AI models are further underscored by the fact that none of the 139 AI-related medical devices approved by the U.S. Food and Drug Administration use generative AI. It is difficult to predict how a model that performs well in controlled lab settings will translate to hospitals and clinics, and just as difficult to understand its long-term outcomes.
While Open Medical-LLM offers valuable insight into the limitations of generative AI models in healthcare, and serves as a reminder of how poorly these models can still answer basic health questions, it is no substitute for comprehensive real-world testing and cannot replace careful evaluation of these models in actual clinical practice.
In conclusion, the introduction of Open Medical-LLM as a benchmark for evaluating generative AI models in healthcare is a step in the right direction, but its limitations must be recognized and thorough real-world testing remains essential. The ultimate goal should be generative AI models that serve as supportive tools for healthcare professionals, improving patient care and outcomes.