AI’s Potential to Transform Assessments

On December 19, 2024

Opening plenary speakers for ABMS Conference 2024 touted the potentially transformative nature of artificial intelligence (AI) and explored the myriad opportunities it presents for American Board of Medical Specialties (ABMS) Member Boards to enhance their work in a new era of competence.

Member Boards’ AI journey

The launch of ChatGPT in November 2022 created a seismic shift around the world as it quickly became one of the internet’s most referenced and tested programs, stated Michele S. Pore, MBA, CAE, Executive Director of Administrative Affairs at the American Board of Anesthesiology (ABA). Closer to home, reports about generative AI’s performance on certification examinations were popping up everywhere. “Our worlds were rattled as we tried to figure out what AI meant for the work we do and for the practice of medicine,” she said.


Among the benefits of AI are that it can reduce the time it takes to find, synthesize, and present relevant facts; improve efficiency of some administrative tasks, allowing physicians to spend more time practicing high-level patient care; augment an already burdened workforce; enhance the accuracy of diagnosis and treatment plans; and create a more equitable health care system.

But AI also has limitations, some of which pose significant challenges for Member Boards. “The output of AI is only as good as the input,” Pore stated. AI can produce inaccurate or only partially correct information. While AI can enhance human expertise, it cannot replace physicians and their knowledge and judgment. AI can summarize data, but it lacks the context, critical thinking, and judgment that humans bring to making sense of the data, she said. AI also raises ethical concerns about protecting patient data and privacy, as well as regulatory concerns, since the innovation has moved faster than the regulation meant to harness it. Additionally, AI may create inequitable health care for certain subpopulations because the underlying data contain biases.

Increasingly more Member Boards are exploring how AI can augment physician assessment. “AI won’t replace doctors, but doctors who know how to use AI will replace doctors who don’t,” Pore said, adding, “There may come a day when a doctor’s proficiency with using AI for practice will need to be assessed.”

Meanwhile, most Member Boards are focusing on AI’s impact on question development and the legal implications of not being able to copyright AI-generated material. To date, a handful of boards are using AI for item development, primarily for continuing certification assessments. About one-third of the boards are using it to enhance the efficiency of operational work. The boards are also managing concerns about the potential to cheat on longitudinal assessments using AI, and about research studies comparing the exam performance of humans versus AI, which do not take into account the complexity and sophistication of certification exam scoring.

In spring 2023, ABA approved piloting the use of AI to generate questions for longitudinal assessment but paused this effort due to copyright concerns. Soon after, ABA released an AI position statement recognizing the possibilities that it could bring to advancing the board’s mission but prohibiting use of AI for question development. ABA directors wanted to explore ways to safely integrate the technology while protecting the board’s intellectual property.

In February, ABA established a technology task force charged with generating AI pilots. Pilots currently underway include using AI to generate questions and high-quality meeting minutes and to summarize Help Desk themes to provide staff with education and support resources. These kinds of analyses that would take staff up to six months to complete would likely take AI a few minutes, Pore said. Proposed pilots addressing operations include summarizing documents and developing agenda books. AI should be designed to help staff with rote tasks so they can spend more time doing critical thinking, she said. ABA has activated Microsoft Copilot and plans to host workshops to provide staff with some AI experience.

ABA has developed an AI policy designed to balance innovation while safeguarding proprietary information. The board allows the use of AI with supervisory approval and encourages AI experimentation. The task force is expected to produce a final report next fall, providing recommendations for implementing AI-based pilot results and sharing a proposed structure for continued innovation in certification.

Integration and scalability

Victoria Yaneva, PhD, Manager of AI and Natural Language Processing (NLP) Research at the National Board of Medical Examiners (NBME), discussed key considerations for using generative AI to aid in the development of multiple-choice clinical questions in a scalable way. “I believe that in the distant future, we will have new types of assessments that have been transformed by this technology,” she said.


In the current landscape, there is anecdotal evidence about using AI to assist with item writing. What is missing is evidence that integrating AI support leads either to productivity gains or other types of improvements, Dr. Yaneva said. But it’s not just about integration, she stressed, it’s about scalability.

Considerations for moving forward include keeping human experts at the core of item development. “The goal is not to replace item writers, but rather to give them AI to support their efforts,” she said. Using state-of-the-art AI methods is imperative, but medical education assessment has been slow to adopt them. “This lag should be shortened because the more sophisticated and advanced the methods are, the better the outcomes will be,” Dr. Yaneva said. Using generalizable approaches will foster large-scale assessments across numerous competencies and specialties, instead of relying on refined prompts that each produce only one type of item. It is also important to evaluate the overall process so that integrating AI into one part of the process won’t cause bottlenecks in other parts.

Moreover, data security is essential. “We don’t want to put our data out there in ChatGPT or other external software and give it to third parties,” she said. Currently, it is difficult to know how legislation around copyright protection will affect AI-assisted item writing, as the law is still being developed. “That said, human creativity will always be needed for the development of clinical items, which might strengthen the case for why they should qualify for copyright protection,” Dr. Yaneva stated. Responsible use requires developing standards to meet specific criteria for exam items.

With these considerations in mind, NBME embarked on exploratory research in using AI to assist item writers with developing items. “We think of AI-assisted item generation as jumpstarting item writing,” she said. NBME’s NLP team trained an AI agent using its data to create 500 drafts of multiple-choice clinical questions across numerous subject areas. These drafts can be used as a starting point for item writers to edit and create items, improving efficiency and productivity along the way, Dr. Yaneva said.

The drafts must still undergo subject matter expert (SME) review to determine whether they are helpful before being edited and finalized by item writers and other experts. This is where thinking through the whole process is relevant, she said. Subject matter experts have limited time, and there are a limited number of SMEs. Thus, one must either introduce an automated evaluation that deletes unhelpful drafts so the SMEs can focus on the helpful ones, or generate enough high-quality drafts that the time spent reviewing the overall sample is worthwhile. For example, if the SMEs found 20 AI drafts helpful, how does that compare with writing 20 items from scratch?

The NBME had 10 medical faculty review the 500 draft questions, keeping those rated as helpful by both reviewers. About 55 percent were approved and 45 percent were discarded (for many of the discarded drafts, the two reviewers disagreed). Most reviewers reported that they found the process worthwhile. Next, more expert feedback will help the AI model better match reviewer preferences, she explained. Whether items are written with AI assistance or another automated method, they still need to go through the same expert review process and pretesting as traditionally developed items, Dr. Yaneva emphasized.
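The dual-review screening described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration of the decision rule, not NBME’s actual pipeline: the function name and sample data are invented, and the rule is simply that a draft survives only if both assigned reviewers rate it as helpful.

```python
def screen_drafts(ratings):
    """Split AI-generated drafts into kept and discarded sets.

    ratings: dict mapping a draft ID to a pair of booleans, one per
    reviewer, where True means that reviewer rated the draft helpful.
    A draft is kept only when both reviewers agree it is helpful.
    """
    kept, discarded = [], []
    for draft_id, (reviewer_a, reviewer_b) in ratings.items():
        if reviewer_a and reviewer_b:
            kept.append(draft_id)
        else:
            discarded.append(draft_id)
    return kept, discarded


# Illustrative sample: one unanimous keep, one disagreement, one unanimous reject.
ratings = {
    "draft-001": (True, True),    # both reviewers rated it helpful -> kept
    "draft-002": (True, False),   # reviewers disagreed -> discarded
    "draft-003": (False, False),  # both rated it unhelpful -> discarded
}
kept, discarded = screen_drafts(ratings)
print(kept)       # ['draft-001']
print(discarded)  # ['draft-002', 'draft-003']
```

Under this rule, disagreements count against a draft, which matches the report that many discarded drafts were ones the two reviewers split on; a more lenient pipeline might instead route disagreements to a third reviewer.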

Using generative AI only to write multiple-choice questions, however, is a missed opportunity, she stated. The medical education community should not stop at using AI to support existing assessments. “We want to be able to measure new constructs that we currently can’t measure through digital-first, AI-powered assessments,” she said. It will require many steps to get there, but the starting point is working with the current assessments and existing AI tools. “We can work and learn together,” Dr. Yaneva concluded.

The promise of AI in CBE

Jason R. Frank, MD, MA (Ed), FRCPC, FAOA (hon), Director of the Centre for Innovation in Medical Education and a Professor of Emergency Medicine at the University of Ottawa Faculty of Medicine, linked the promise of AI to competency-based education (CBE).


Competency-based education was introduced in the late 1970s as an attempt to achieve a threshold standard for competence, Dr. Frank noted. But it has taken decades to move from an apprenticeship era in training to a systems-based era focused on competencies. The reality is that fixed-time training systems produce graduates with variable progression of expertise, and there is evidence in every specialty that some graduates leave training with gaps, he said. In the current training system, there are problems with assessment, supervision, and equity. Practice variability exists in countries around the world, not just in the United States. As a result, these systems are not producing the needed workforce. “We now know that time spent is not good enough because human beings learn things at different rates,” Dr. Frank stressed.

In the new CBE era, the mantra is no physician left behind. “We want every trainee to meet thresholds for safety and competence across a larger definition of competence,” he said, one that covers not just clinical expertise but communication skills, teamwork, and so on. Competency-based education is driving more direct observation and individualized, tailored training, as well as better coaching, feedback, and data on progress in authentic settings, Dr. Frank said.

At the same time, the notions of competence are changing. The focus has shifted from character to time, to knowledge and habits, and now to abilities in context. It may be time to move away from the concept of competence that has been defined by Miller’s Pyramid of Clinical Competence for the past 20-plus years as this definition may be holding the profession back, he said. It could be replaced with a multidimensional view of competence that is canonical, contextual, and personalized.

New assessments using AI could include work sampling via electronic health records, video analysis, virtual reality, and simulation; collation and curation of practice metrics; real-time learning analytics; practice profiling to tailor an assessment to a trainee’s unique scope; and precision education to tailor continuing professional development based on patient population. “Right now, you can do automatic sampling of a trainee’s procedure logs,” Dr. Frank said, adding, “In my era, it was a piece of paper or booklet we kept in our lab coat to write in.”

Dr. Frank proposes creating a “program of assessment for a medical career.” This competence ladder would be a shared competence framework adapted to every stage of a physician’s career. It could include blueprints for learning and assessment, assessment both for and of learning, and a new model of competence focused on assessment. AI could be used for developmental, continuous sampling of physicians’ work from pre-certification through certification and into practice.

There are pitfalls in using AI for assessment, he noted. These include hallucinations (AI makes stuff up) and lost nuances because it’s not human. Structural bias is baked into current AI models and there are concerns with overreach and privacy. It’s unclear if AI assessments for international medical graduates coming from different training systems would be compatible.

“We are in the competence business and it’s our job to change and adopt these new technologies,” Dr. Frank concluded.

© 2024 American Board of Medical Specialties
