ChatGPT as a global doctor: a rapid review of its performance on national licensing medical examinations


Bibliographic Details
Authors: Flores-Cohaila, Javier A., Miranda Chavez, Brayan, Mayta-Tristán, Percy
Format: article
Publication Date: 2025
Institution: Colegio Médico del Perú
Repository: Acta Médica Peruana
Language: English
OAI Identifier: oai:amp.cmp.org.pe:article/3706
Resource link: https://amp.cmp.org.pe/index.php/AMP/article/view/3706
Access level: open access
Subject: Medical education
Materia:Medical education
Artificial Intelligence
ChatGPT
Generative Artificial Intelligence
Description
Summary: Objective: To evaluate ChatGPT's performance on national licensing medical examinations (NLMEs) worldwide and determine whether it could achieve licensure to practice medicine across different countries. Methods: We searched PubMed, Scopus, and Google Scholar for studies evaluating ChatGPT's performance on NLMEs. Reference lists of included studies were also reviewed. Two reviewers independently screened studies and extracted the accuracy rates (performance) of GPT-3.5 and GPT-4, including passing thresholds, human examinee scores, and other study characteristics. The risk of bias was assessed using the JBI Critical Appraisal Checklist for Prevalence Studies. Results: We identified 37 studies evaluating ChatGPT's performance across 18 NLMEs. Most studies assessed the United States, Chinese, and Japanese examinations. While most studies used official datasets, others relied on unofficial third-party sources, and few employed advanced prompting techniques. GPT-4 was superior to GPT-3.5 in all NLMEs, with accuracy rates ranging from 67% to 89%. GPT-4 passed all 18 NLMEs (100%), while GPT-3.5 passed 10 of 15 (67%). Compared to human examinees, GPT-4 outperformed the average score in 6 of 7 NLMEs (86%); the sole exception was Japan, where examinees achieved 84.9% versus 81.5% for GPT-4. Conclusion: Current evidence demonstrates that GPT-4 can pass all 18 NLMEs evaluated, surpassing human examinees in most cases. However, this finding likely reflects low passing thresholds rather than AI superiority over physicians.
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository hosting this document or dataset. CONCYTEC is not responsible for the content (publications and/or data) accessible through the National Digital Repository of Open Access Science, Technology and Innovation (ALICIA).