Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study.

Bibliographic Details
Authors: Torres-Zegarra, B.C., Rios-Garcia, W., Ñaña-Cordova, A.M., Arteaga-Cisneros, K.F., Benavente-Chalco, X.C., Bustamante-Ordoñez, M.A., Gutierrez-Rios, C.J., Ramos-Godoy, C.A., Teresa Panta Quezada, K.L., Gutiérrez-Arratia, J.D., Flores-Cohaila, J.A.
Format: article
Publication date: 2023
Institution: Universidad Nacional de Cajamarca
Repository: UNC-Institucional
Language: English
OAI identifier: oai:repositorio.unc.edu.pe:20.500.14074/10222
Resource link: http://hdl.handle.net/20.500.14074/10222
https://doi.org/10.3352/jeehp.2023.20.30
Access level: open access
Subjects: Medical education
Educational measurement
Artificial intelligence
Peru
https://purl.org/pe-repo/ocde/ford#5.03.01
Description
Summary:
Purpose: We aimed to describe the performance and evaluate the educational value of the justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods: This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results: GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; the historical performance of Peruvian examinees is 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In the assessment of the educational value of the justifications provided by GPT-4 and Bing, neither showed significant differences in certainty, usefulness, or potential use in the classroom.
Conclusion: Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing can be deemed appropriate. However, it is essential to begin addressing the educational value of these chatbots, rather than merely their performance on examinations.
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the contents (publications and/or data) accessible through the National Digital Repository of Open Access Science, Technology and Innovation (ALICIA).