Yan Peng

Yan is a Master’s student in Artificial Intelligence at Monash University. His research focuses on applying large language models (LLMs) to automate the detection of falsified academic papers.


ABSTRACT:

Using Artificial Intelligence to assess the Trustworthiness of RCTs

Yan Peng(1), Minh Cao(3), Nicole Au(2), Lizhen Qu(1), Lyle C Gurrin(4), Ben W Mol(2,5)
(1)Department of Data Science & AI, Monash University, Clayton, Victoria, Australia
(2)
(3)Department of Obstetrics and Gynaecology, Monash University, Melbourne, Victoria, Australia
(4)Centre for Epidemiology and Biostatistics, School of Population and Global Health, The University of Melbourne, Parkville, Australia
(5)Department of Obstetrics and Gynaecology, Amsterdam University Medical Centre, Amsterdam, The Netherlands

Background and Objective: Randomized controlled trials (RCTs) are fundamental to evidence-based medicine, yet recent findings show that not all published RCTs are based on genuine data. Such integrity violations compromise systematic reviews and meta-analyses, misdirect clinical guidelines, and put public health at risk. Current methods for detecting problematic RCTs are time-consuming and labor-intensive. In our proof-of-concept paper, we proposed that large language models (LLMs) could automate data extraction, analysis, and integrity-checklist application, thereby accelerating the assessment of RCT trustworthiness.[1]

Methods: We now present follow-up results: (1) development of software that automates PDF data extraction and statistical validation, and (2) evaluation of the accuracy and efficiency of using LLMs to assess items from the TRACT checklist [2].
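The abstract does not spell out the statistical validation rules, so the sketch below is only a hedged illustration of one plausible check: verifying that an extracted count and denominator are consistent with the percentage reported alongside them. The function name and tolerance are assumptions made for illustration, not part of the described software.

    def percentage_consistent(count: int, denominator: int,
                              reported_pct: float, tol: float = 0.1) -> bool:
        """Return True if count/denominator agrees with the reported
        percentage within tol percentage points (allowing for rounding)."""
        if denominator <= 0:
            return False
        return abs(100.0 * count / denominator - reported_pct) <= tol

    # Example: 37 of 120 participants reported as 30.8%
    print(percentage_consistent(37, 120, 30.8))  # True (37/120 = 30.83%)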

Results: We built a Python prototype for table extraction using multimodal LLMs. Our automated data extraction showed promising performance: in a validation set of 35 tables from 10 papers, GPT-4 (OpenAI) achieved an F1 score of 0.748, while Qwen2.5 (Alibaba) reached 0.949. However, challenges remain with complex layouts such as merged cells, nested tables, and irregular spacing.
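As an illustration of the extraction step, the minimal sketch below sends a table image to a multimodal model and asks for a tab-separated transcription. The use of the OpenAI Python SDK, the "gpt-4o" model name, the prompt wording, and the extract_table helper are assumptions for illustration; the prototype's actual pipeline is not described in this abstract.

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_table(image_path: str) -> str:
        """Ask a multimodal model to transcribe a table image as
        tab-separated rows, preserving row and column order."""
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe every cell of this table as tab-separated rows. "
                             "Preserve row and column order; leave empty cells blank."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    print(extract_table("table_page.png"))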

The LLM agreed with human assessors on an average of 14 of 19 TRACT checklist items. It performed well on items that could be assessed directly from the PDF, but less well on items requiring more extensive background research and comparison with information in the existing literature or online sources. We also compared time efficiency: the LLM required 10 minutes per paper, compared with 75 minutes for manual screening, suggesting a substantial time reduction.
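For the checklist-application step, the sketch below shows one way a single item could be posed to an LLM and judged against the paper text. The prompt wording, the example item, and the assess_item helper are illustrative assumptions; they do not reproduce the TRACT wording or the prompts used in this study.

    from openai import OpenAI

    client = OpenAI()

    def assess_item(item: str, paper_text: str) -> str:
        """Ask the model to judge one checklist item against the paper,
        answering Yes / No / Unclear with a brief justification."""
        prompt = (
            "You are assessing the trustworthiness of a randomized controlled trial.\n"
            f"Checklist item: {item}\n"
            "Answer 'Yes', 'No', or 'Unclear', then give a one-sentence justification, "
            "citing only what appears in the paper text below.\n\n"
            f"Paper text:\n{paper_text}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Example item (paraphrased, illustrative only)
    print(assess_item("Is prospective trial registration reported?",
                      open("paper.txt").read()))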

[1]Au LS, et al. Using artificial intelligence to semi-automate trustworthiness assessment of randomized controlled trials: a case study. J Clin Epidemiol. 2025;180:111672.

[2]Mol BW, et al. Checklist to assess Trustworthiness in RAndomised Controlled Trials (TRACT checklist): concept proposal and pilot. Res Integr Peer Rev. 2023;8:6.