Model information
Description
The study characteristic extraction model automatically extracts a set of study characteristics using a Large Language Model (LLM). It retrieves values for the following data points:
Author’s name (first author)
Address (first author)
Country (in which study was conducted)
Email (corresponding author)
Institution (first author)
Sponsorship source (funding sources)
Inputs
The model extracts fields from the full PDF.
Outputs
Text string (suggested extraction field values). Some extraction field values are accompanied by supporting evidence from the study (e.g. a verbatim quote citing a funding source, in the case of sponsorship suggestions).
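As an illustration of the output shape described above, a suggestion can be thought of as a field name, a text value and an optional supporting quote. This is a minimal sketch only: the class name `Suggestion`, its attribute names and the example values are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """Hypothetical container for one suggested extraction value."""
    field: str                              # e.g. "Sponsorship source"
    value: str                              # suggested text string
    supporting_quote: Optional[str] = None  # verbatim evidence, when available


# Illustrative values only, not real extractions.
with_evidence = Suggestion(
    field="Sponsorship source",
    value="Example Foundation",
    supporting_quote="This work was funded by the Example Foundation.",
)
without_evidence = Suggestion(field="Country", value="Kenya")
```

Keeping the quote optional reflects that only some field values come with supporting evidence.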
Model data
We use existing large language models to extract relevant data.
Training dataset
We use samples from a historical dataset to inform model design. Instead of using the data directly, we analyse patterns from the selected dataset to create prompts.
Training approach
We do not train a model directly. Instead, we use existing large language models such as Gemini Flash and tailor prompts to guide the model in making accurate predictions.
Evaluation
Approach
To assess suitability, we compared the acceptance rate to typical human accuracy rates reported in two publications: Li et al. (2019) and King et al. (2024). These papers report a 15-20% error rate for human data extraction, which implies human accuracy of 80-85%.
We curated a gold dataset of open access studies, including a wide range of study types, journals, publishers and publication ages.
We evaluated the extraction model using a combination of an automated LLM judge and human reviewers. Both the extracted values and the supporting quotes were assessed this way.
LLM judgements prioritised which extractions needed human review. Low-priority cases were where extractions matched the ground truth. High-priority cases were where the extraction either missed ground truth data or included data not found in the ground truth.
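The prioritisation rule above can be sketched as a small function. This is an illustration under stated assumptions: the function name `review_priority` is hypothetical, and exact set comparison stands in for the LLM judge, which in practice would also have to handle paraphrases and formatting differences.

```python
def review_priority(extracted: set, ground_truth: set) -> str:
    """Return "high" or "low" review priority for one extraction.

    Assumption: values are compared as exact strings; the real
    judgement is made by an LLM, not by set arithmetic.
    """
    missed = ground_truth - extracted   # ground truth data the extraction missed
    extra = extracted - ground_truth    # extracted data not in the ground truth
    return "high" if (missed or extra) else "low"


# Illustrative funder names only.
print(review_priority({"Funder A"}, {"Funder A"}))               # match
print(review_priority({"Funder A"}, {"Funder A", "Funder B"}))   # missed item
print(review_priority({"Funder A", "Funder C"}, {"Funder A"}))   # extra item
```

Only the first call yields a low-priority case; either kind of mismatch is enough to send the extraction for a deep-dive review.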
For extracted values, two human evaluators independently assessed all LLM judgements, followed by a consensus stage. They conducted a deep dive on high-priority judgements, verifying against the PDF, and a lighter review on low-priority ones, verifying against the ground truth text.
For supporting quotes, we used an asymmetric review approach. One human evaluator completed the primary review of all LLM judgements, with deep dives on high-priority rows (verified against the PDF) and lighter checks on low-priority ones (verified against the ground truth text). The second evaluator then reviewed these judgements, focusing on disagreements and edge cases. Any differences were discussed and resolved by the two reviewers.
Results
The performance for extracted values was:
Field | n | Precision | Recall |
Author’s name | 142 | 93.7% (95% CI: 89.4% – 97.2%) | 100.0% (95% CI: 100.0% – 100.0%) |
Address | 142 | 95.7% (95% CI: 92.2% – 98.6%) | 99.3% (95% CI: 97.8% – 100.0%) |
Country | 142 | 96.3% (95% CI: 93.1% – 99.3%) | 99.2% (95% CI: 97.6% – 100.0%) |
Email | 142 | 99.3% (95% CI: 97.9% – 100.0%) | 100.0% (95% CI: 100.0% – 100.0%) |
Institution | 142 | 94.3% (95% CI: 90.0% – 97.9%) | 99.3% (95% CI: 97.7% – 100.0%) |
Sponsorship source | 107 | 92.2% (95% CI: 88.6% – 95.7%) | 100.0% (95% CI: 100.0% – 100.0%) |
The performance for supporting quotes was:
Field | n | Precision | Recall |
Author’s name | 142 | 98.5% (95% CI: 96.3% – 100.0%) | 95.0% (95% CI: 91.4% – 97.9%) |
Address | 142 | 94.4% (95% CI: 90.1% – 97.9%) | 100.0% (95% CI: 100.0% – 100.0%) |
Country | 142 | 97.7% (95% CI: 94.9% – 100.0%) | 96.2% (95% CI: 92.6% – 99.2%) |
Email | 142 | 97.1% (95% CI: 94.2% – 99.3%) | 99.3% (95% CI: 97.1% – 100.0%) |
Institution | 142 | 95.8% (95% CI: 92.3% – 98.6%) | 100.0% (95% CI: 100.0% – 100.0%) |
Sponsorship source | 107 | 87.1% (95% CI: 81.8% – 92.1%) | 99.3% (95% CI: 97.7% – 100.0%) |
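For readers who want to reproduce figures of this kind, the sketch below computes precision and recall from per-item outcomes and attaches a percentile bootstrap interval. The method used to produce the intervals in the tables above is not stated here, so the bootstrap is an assumption chosen for illustration. The function names and the toy outcome counts are hypothetical.

```python
import random


def precision_recall(outcomes):
    """outcomes: list of "tp", "fp" or "fn" labels, one per judged item."""
    tp = outcomes.count("tp")
    fp = outcomes.count("fp")
    fn = outcomes.count("fn")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for precision and recall (assumed method)."""
    rng = random.Random(seed)
    precisions, recalls = [], []
    for _ in range(n_resamples):
        # Resample the judged items with replacement.
        sample = rng.choices(outcomes, k=len(outcomes))
        p, r = precision_recall(sample)
        precisions.append(p)
        recalls.append(r)

    def percentile_interval(values):
        values = sorted(v for v in values if v == v)  # drop NaN resamples
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        return lower, upper

    return percentile_interval(precisions), percentile_interval(recalls)


# Toy data: 90 correct items, 10 spurious items, nothing missed.
toy = ["tp"] * 90 + ["fp"] * 10
point_precision, point_recall = precision_recall(toy)
precision_ci, recall_ci = bootstrap_ci(toy)
```

With no missed items in the toy data, every resample has recall 1.0, so the recall interval collapses to a single point. A resampling method would produce intervals like the 100.0% – 100.0% rows above, but that does not confirm which method was actually used.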
Intended usage & limitations
Benefit & intended usage
The model can be used to provide highly accurate extraction suggestions to reviewers completing data extraction, saving time and effort when extracting study characteristic fields.
Known limitations
Model limitations
Access to the full-text PDF is required, either through an accessible and readable open access link or user-uploaded content.
Older, locked and scanned PDFs may not be readable, resulting in limited performance for these studies.
Evaluation limitations
The evaluation is based on a curated sample and may not cover every journal style. Performance could dip on unusual layouts or formats.
Extractions are evaluated on English-language papers only. Performance may differ for papers written in other languages.