Model information
Description
The study characteristic extraction model automatically extracts a set of study characteristics, retrieving values for the following data points:
Author’s name (first author)
Address (first author)
Country
Email
Institution (first author)
Sponsorship source (funding sources)
We use the following tools to automate these fields:
| OpenAlex | Large Language Model (LLM) |
|---|---|
| Author’s name | Address |
| Institution | Country |
| | Email |
| | Sponsorship source |
Inputs
OpenAlex
The OpenAlex API requires users to provide a valid study DOI to retrieve results.
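For illustration, here is a minimal sketch of a DOI-based lookup against the public OpenAlex works endpoint. The endpoint and response fields shown are from OpenAlex's published API; the surrounding code is an assumption for illustration, not the production implementation.

```python
import requests

def fetch_first_author(doi: str) -> dict:
    """Query the OpenAlex works endpoint by DOI and return first-author details."""
    response = requests.get(f"https://api.openalex.org/works/doi:{doi}")
    response.raise_for_status()
    work = response.json()

    # OpenAlex lists authors in order; the first entry is the first author.
    first = work["authorships"][0]
    return {
        "author_name": first["author"]["display_name"],
        "institutions": [inst["display_name"] for inst in first.get("institutions", [])],
    }

# Example: fetch_first_author("10.7717/peerj.4375")
```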
Large Language Model (LLM)
The LLM extracts its fields from the study’s full-text PDF.
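As a rough sketch of this step (the prompt wording and the `complete` call are hypothetical stand-ins, not the production pipeline):

```python
FIELDS = ["Address", "Country", "Email", "Sponsorship source"]

def build_extraction_prompt(pdf_text: str) -> str:
    """Assemble a field-extraction prompt over the study's full text."""
    field_list = "\n".join(f"- {field}" for field in FIELDS)
    return (
        "Extract the following study characteristics from the article text "
        "below. Return 'not reported' if a field is absent.\n"
        f"{field_list}\n\n"
        f"Article text:\n{pdf_text}"
    )

# suggestion = complete(build_extraction_prompt(pdf_text))  # hypothetical LLM call
```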
Outputs
Text strings (suggested extraction field values). Some extraction field values are accompanied by supporting evidence from the study (e.g. a verbatim quote citing a funding source, in the case of sponsorship suggestions).
Model data
OpenAlex
OpenAlex aggregates and curates data from various sources, including ORCID, ROR, DOAJ, and PubMed. More details are available in the OpenAlex documentation.
Training dataset
Not applicable. We rely solely on querying OpenAlex to extract relevant information.
Training approach
Not applicable. We rely solely on querying OpenAlex to extract relevant information.
Large language model
We use existing large language models to extract relevant data.
Training dataset
We use samples from a historical dataset to inform model design. Instead of using the data directly, we analyse patterns in the selected dataset to inform prompt design.
Training approach
We do not conduct direct training; instead, we use existing large language models such as Gemini Flash and tailor prompts to guide the model towards accurate predictions.
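As an illustration of what prompt tailoring can look like in practice, patterns observed in the historical sample are encoded as explicit instructions. The wording below is a hypothetical example, not the production prompt.

```python
# Hypothetical prompt fragment shaped by patterns seen in historical data,
# e.g. funding statements being confused with conflict-of-interest sections.
SPONSORSHIP_INSTRUCTIONS = """
Identify the funding sources of this study.
- Funding is usually declared under headings such as 'Funding',
  'Acknowledgements', or 'Financial support'.
- Do not report conflict-of-interest disclosures as funding sources.
- If the study states it received no funding, answer 'no funding reported'.
- Quote the sentence that supports each funding source verbatim.
"""
```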
Evaluation
Approach
To assess suitability, we compared the acceptance rate to typical human accuracy rates reported in the literature: Li et al. (2019) and King et al. (2024). These papers indicate that human accuracy ranges from 80% to 85%, based on a reported error rate of 15–20% for data extraction.
Sponsorship source
We curated a gold dataset containing 107 open access studies, covering a wide range of study types, journals, publishers and publication ages. The dataset contains 64 positive-case studies with at least one funding source and 43 negative-case studies with no funding source (funding either explicitly reported as absent or not reported at all).
We evaluated the funding source extraction model using a combination of an automated LLM judge and human reviewers; both the extracted values and the supporting quotes were assessed.
LLM judgements were used to prioritise which extractions needed human review: low-priority cases were those where the extraction matched the ground truth, and high-priority cases were those where the extraction either missed ground truth data or included data not found in the ground truth.
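A minimal sketch of this triage rule, assuming extractions and ground truth are compared as sets of normalised funding-source strings (the helper names are ours, not the production code):

```python
def review_priority(extracted: set[str], ground_truth: set[str]) -> str:
    """Flag extractions for deep human review when they diverge from ground truth."""
    missed = ground_truth - extracted   # ground-truth items the model failed to extract
    extra = extracted - ground_truth    # extracted items absent from the ground truth
    return "high" if (missed or extra) else "low"

# review_priority({"nih"}, {"nih"})           -> "low"
# review_priority({"nih", "pfizer"}, {"nih"}) -> "high"
```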
For extracted values, two human evaluators independently assessed all LLM judgements, followed by a consensus stage. They conducted a deep dive on high-priority judgements, verifying against the PDF, and a lighter review on low-priority ones, verifying against the ground truth text.
For supporting quotes, we used an asymmetric review approach, where one human evaluator completed the primary review of all LLM judgements, with deep dives on high‑priority rows, verifying against the PDF, and lighter checks on low‑priority ones, verifying against the ground truth text. The second evaluator then reviewed these judgements, focusing on disagreements and edge cases. Any differences were discussed and resolved by the two reviewers.
All other fields
We recorded user interactions with extraction suggestions in Covidence’s Extraction 1 offering from 8th June 2025 to 4th December 2025 (141,554 data points), focusing on the acceptance and rejection of suggested values.
The acceptance rate was calculated by dividing the number of accepted suggestions by the total number of accepted and rejected suggestions.
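Expressed as code, this is a restatement of the formula above, not the production pipeline:

```python
def acceptance_rate(accepted: int, rejected: int) -> float:
    """Share of suggestions accepted, out of all accepted or rejected suggestions."""
    return accepted / (accepted + rejected)
```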
Results
Sponsorship source
The performance for extracted values and supporting quotes was:
| Attribute | Precision | Recall |
|---|---|---|
| Extracted value | 92.2% (95% CI: 88.6% – 95.7%) | 100.0% (95% CI: 100.0% – 100.0%) |
| Supporting quotes | 87.1% (95% CI: 81.8% – 92.1%) | 99.3% (95% CI: 97.7% – 100.0%) |
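For reference, precision and recall here follow their standard definitions over extracted funding sources. This is a sketch under that assumption; the confidence intervals in the table were computed separately.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Precision: share of extractions that are correct.
    Recall: share of ground-truth items that were extracted."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```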
The LLM over-identified funding sources (false positives) in a small number of studies, primarily because conflict-of-interest statements and non-financial study sponsors were extracted. The model did not hallucinate when extracting values and supporting quotes.
All other fields
The mean acceptance rate for all study characteristic attributes is 97.44% (95% CI: 96.37%, 98.50%), based on a sample size of 136,081 data points.
The performance per attribute was:
| Attribute | Acceptance rate | Sample size |
|---|---|---|
| Author’s name | 98.20% | 53,899 |
| Institution | 97.00% | 34,188 |
| Country | 98.00% | 28,923 |
| Email | 98.47% | 12,256 |
| Address | 95.51% | 6,815 |
Intended usage & limitations
Benefit & intended usage
The model can be used to provide highly accurate extraction suggestions to reviewers completing data extraction, saving time and effort when extracting study characteristic fields.
Known limitations
Model limitations
Only available for studies with a DOI linked in Covidence.
Access to the full-text PDF is required to retrieve the Address, Country, Email, and Sponsorship Source fields, either through an accessible and readable open access link or user-uploaded content.
Coverage of extraction values for the Author's name and Institution fields depends on data availability in OpenAlex.
Evaluation limitations
The evaluation for author’s name, institution, country, email, and address is based on interactions with suggestions, which may have influenced human judgements and affected the acceptance rates.
The evaluation for sponsorship source extractions is based on a curated sample and may not cover every journal style. Performance could dip on unusual layouts or formats.
Sponsorship source extractions are evaluated on English-language papers only. Performance may differ for papers written in other languages.