Model details: Extraction suggestions

Model information

Description

The study characteristic extraction model automatically extracts a set of study characteristics. It retrieves values for the following data points:

Author’s name (first author)
Address (first author)
Country
Email
Institution (first author)
Sponsorship source (funding sources)

We use the following tools to automate these fields:

OpenAlex	Large Language Model (LLM)
Author’s name, Institution	Address, Country, Email, Sponsorship source

Inputs

OpenAlex

Lookups against the OpenAlex API are keyed on the study's DOI, so a valid DOI is required to retrieve results.
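As an illustration of a DOI-keyed lookup, the sketch below builds an OpenAlex single-work URL from a DOI and reads the first author's name and institution from the returned record. It is a minimal sketch, not Covidence's implementation: the function names are hypothetical, and it assumes the work record exposes `authorships` entries with `author_position`, `author.display_name` and `institutions[].display_name`.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works/"


def openalex_url_for_doi(doi: str) -> str:
    """Build a single-work lookup URL from a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Accept DOIs pasted as full URLs as well as bare identifiers.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return OPENALEX_WORKS_URL + "doi:" + urllib.parse.quote(doi, safe="/")


def first_author_fields(work: dict) -> dict:
    """Read the first author's name and first listed institution from a work record."""
    for authorship in work.get("authorships", []):
        if authorship.get("author_position") == "first":
            institutions = authorship.get("institutions") or []
            author = authorship.get("author") or {}
            return {
                "author_name": author.get("display_name"),
                "institution": institutions[0].get("display_name") if institutions else None,
            }
    # No first author listed: return empty suggestions rather than guessing.
    return {"author_name": None, "institution": None}


def fetch_work(doi: str) -> dict:
    """Fetch the work record over HTTP (requires network access)."""
    with urllib.request.urlopen(openalex_url_for_doi(doi), timeout=10) as response:
        return json.load(response)
```

Calling `first_author_fields(fetch_work(doi))` would return both suggestions for a study, with `None` for any field that OpenAlex does not hold.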

Large Language Model (LLM)

The model extracts fields from the full PDF.

Outputs

Text string (suggested extraction field values). Some suggested values are accompanied by supporting evidence from the study (e.g. a verbatim quote citing a funding source, in the case of sponsorship suggestions).

Model data

OpenAlex

OpenAlex aggregates and curates data from various sources, including ORCID, ROR, DOAJ, and PubMed. More details are available in the OpenAlex documentation.

Training dataset

Not applicable. We rely solely on querying OpenAlex data to extract relevant information.

Training approach

Not applicable. We rely solely on querying OpenAlex data to extract relevant information.

Large Language Model (LLM)

We use existing large language models to extract relevant data.

Training dataset

We use samples from a historical dataset to inform model design. Instead of using the data directly, we analyse patterns from the selected dataset to create prompts.

Training approach

We do not train a model ourselves; instead, we use existing large language models such as Gemini Flash and tailor prompts to guide them towards accurate extractions.

Evaluation

Approach

To assess suitability, we compared the acceptance rate with typical human accuracy rates reported in two publications: Li et al. (2019) and King et al. (2024). These papers report a 15-20% error rate for human data extraction, which implies human accuracy of 80-85%.

Sponsorship source

We curated a gold dataset containing 107 open access studies, covering a wide range of study types, journals, publishers and publication ages. The dataset contains 64 positive-case studies with at least one funding source and 43 negative-case studies with no funding source reported (the study either explicitly stated that it received no funding or did not mention funding at all).

We evaluated the funding source extraction model using a combination of an automated LLM judge and human reviewers, both of which assessed the extracted values and the supporting quotes.

LLM judgements determined which extractions were prioritised for human review. A case was low priority when the extraction matched the ground truth, and high priority when the extraction either missed ground truth data or included data not found in the ground truth.
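The prioritisation rule can be sketched as a comparison between the extracted funders and the ground truth funders. This is only an illustration: in the evaluation the matching was performed by an LLM judge, whereas the sketch assumes funder names have already been normalised so that exact set comparison is meaningful. The function name and labels are hypothetical.

```python
def review_priority(extracted: set[str], ground_truth: set[str]) -> str:
    """Return 'low' when the extraction matches the ground truth, else 'high'."""
    missed = ground_truth - extracted   # ground truth funders the model did not return
    extra = extracted - ground_truth    # returned funders absent from the ground truth
    return "high" if (missed or extra) else "low"
```

A negative-case study with no extracted funders gives `review_priority(set(), set())`, which returns `"low"`; any missed or additional funder moves the case to `"high"`.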

For extracted values, two human evaluators independently assessed all LLM judgements, followed by a consensus stage. They conducted a deep dive on high-priority judgements, verifying against the PDF, and a lighter review on low-priority ones, verifying against the ground truth text.

For supporting quotes, we used an asymmetric review approach. One human evaluator completed the primary review of all LLM judgements, with deep dives on high-priority rows (verified against the PDF) and lighter checks on low-priority ones (verified against the ground truth text). The second evaluator then reviewed these judgements, focusing on disagreements and edge cases. The two reviewers discussed and resolved any differences.

All other fields

We recorded user interactions with extraction suggestions in Covidence’s Extraction 1 offering from 8th June 2025 to 4th December 2025 (141,554 data points), focusing on the acceptance and rejection of suggested values.

The acceptance rate was calculated by dividing the number of accepted suggestions by the total number of accepted and rejected suggestions.
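The calculation described above can be written as a short function. This is a sketch with a hypothetical function name; suggestions that were neither accepted nor rejected are left out of the denominator, as described.

```python
def acceptance_rate(accepted: int, rejected: int) -> float:
    """Accepted suggestions as a fraction of all accepted and rejected suggestions."""
    total = accepted + rejected
    if total == 0:
        raise ValueError("no accepted or rejected suggestions recorded")
    return accepted / total
```

For example, 3 accepted and 1 rejected suggestion give `acceptance_rate(3, 1)`, which returns `0.75`.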

Results

Sponsorship source

The performance for extracted values and supporting quotes was:

Attribute	Precision	Recall
Extracted value	92.2% (95% CI: 88.6% – 95.7%)	100.0% (95% CI: 100.0% – 100.0%)
Supporting quotes	87.1% (95% CI: 81.8% – 92.1%)	99.3% (95% CI: 97.7% – 100.0%)

The LLM over-identified funding sources (false positives) for a small number of studies. This was primarily because conflicts of interest and non-financial study sponsors were extracted as funding sources. We observed no hallucinated values or supporting quotes in this evaluation.
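For readers unfamiliar with the two metrics, the sketch below uses the standard definitions: precision is the share of extracted items that are correct, and recall is the share of ground truth items that were extracted. The function name is hypothetical, and the counts in the usage note are made up for illustration; they are not the evaluation's counts.

```python
from typing import Optional, Tuple


def precision_recall(
    true_pos: int, false_pos: int, false_neg: int
) -> Tuple[Optional[float], Optional[float]]:
    """Standard precision and recall; None when a denominator is zero."""
    extracted = true_pos + false_pos   # everything the model returned
    expected = true_pos + false_neg    # everything in the ground truth
    precision = true_pos / extracted if extracted else None
    recall = true_pos / expected if expected else None
    return precision, recall
```

With illustrative counts of 9 correct extractions, 1 false positive and 0 misses, `precision_recall(9, 1, 0)` returns `(0.9, 1.0)`: over-identification lowers precision but leaves recall untouched, which is the pattern in the table above.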

All other fields

The mean acceptance rate across all study characteristic attributes is 97.44% (95% CI: 96.37% – 98.50%), based on a sample size of 136,081 data points.

The performance per attribute was:

Attribute	Acceptance rate	Sample size
Author’s name	98.20%	53,899
Institution	97.00%	34,188
Country	98.00%	28,923
Email	98.47%	12,256
Address	95.51%	6,815
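The method behind the summary figures is not stated. One reading that reproduces them from the table is an unweighted mean of the five per-attribute rates, with a normal-approximation interval (z = 1.96) on the standard error of that mean. The sketch below checks that reading; it is an assumption about the method, not a confirmed description, and the variable names are illustrative.

```python
import math
import statistics

# Per-attribute acceptance rates (%) and sample sizes, copied from the table above.
rates = {
    "Author's name": 98.20,
    "Institution": 97.00,
    "Country": 98.00,
    "Email": 98.47,
    "Address": 95.51,
}
sizes = {
    "Author's name": 53899,
    "Institution": 34188,
    "Country": 28923,
    "Email": 12256,
    "Address": 6815,
}

values = list(rates.values())
mean_rate = statistics.mean(values)  # unweighted: each attribute counts equally
std_error = statistics.stdev(values) / math.sqrt(len(values))
z = 1.96  # normal approximation (assumed)
ci_low, ci_high = mean_rate - z * std_error, mean_rate + z * std_error
total_samples = sum(sizes.values())

print(f"mean {mean_rate:.2f}%, 95% CI {ci_low:.2f}% to {ci_high:.2f}%, n = {total_samples}")
# prints: mean 97.44%, 95% CI 96.37% to 98.50%, n = 136081
```

Under this reading, the interval reflects variation between attributes rather than sampling uncertainty within each attribute, and attributes with few data points (Address) carry the same weight as those with many (Author's name).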

Intended usage & limitations

Benefit & intended usage

The model can be used to provide highly accurate extraction suggestions to reviewers completing data extraction, saving time and effort when extracting study characteristic fields.

Known limitations

Model limitations

Only available for studies with a DOI linked in Covidence.
Access to the full-text PDF is required to retrieve the Address, Country, Email, and Sponsorship Source fields, either through an accessible and readable open access link or user-uploaded content.
Coverage of extraction values for the Author's name and Institution fields depends on data availability in OpenAlex.

Evaluation limitations

The evaluation for author’s name, institution, country, email, and address is based on user interactions with suggestions. Seeing a suggestion may have influenced human judgements and affected the acceptance rates.
The evaluation for sponsorship source extractions is based on a curated sample and may not cover every journal style. Performance could dip on unusual layouts or formats.
Sponsorship source extractions are evaluated on English-language papers only. Performance may differ for papers written in other languages.