Model details: Extraction suggestions

Last updated on 17 Jun, 2026

Model information

Description

The study characteristic extraction model automatically extracts a set of study characteristics using a Large Language Model (LLM). It retrieves values for the following data points:

  • Author’s name (first author)

  • Address (first author)

  • Country (in which study was conducted)

  • Email (corresponding author)

  • Institution (first author)

  • Sponsorship source (funding sources)

Inputs

The model extracts fields from the full PDF.

Outputs

Text string (suggested extraction field values). Some extraction field values are accompanied by supporting evidence from the study (e.g. verbatim quote citing a funding source in the case of sponsorship suggestions).
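As an illustration of the output shape described above, a suggestion can be modelled as a text value with an optional supporting quote. This is a minimal sketch only: the class name `ExtractionSuggestion`, its field names and the example values are hypothetical and are not taken from the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionSuggestion:
    """Hypothetical container for one suggested extraction field value."""

    field: str                               # e.g. "Sponsorship source"
    value: str                               # the suggested text string
    supporting_quote: Optional[str] = None   # verbatim evidence, when available

    def has_evidence(self) -> bool:
        # True only when a non-empty quote accompanies the value.
        return bool(self.supporting_quote and self.supporting_quote.strip())


# Illustrative values only; this is not real extraction output.
suggestion = ExtractionSuggestion(
    field="Sponsorship source",
    value="Example Research Council",
    supporting_quote="This work was funded by the Example Research Council.",
)
```

Keeping the quote optional reflects that only some field values are accompanied by supporting evidence.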

Model data

We use existing large language models to extract relevant data.

Training dataset

We use samples from a historical dataset to inform model design. Instead of using the data directly, we analyse patterns from the selected dataset to create prompts.

Training approach

We do not train a model directly; instead, we use existing large language models such as Gemini Flash and tailor prompts to guide them towards accurate predictions.

Evaluation

Approach

To assess suitability, we compared the model's acceptance rate with typical human accuracy rates reported in two publications: Li et al. (2019) and King et al. (2024). These papers report a 15-20% error rate for human data extraction, which implies human accuracy of 80-85%.

We curated a gold dataset of open access studies, including a wide range of study types, journals, publishers and publication ages.

We evaluated the extraction model using a combination of an automated LLM judge and human reviewers. Both the extracted values and the supporting quotes were assessed in this way.

LLM judgements were used to prioritise which extractions needed human review. Low-priority cases were those where the extraction matched the ground truth. High-priority cases were those where the extraction either missed ground truth data or included data not found in the ground truth.
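The prioritisation rule above can be sketched as a small function. This sketch rests on assumptions: the article does not describe how the LLM judge compares values, so exact matching on normalised strings is substituted here, and the names `normalise` and `review_priority` are hypothetical.

```python
def normalise(value: str) -> str:
    # Assumed normalisation: ignore case and collapse runs of whitespace.
    return " ".join(value.lower().split())


def review_priority(extracted: set[str], ground_truth: set[str]) -> str:
    """Return "low" when the extraction matches the ground truth, else "high"."""
    got = {normalise(v) for v in extracted}
    want = {normalise(v) for v in ground_truth}
    missed = want - got   # ground truth data the extraction missed
    extra = got - want    # extracted data not found in the ground truth
    return "high" if (missed or extra) else "low"


# Illustrative inputs only.
print(review_priority({"Example Research Council"}, {"example research council"}))
print(review_priority({"Funder A"}, {"Funder A", "Funder B"}))
```

In this sketch the first call is low priority because the values match after normalisation, and the second is high priority because one ground truth funder was missed.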

For extracted values, two human evaluators independently assessed all LLM judgements, followed by a consensus stage. They conducted a deep dive on high-priority judgements, verifying against the PDF, and a lighter review on low-priority ones, verifying against the ground truth text.

For supporting quotes, we used an asymmetric review approach. One human evaluator completed the primary review of all LLM judgements, with deep dives on high-priority rows, verifying against the PDF, and lighter checks on low-priority ones, verifying against the ground truth text. The second evaluator then reviewed these judgements, focusing on disagreements and edge cases. Any differences were discussed and resolved by the two reviewers.

Results

Performance for extracted values was:

| Field | n | Precision | Recall |
|---|---|---|---|
| Author’s name | 142 | 93.7% (95% CI: 89.4% – 97.2%) | 100.0% (95% CI: 100.0% – 100.0%) |
| Address | 142 | 95.7% (95% CI: 92.2% – 98.6%) | 99.3% (95% CI: 97.8% – 100.0%) |
| Country | 142 | 96.3% (95% CI: 93.1% – 99.3%) | 99.2% (95% CI: 97.6% – 100.0%) |
| Email | 142 | 99.3% (95% CI: 97.9% – 100.0%) | 100.0% (95% CI: 100.0% – 100.0%) |
| Institution | 142 | 94.3% (95% CI: 90.0% – 97.9%) | 99.3% (95% CI: 97.7% – 100.0%) |
| Sponsorship source | 107 | 92.2% (95% CI: 88.6% – 95.7%) | 100.0% (95% CI: 100.0% – 100.0%) |

Performance for supporting quotes was:

| Field | n | Precision | Recall |
|---|---|---|---|
| Author’s name | 142 | 98.5% (95% CI: 96.3% – 100.0%) | 95.0% (95% CI: 91.4% – 97.9%) |
| Address | 142 | 94.4% (95% CI: 90.1% – 97.9%) | 100.0% (95% CI: 100.0% – 100.0%) |
| Country | 142 | 97.7% (95% CI: 94.9% – 100.0%) | 96.2% (95% CI: 92.6% – 99.2%) |
| Email | 142 | 97.1% (95% CI: 94.2% – 99.3%) | 99.3% (95% CI: 97.1% – 100.0%) |
| Institution | 142 | 95.8% (95% CI: 92.3% – 98.6%) | 100.0% (95% CI: 100.0% – 100.0%) |
| Sponsorship source | 107 | 87.1% (95% CI: 81.8% – 92.1%) | 99.3% (95% CI: 97.7% – 100.0%) |
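For readers who want to see how figures of this kind can be produced, the following is a minimal sketch of precision, recall and a 95% confidence interval. The article does not state how its metrics or intervals were computed; this sketch assumes each reviewed item is labelled as a true positive, false positive or false negative, and it uses a percentile bootstrap as one common choice. The function names and the counts are hypothetical, not the evaluation data.

```python
import random


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    # Precision: share of extracted items that are correct.
    # Recall: share of ground truth items that were extracted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def precision_of(outcomes: list[str]) -> float:
    return precision_recall(
        outcomes.count("tp"), outcomes.count("fp"), outcomes.count("fn")
    )[0]


def bootstrap_ci(outcomes, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over outcome labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    stats = sorted(
        metric(rng.choices(outcomes, k=len(outcomes))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative counts only: 90 correct extractions and 10 incorrect ones.
outcomes = ["tp"] * 90 + ["fp"] * 10
point = precision_of(outcomes)
low, high = bootstrap_ci(outcomes, precision_of)
print(f"{point:.1%} (95% CI: {low:.1%} to {high:.1%})")
```

With a percentile bootstrap, a sample in which every outcome is correct yields a degenerate interval from 100% to 100%, because every resample is also entirely correct.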

Intended usage & limitations

Benefit & intended usage

The model can be used to provide highly accurate extraction suggestions to reviewers completing data extraction, saving time and effort when extracting study characteristic fields.

Known limitations

Model limitations

  • Access to the full-text PDF is required, either through an accessible and readable open access link or user-uploaded content.

  • Older, locked and scanned PDFs may not be readable, resulting in limited performance for these studies.

Evaluation limitations

  • The evaluation is based on a curated sample and may not cover every journal style. Performance could dip on unusual layouts or formats.

  • Extractions are evaluated on English-language papers only. Performance may differ for papers written in other languages.
