Model details: Intervention suggestions

Model information

Description

The intervention group extraction model automatically extracts a set of intervention groups reported in a study.

Inputs

Large Language Model (LLM)

The model extracts the intervention groups from the full-text PDF of the study.

Outputs

Text string (suggested extraction field values).

Model data

Large language model

We use existing large language models to extract relevant data.

Training dataset

We use samples from a historical dataset to inform model design. Instead of using the data directly, we analyse patterns from the selected dataset to create prompts.

Training approach

We do not train a model directly; instead, we use existing large language models such as Gemini Flash and tailor prompts to guide the model in making accurate predictions.

Evaluation

Methodology

We evaluated the Large Language Model (LLM) output against a benchmark dataset of 56 open-access studies.

To create the benchmark dataset, four Covidence employees annotated all studies, extracting all of the reported intervention groups from the full-text PDF. Two employees, who were not part of the initial annotation, then reached consensus on the individual annotations, grouping similar intervention groups.

We used our model to extract intervention groups from the benchmark studies. We presented these predictions, along with the benchmark data, to three reviewers for assessment.

Based on their evaluations, we calculated the following metrics:

Precision: The proportion of intervention groups extracted by the model that are correct.
Recall: The proportion of benchmark intervention groups successfully extracted by the model.
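
As a minimal sketch of how these two metrics follow from reviewer judgements, the Python below computes them from three counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the evaluation data, and this is not necessarily how the evaluation was implemented.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions between 0 and 1.

    true_positives:  model-extracted groups judged correct
    false_positives: model-extracted groups judged incorrect
    false_negatives: benchmark groups the model did not extract
    """
    extracted = true_positives + false_positives   # everything the model returned
    benchmark = true_positives + false_negatives   # everything it should have returned
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / benchmark if benchmark else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not the benchmark results).
precision, recall = precision_recall(
    true_positives=9, false_positives=1, false_negatives=3
)
print(f"precision={precision:.2%}, recall={recall:.2%}")
# prints: precision=90.00%, recall=75.00%
```

In this illustration the model returns 10 groups, 9 of them correct, and misses 3 of the 12 benchmark groups.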

To assess suitability, we compared the precision and recall to typical human rates reported in existing publications, such as Tang et al. (2025). Tang et al. reported a 39.54-44.88% error rate for data extraction of results data, which corresponds to a human accuracy of roughly 55-60%.
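
The accuracy range quoted above is the complement of the reported error rates. A short sketch of that arithmetic, assuming accuracy is simply 100% minus the error rate (the function name is hypothetical):

```python
def accuracy_range_from_error_rates(low_error_pct, high_error_pct):
    """Convert an error-rate range (in percent) to the matching accuracy range."""
    lowest_accuracy = round(100 - high_error_pct, 2)   # highest error gives lowest accuracy
    highest_accuracy = round(100 - low_error_pct, 2)   # lowest error gives highest accuracy
    return lowest_accuracy, highest_accuracy


# Error rates as reported in the text above.
low, high = accuracy_range_from_error_rates(39.54, 44.88)
print(f"{low}% to {high}%")
# prints: 55.12% to 60.46%
```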

Results

The model achieved a precision of 98.15% and a recall of 95.16%, both well above the typical human accuracy of roughly 55-60% reported by Tang et al. (2025).

Intended usage & limitations

Benefit & intended usage

The model provides highly accurate extraction suggestions to reviewers completing data extraction, saving time and effort when extracting intervention groups.

Known limitations

Model limitations

Only available for studies with a DOI linked in Covidence.
Access to the full-text PDF is required, either through an accessible open access link or user-uploaded content.

Evaluation limitations

Only a limited number of studies (56) were used to evaluate performance.
Studies used for evaluation were drawn exclusively from the medical and health sciences.