From Cited Evidence Tables to Forest Plots: Meta-Analysis with AI-Extracted Data

TL;DR

Once extraction data is structured, every section of a systematic review or meta-analysis becomes a column query: study characteristics, subgroup analyses, sensitivity checks, PRISMA tables, and forest plots. Instill AI Collection produces a citation-backed evidence table that can be sanity-checked in chat, exported to CSV, and analyzed reproducibly in R with metafor. AI accelerates the handoff; R remains the publication-ready analysis layer.

You’ve screened your papers. You’ve extracted data from each one. Your Instill AI Collection is full of sample sizes, effect sizes, risk-of-bias ratings, and outcome scores. Now comes the part that justifies all that work: writing the actual systematic review (SR) paper and running the meta-analysis (MA).

But here’s what makes this phase frustrating: an SR paper isn’t one big narrative. It’s made up of 15–20 small sections, and each section is essentially a different cross-analysis of the same extraction data. The “Characteristics of Included Studies” table pulls from one set of columns. The forest plot pulls from another. The sensitivity analysis pulls from yet another. In a traditional Excel workflow, every section means building a new pivot table, writing new formulas, or manually re-sorting the same spreadsheet in different ways.

This article shows how Instill AI Collection makes this entire phase faster and less error-prone. If you’re not yet familiar with how Collection works for screening and data extraction, we recommend reading “Systematic Review with AI: Screen and Extract Data from Research Papers in Minutes” first, because this article builds on the structured output from that workflow.

Key Takeaways

Every section of an SR/MA paper is a cross-column query — the extraction table already contains all the raw material. Writing the paper becomes a matter of querying the right column pairs.
Preview forest plots in chat, publish from R — the agent generates an inline preview for quick sanity-checking, then produces a reproducible metafor script for the publication-ready version.
Minimal R pipeline: about 10 lines take you from CSV to a first forest plot, because the CSV export feeds directly into metafor (version 4.8+). Your final analysis may still require model choices, sensitivity checks, and reviewer judgment.
The structured data is reusable: via Model Context Protocol (MCP) integration, other AI tools can query your extraction data without manual export. Add a new paper, and connected tools see the updated dataset immediately.
Human-verified data, AI-accelerated analysis — the extraction table includes both AI-filled columns and human-judgment columns (Reviewer Confidence, Notes), ensuring the data feeding your meta-analysis has been human-reviewed.

Every Paper Section Maps to a Column Cross

Here’s something most researchers feel intuitively but rarely see stated explicitly: every section of a systematic review paper is a cross-analysis of two or more columns from the extraction table.

Let’s use a concrete example — a systematic review titled “Effect of Exercise Interventions on Major Depressive Disorder in Adults.” The extraction collection has 23 columns. Each paper section draws from a specific combination:

Paper Section	Columns Used	Output
Table 1: Study Characteristics	First Author & Year × Country × Exercise Type × Sample Size × Mean Age × Depression Measure	The standard “characteristics of included studies” table
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Flow Diagram	Screening Decision × Exclusion Reason	How many papers included/excluded and why
Intervention Summary	Exercise Type × Exercise Protocol × Sample Size	Studies used aerobic (n=5), resistance training (n=2), yoga (n=2)…
Overall Effect (Forest Plot)	Post Score × Standard Deviation (SD) × Sample Size (intervention + control)	The core meta-analytic result
Subgroup: By Exercise Type	Exercise Type × Effect Size	Aerobic exercise showed Standardized Mean Difference (SMD) = −0.72 while resistance training showed SMD = −0.55…
Subgroup: By Outcome Measure	Depression Measure × Effect Size	Do studies using Hamilton Depression Rating Scale (HAM-D) show different effects than those using Beck Depression Inventory-II (BDI-II)?
Sensitivity: Risk of Bias	Risk of Bias Notes × Effect Size	Does removing high-risk studies change the overall conclusion?
Publication Bias (Funnel Plot)	Effect Size × Sample Size	Visual test for publication bias
Risk of Bias Summary	Risk of Bias Notes (5 domains) × First Author	The red/yellow/green traffic light table
Inter-rater Reliability	Reviewer A columns × Reviewer B columns	Cohen’s κ for categorical data, Intraclass Correlation Coefficient (ICC) for continuous data

A typical SR/MA paper, following the PRISMA 2020 27-item checklist (Page et al., 2021, doi:10.1136/bmj.n71), contains 15–20 such sections, each one a structured query against the same extraction table.
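The Inter-rater Reliability row in the table above shows how little code such a column cross needs. The following is a minimal base-R sketch of Cohen’s κ for one categorical column rated by two reviewers. The ratings, the variable names, and the `cohen_kappa` helper are hypothetical illustrations, not part of Collection or of any package.

```r
# Hypothetical risk-of-bias ratings from two reviewers for the same six studies.
reviewer_a <- c("Low", "High", "Low", "Some concerns", "High", "Low")
reviewer_b <- c("Low", "High", "Some concerns", "Some concerns", "High", "Low")

# Cohen's kappa: agreement beyond what chance alone would produce.
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  levels_all <- sort(unique(c(a, b)))
  # Cross-tabulate with a shared level set so the table is always square.
  tab <- table(factor(a, levels = levels_all), factor(b, levels = levels_all))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                      # proportion of exact agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

kappa_rob <- cohen_kappa(reviewer_a, reviewer_b)
print(round(kappa_rob, 3))
```

Note that κ is undefined when both reviewers place every study in the same single category, because the chance-agreement term then equals 1. For continuous columns such as sample sizes or scores, an ICC from a dedicated package is the more appropriate statistic.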

Why This Matters for Collection Users

In a chat-based workflow, producing each section requires re-reading papers and re-asking questions. In a Collection-based workflow, the data is already there — you just query different column combinations.

For example, three sequential questions to the agent:

Group the included studies by Exercise Type. What’s the total sample size for each group?

This crosses Exercise Type × Sample Size — the raw material for the Intervention Summary section.

For the Aerobic group, which studies have high risk of bias?

This crosses Exercise Type (filtered to Aerobic) × Risk of Bias Notes — the foundation for a sensitivity analysis.

If I exclude those high-risk studies, what do the remaining effect sizes look like?

This is the actual sensitivity analysis — and the data to answer it is already structured in the collection.

Each question takes seconds to answer because the 345 data points are already extracted, structured, and queryable. No re-reading PDFs. No re-asking the AI to parse the same results table again.
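For readers who want to see what those three questions amount to once the data is exported, here is a rough base-R equivalent. The rows are invented, and the single `Sample_Size` and `Effect_Size` columns are simplifying assumptions; a real export would use whatever column names your collection defines.

```r
# Hypothetical rows standing in for an exported collection.
studies <- data.frame(
  First_Author_Year = c("Study A 2015", "Study B 2017", "Study C 2018",
                        "Study D 2019", "Study E 2020"),
  Exercise_Type = c("Aerobic", "Aerobic", "Resistance", "Yoga", "Aerobic"),
  Sample_Size = c(60, 45, 38, 52, 80),
  Risk_of_Bias_Notes = c("Low risk", "High risk: no allocation concealment",
                         "Low risk", "Some concerns", "Low risk"),
  Effect_Size = c(-0.70, -1.10, -0.55, -0.40, -0.65),
  stringsAsFactors = FALSE
)

# Question 1: total sample size per exercise type.
totals <- aggregate(Sample_Size ~ Exercise_Type, data = studies, FUN = sum)

# Question 2: aerobic studies flagged as high risk of bias.
aerobic <- subset(studies, Exercise_Type == "Aerobic")
aerobic_high <- subset(aerobic, grepl("High risk", Risk_of_Bias_Notes))

# Question 3: effect sizes that remain once those studies are excluded.
aerobic_remaining <- subset(aerobic, !grepl("High risk", Risk_of_Bias_Notes))

print(totals)
print(aerobic_high$First_Author_Year)
print(aerobic_remaining[, c("First_Author_Year", "Effect_Size")])
```

Each question is a filter or a group-by over two columns, which is why it can be answered without returning to the PDFs.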

From Collection to Forest Plot: The R Pipeline

The forest plot is the signature output of a meta-analysis — the figure that shows each study’s effect size and the pooled overall effect. Generating one requires exactly the columns that Collection extracts.

Step 1: Export Your Collection

Export the consensus extraction collection (the reconciled dataset from both reviewers) as CSV. The file will have columns like:

First_Author_Year, Exercise_Type, Sample_Size_Intervention, Sample_Size_Control,
Post_Score_Intervention, Post_Score_Control, SD_Post_Intervention, SD_Post_Control, ...

Step 2: Sanity-Check Your Data in Chat

Before opening RStudio, you can validate your extraction data without leaving the conversation. Mention your collection and ask the agent to analyze it — for example, checking data completeness, computing summary statistics, or previewing effect sizes.

The agent loads the collection as a dataframe, runs Python code on the spot, and reports results inline. In our demo, we ask the agent to inspect the six columns required for meta-analysis and count how many studies have complete numeric data.

This kind of quick validation catches extraction errors early — missing values, text flags where numbers should be, or mismatched column names — before you commit to the formal analysis.
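If you prefer to repeat that completeness check locally, the following base-R sketch does roughly the same thing. The data frame is invented to show two typical problems (a text flag and a blank cell); the column names follow the export listed in Step 1, and everything else is an assumption for illustration.

```r
# Hypothetical export containing a text flag and a blank where numbers belong.
export <- data.frame(
  First_Author_Year = c("Study A 2015", "Study B 2017", "Study C 2018", "Study D 2019"),
  Sample_Size_Intervention = c("30", "22", "19", "26"),
  Sample_Size_Control = c("30", "23", "19", "26"),
  Post_Score_Intervention = c("11.2", "9.8", "13.1", "12.4"),
  Post_Score_Control = c("15.6", "15.5", "16.0", "15.1"),
  SD_Post_Intervention = c("4.1", "3.9", "", "4.4"),
  SD_Post_Control = c("4.5", "CI reported: [12.3, 18.7]", "5.0", "4.8"),
  stringsAsFactors = FALSE
)

required <- c("Sample_Size_Intervention", "Sample_Size_Control",
              "Post_Score_Intervention", "Post_Score_Control",
              "SD_Post_Intervention", "SD_Post_Control")

# Mismatched column names surface here, before any analysis runs.
missing_cols <- setdiff(required, names(export))
stopifnot(length(missing_cols) == 0)

# Coerce to numeric; text flags and blanks become NA.
numeric_view <- as.data.frame(
  lapply(export[required], function(x) suppressWarnings(as.numeric(x)))
)

complete <- complete.cases(numeric_view)
problems_per_column <- colSums(is.na(numeric_view))

cat("Studies with complete numeric data:", sum(complete), "of", nrow(export), "\n")
print(export$First_Author_Year[!complete])
print(problems_per_column[problems_per_column > 0])
```

Rows that fail the check are candidates for manual review rather than silent exclusion; a flagged confidence interval, for instance, may still be convertible to an SD.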

You can go further: ask the agent to compute SMDs, generate a preview forest plot, or produce a reproducible R script using metafor. The same collection data feeds every request. Think of the chat as a scratchpad for exploration — fast, interactive, and disposable.

Step 3: Run the Definitive Meta-Analysis in R

For the publication-ready analysis — the one reviewers and editors will scrutinize — use metafor in R. The metafor package (Viechtbauer, 2010) supports random-effects models, multilevel meta-analyses, meta-regression, and over 20 visualization types including forest plots, funnel plots, and bubble plots. The same CSV export feeds both the in-chat preview and the formal pipeline. The code is remarkably short:

library(metafor)

data <- read.csv("extraction_export.csv")

# Calculate Standardized Mean Differences
data <- escalc(measure = "SMD",
               m1i = Post_Score_Intervention,
               sd1i = SD_Post_Intervention,
               n1i = Sample_Size_Intervention,
               m2i = Post_Score_Control,
               sd2i = SD_Post_Control,
               n2i = Sample_Size_Control,
               data = data)

# Random-effects meta-analysis
model <- rma(yi, vi, data = data)
summary(model)

# Forest Plot — the core figure of any meta-analysis
forest(model, slab = data$First_Author_Year)

# Funnel plot: a visual check for asymmetry that may indicate publication bias
funnel(model)

The minimal pipeline is short — about 10 lines from CSV to a first forest plot. The reason it’s so simple is that Collection already enforces the structure that metafor expects — each column maps directly to a function parameter. Your final analysis may still require model choices, sensitivity checks, and reviewer judgment.
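To make the column-to-parameter mapping less of a black box, here is a hand calculation of a standardized mean difference for a single study, using only base R. The numbers are invented. The sketch uses the common approximation to the small-sample correction factor; to our understanding, metafor’s escalc() applies an exact version of that correction, so its values can differ slightly in the later decimals.

```r
# Hypothetical summary data for one study (lower score = less depression).
m1 <- 11.2; sd1 <- 4.1; n1 <- 30   # intervention group: mean, SD, sample size
m2 <- 15.6; sd2 <- 4.5; n2 <- 30   # control group: mean, SD, sample size

# Pooled standard deviation across the two groups.
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))

# Cohen's d, then the small-sample correction that turns it into Hedges' g.
d <- (m1 - m2) / sd_pooled
df <- n1 + n2 - 2
j <- 1 - 3 / (4 * df - 1)   # approximate correction factor (an assumption of this sketch)
g <- j * d

# Large-sample approximation to the sampling variance of g.
v <- 1 / n1 + 1 / n2 + g^2 / (2 * (n1 + n2))

# 95% confidence interval for this single study.
ci <- g + c(-1, 1) * qnorm(0.975) * sqrt(v)
print(round(c(g = g, v = v, ci_lower = ci[1], ci_upper = ci[2]), 3))
```

The six inputs here are the same six columns passed to escalc() above (m1i, sd1i, n1i, m2i, sd2i, n2i), and the two outputs correspond to the yi and vi columns that rma() consumes.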

The in-chat preview is for exploration and sanity-checking. The reproducible, publication-ready analysis lives in your R environment. This is by design: AI accelerates your workflow; it doesn’t replace the peer-reviewed statistical tooling your field already trusts.

Step 4: Subgroup and Sensitivity Analyses

Want to see if aerobic exercise works better than yoga? Add one line:

# Subgroup comparison: meta-regression with Exercise Type as a categorical moderator
model_sub <- rma(yi, vi, mods = ~ Exercise_Type, data = data)
summary(model_sub)

Want to test whether the overall effect holds after removing high-risk studies? Filter and re-run:

# Sensitivity: exclude high-risk studies
low_risk <- subset(data, !grepl("High risk", Risk_of_Bias_Notes))
model_sens <- rma(yi, vi, data = low_risk)
forest(model_sens, slab = low_risk$First_Author_Year)

Each of these analyses maps directly to a section of the final paper. The data pipeline is: Collection → CSV → R → paper figure/table. The in-chat preview lets you iterate quickly on which columns and filters to use; the R script is the audit-ready artifact you submit.
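To see numerically what “filter and re-run” changes, the sketch below compares pooled estimates with and without a high-risk study, and adds a leave-one-out pass. It deliberately uses fixed-effect inverse-variance pooling in base R, which is a simpler estimator than the random-effects model that rma() fits, so treat it as an illustration of the logic rather than a substitute for the metafor output. All values and the `pool_iv` helper are hypothetical.

```r
# Hypothetical effect sizes (yi) and sampling variances (vi), e.g. as produced by escalc().
dat <- data.frame(
  First_Author_Year = c("Study A 2015", "Study B 2017", "Study C 2018", "Study D 2019"),
  yi = c(-0.70, -1.10, -0.55, -0.40),
  vi = c(0.04, 0.10, 0.05, 0.08),
  Risk_of_Bias_Notes = c("Low risk", "High risk: no blinding", "Low risk", "Some concerns"),
  stringsAsFactors = FALSE
)

# Fixed-effect inverse-variance pooling (NOT the random-effects model rma() fits).
pool_iv <- function(yi, vi) {
  w <- 1 / vi                      # each study is weighted by its precision
  est <- sum(w * yi) / sum(w)
  se <- sqrt(1 / sum(w))
  c(estimate = est,
    ci_lower = est - qnorm(0.975) * se,
    ci_upper = est + qnorm(0.975) * se)
}

all_studies <- pool_iv(dat$yi, dat$vi)
low_risk <- subset(dat, !grepl("High risk", Risk_of_Bias_Notes))
without_high_risk <- pool_iv(low_risk$yi, low_risk$vi)

# Leave-one-out: pooled estimate with each study removed in turn.
leave_one_out <- sapply(seq_len(nrow(dat)), function(i) {
  pool_iv(dat$yi[-i], dat$vi[-i])[["estimate"]]
})
names(leave_one_out) <- dat$First_Author_Year

print(round(rbind(all_studies, without_high_risk), 3))
print(round(leave_one_out, 3))
```

If the pooled estimate shifts noticeably when one study is dropped, that study deserves a sentence in the sensitivity section. For the publication version, metafor’s own leave1out() function applies the same idea to a fitted rma model.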

The Before and After

Here’s how the full systematic review workflow compares:

Step	Before (Excel + Covidence)	After (Instill AI Collection)
Read PDF and copy data to spreadsheet	Manual, ~30 min per paper	AI generates cited first-pass extraction; reviewers verify
Trace “where did this number come from?”	Not possible in Excel	Every value links to source paragraph
Dual-reviewer extraction	Two separate Excel files, manual comparison	Four Collections, same schema, automated diff
Cross-paper analysis	Write Excel formulas or pivot tables	Ask the agent in natural language
Prepare data for R	Manually reformat columns and clean data	Export CSV; columns already match metafor parameters
Add a new paper to the review	Re-do extraction from scratch	Add one row, autofill runs automatically
Generate PRISMA table	Manual formatting	Ask: “Generate a characteristics-of-included-studies table”

Living Reviews with MCP Integration

Traditional systematic reviews are static — published once, outdated within months as new studies appear. A “living systematic review” aims to keep the evidence current by continuously incorporating new research.

Instill AI Collection enables this through MCP integration. Other AI tools (Claude, ChatGPT, Cursor, or custom workflows) can query your Collection data programmatically:

Available via Instill AI MCP:
- query-collection: Retrieve extracted data with filters
- summarize-column: Get statistics per column
- aggregate-by-column: Group by exercise type, compute mean effect sizes

This means your extraction data becomes a live API endpoint. A research assistant using Claude can ask “What’s the current pooled effect size for aerobic exercise interventions?” and get an answer grounded in your structured, citation-backed data — without opening Instill AI or exporting a CSV.

When a new randomized controlled trial (RCT) is published, you add it as a row in the Collection. Autofill extracts the data. You verify the AI output and fill the human-judgment columns (Reviewer Confidence, Notes). The MCP-connected tools immediately see the updated dataset. The analysis stays current without rebuilding anything from scratch.

FAQ

What are the best tools for meta-analysis?

For the statistical analysis itself, metafor (R), Stata’s metan, and RevMan are the established tools. The bottleneck is getting data into those tools: extracting numbers from PDFs into a structured format. Instill AI Collection handles this extraction step — producing CSV output that feeds directly into metafor with no reformatting.

How do I create a forest plot in R?

Export your Instill AI Collection as CSV, then use the metafor package: escalc() to compute standardized mean differences, rma() for the random-effects model, and forest() to render the plot. The minimal pipeline is about 10 lines of R code — see the Step 3 section above for the complete baseline script.

How do I use metafor for forest plots?

Install with install.packages("metafor"), load your CSV with read.csv(), compute effect sizes with escalc(measure="SMD", ...), fit the model with rma(yi, vi, data=data), and call forest(model, slab=data$First_Author_Year). The key requirement is that your data has separate columns for means, SDs, and sample sizes for each group, which is exactly what Collection’s extraction columns produce.

What is the difference between a systematic review and a meta-analysis?

A systematic review is the complete research process: defining a question, searching databases, screening papers, extracting data, and synthesizing findings. A meta-analysis is the statistical component: pooling effect sizes across studies to compute an overall estimate. Not all systematic reviews include a meta-analysis (some synthesize findings narratively), and a credible meta-analysis rests on a systematic review as its foundation. Instill AI Collection supports both: the extraction workflow produces the data, and the CSV export feeds the statistical analysis.

Can I update a meta-analysis when new studies are published?

Yes — this is what “living systematic reviews” are for. With Instill AI Collection, you add a new paper as a row, let autofill extract the data, verify the output, and the updated dataset is immediately available via MCP to any connected tool. Re-run your R script on the new CSV export to get an updated forest plot.

Try It Yourself

We’ve built out the full data extraction collections from the exercise-and-depression example — the same collections whose CSV export feeds the R pipeline above. Instill AI is currently in closed beta, so email hello@instill-ai.com for a guided walkthrough of these collections.

Compare Reviewer A and B side by side — the column schemas are identical, but open any column’s property panel and you’ll see different extraction instructions. For example, when a paper reports only confidence intervals instead of standard deviations, Reviewer A flags “CI reported: [12.3, 18.7]” while Reviewer B computes the SD from the CI. This controlled disagreement is how Collection operationalises the Cochrane dual-reviewer requirement.
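Reviewer B’s conversion can be sketched in a few lines of R. This is the standard back-calculation described in the Cochrane Handbook for a confidence interval around a single group’s mean; it does not apply to a CI for a between-group difference. The group size of 30 and the `sd_from_ci` helper are hypothetical, and the interval bounds are taken from the example flag above.

```r
# Recover a group's SD from the confidence interval of its mean.
# For small groups, a t-based critical value is the more careful choice.
sd_from_ci <- function(lower, upper, n, level = 0.95, use_t = FALSE) {
  alpha <- 1 - level
  crit <- if (use_t) qt(1 - alpha / 2, df = n - 1) else qnorm(1 - alpha / 2)
  se <- (upper - lower) / (2 * crit)   # CI half-width divided by the critical value
  se * sqrt(n)                         # SD = SE * sqrt(n)
}

# n = 30 is an assumed group size for illustration only.
sd_normal <- sd_from_ci(12.3, 18.7, n = 30)
sd_t <- sd_from_ci(12.3, 18.7, n = 30, use_t = TRUE)
print(round(c(normal = sd_normal, t = sd_t), 2))
```

The two variants give slightly different SDs, which is one reason a derived value is worth flagging in the Notes column so the consensus step can record which method was used.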

You can:

Browse the extraction schema and chat with the Collection — ask questions like “which studies have the largest effect sizes?” or “compute the pooled SMD for aerobic studies” and see structured, cited answers.
Clone it into your workspace — customise the extraction instructions for your own meta-analysis topic, add columns for your discipline’s specific outcome measures, and upload your own papers.
Export as CSV — download the structured data and run the R pipeline above in your own environment.

Questions?

Try the product directly — email us with your extraction protocol and we’ll set up a closed-beta workspace with a collection schema designed for your topic on the spot.
Need more help? Email moto.mo@instill-ai.com.

The collections in this article use freely available open-access research papers selected to simulate a real systematic review workflow. Some columns (Reviewer Confidence, Notes) are filled by human reviewers after AI extraction — not all columns are AI-autofilled. This hybrid design reflects how Collection works in practice: AI handles extraction, humans handle judgment.

_This article builds on