Reproducible benchmark

11 AI models, tested on real forensic email classification

A reproducible, multi-test benchmark of 5 cloud and 6 offline AI models on the email-classification tasks that digital forensics and eDiscovery teams run every day—accuracy, multilingual handling, throughput, and cost.


Real John Podesta corpus

Cloud and offline models

English, French, Spanish, Korean

What the benchmark found

Every model beat keyword search

94%+ F1 and 99%+ recall on Test 1, across all 11 models

Cloud or fully offline

From ~$45 per 100K emails, or zero marginal cost on-premises

Multilingual, proven

Perfect scores on the Korean multi-category tasks for the top models

Published

June 3, 2026

11

AI models benchmarked

6

distinct tests

94%+

F1—every model (Test 1)

99%+

recall—every model (Test 1)

Why this matters for forensic and eDiscovery review

Keyword search and manual review leave evidence on the table. The benchmark measured how far AI classification closes that gap.

Keyword and Boolean search

⯈ Misses 60–80% of responsive documents in typical production use
⯈ Fails on euphemism, code words, and paraphrase—the vocabulary mismatch that routinely breaks Boolean queries
⯈ Often returns review sets where 70–90% of flagged documents are false positives
⯈ In the Blair and Maron field study, experts reached just 20% recall while believing they had passed 75%

AI classification (this benchmark)

⯈ Every retained model reached 99%+ recall on Test 1—miss rates below 1% at the operating point tested
⯈ Matched or exceeded the top of the published TAR 2.0 (CAL) range
⯈ Top models held precision of 90–99%, so reviewers spend far less time rejecting irrelevant material
⯈ Held up against responsive emails written specifically to defeat keyword search through paraphrase and indirection

The bottom line: At the performance levels documented here, AI classification reduces the risk of missed evidence by roughly an order of magnitude versus keyword search.
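The comparison can be sanity-checked by converting recall into a miss rate (100% minus recall). The sketch below uses only the recall figures quoted on this page. The function names are illustrative and not part of Aid4Mail or any other tool.

```python
# Miss rate is the share of responsive documents a method fails to find.
# Recall values are whole percentages taken from the figures quoted on this page.

def miss_rate_pct(recall_pct: int) -> int:
    """Percentage of responsive documents missed at a given recall."""
    return 100 - recall_pct


def miss_reduction_factor(baseline_recall_pct: int, ai_recall_pct: int) -> float:
    """How many times fewer responsive documents are missed than the baseline."""
    ai_miss = miss_rate_pct(ai_recall_pct)
    if ai_miss == 0:
        raise ValueError("AI miss rate is zero; the ratio is undefined")
    return miss_rate_pct(baseline_recall_pct) / ai_miss


# Keyword search recall of 20-40% against the 99% recall floor on Test 1.
for keyword_recall in (20, 40):
    factor = miss_reduction_factor(keyword_recall, 99)
    print(f"keyword recall {keyword_recall}%: {factor:.0f}x fewer misses")
```

At these figures the reduction comfortably clears tenfold. The exact factor on a real matter depends on where each method actually lands within its range.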

Three findings that should shape how you use AI classification

The full report runs to six tests and eleven models. These are the conclusions that matter most for case work.

AI beats keyword search—and matches top-tier TAR

Every one of the 11 retained models scored above 94% F1 on a realistic 2,000-email corpus, with the top five between 98.8% and 99.6%. All reached 99% or higher recall, meeting or exceeding the upper end of the published TAR 2.0 range.

You don’t need the cloud or a big budget

Offline models reached cloud-class accuracy at zero per-email cost and full data isolation. The cheapest cloud path classified 100,000 emails for roughly $45, while one offline model processed nearly 900,000 emails over a weekend at no marginal cost.

No single model wins—fit the model to the matter

Accuracy, cost, speed, data residency, language, and hardware all pull in different directions. The report includes a decision matrix that maps each constraint to a recommended model, so you can match the choice to the case.

Where AI classification sits against keyword search and TAR

Published performance ranges for traditional methods, drawn from the TREC Legal Track and TAR literature, alongside the results from this benchmark.

Method	Typical recall	Typical precision	Typical F1
Keyword / Boolean search	20–40%	10–79%	~25–40%
Human linear review	49–54%	18–20%	~27–28%
TAR 1.0 (SPL / SAL)	50–75%	60–80%	~55–75%
TAR 2.0 (CAL)	75–96%	80–96%	~75–96%
Aid4Mail AI (this benchmark)	99%+	90–99%	94–99.6%

Baseline ranges are synthesized from peer-reviewed eDiscovery literature in the companion report, Quantitative Performance Benchmarks for Keyword Search and Technology-Assisted Review. Benchmark figures cover the 11 retained models on Test 1 (2,000 emails, 6% prevalence).
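For readers less familiar with the F1 column: F1 is the harmonic mean of precision and recall. The short sketch below shows how the lower end of the benchmark row fits together, using the standard formula and the precision and recall endpoints from the table. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Endpoints quoted for this benchmark: recall 99%+, precision 90-99%.
low = f1_score(precision=0.90, recall=0.99)
high = f1_score(precision=0.99, recall=0.99)
print(f"F1 at 90% precision, 99% recall: {low:.1%}")
print(f"F1 at 99% precision, 99% recall: {high:.1%}")
```

The ranges in the table span several models, so their endpoints do not necessarily belong to the same model. Treat the calculation as a consistency check, not a reconstruction of any one model's score.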

Choosing a model for your workflow

No single model is best for every matter. The report maps common constraints to a recommended model—a condensed view is below.

Your constraint	Recommended model
Cloud, cost-sensitive, binary task	Gemini 3.1 Flash-Lite
Cloud, highest accuracy and coverage	Grok 4.2 Non-Reasoning
Cloud, multi-category or multilingual	Gemini 3.1 Flash-Lite (perfect on Tests 2, 3 and 4)
Cloud, premium analysis needs	Claude 4.7 Opus (low effort)
Air-gapped, multilingual accuracy-first	Qwen 3.6 27B (Dense)
Air-gapped, balanced multilingual	Qwen 3.6 35B (MoE) or Gemma 4 26B
Air-gapped, binary-task priority	Mistral Small 3.2 24B
Air-gapped, maximum English F1 (80 GB+ VRAM)	Llama 3.3 70B
Air-gapped, modest GPU (16 GB VRAM)	Ministral 3 14B

Aid4Mail provides access to every model in the benchmark, and its configuration system lets you switch between them without changing anything else in your workflow.

How the benchmark was run—and what it does not claim

Benchmark results are range indicators, not guarantees. The report states its methods and its limits plainly.

How the benchmark was run

⯈ 11 models retained from up to 40 evaluated: 5 commercial cloud, 6 offline
⯈ Six tests: four accuracy tests plus a large-scale Production Pilot and a throughput-and-cost test
⯈ Real John Podesta corpus as the unresponsive background; synthetic and translated responsive content to probe discrimination
⯈ English, French, Spanish, and Korean content to test multilingual performance
⯈ Offline models served locally via Ollama on a single RTX 5090 workstation; cloud models via provider APIs

Honest limitations

⯈ Responsive test content was synthetic, designed to exercise AI’s advantage over keyword search; real content may differ
⯈ Test 1 prevalence was 6%—very-low-prevalence matters can lower observed precision for any method
⯈ Prompt–model fit is real: the same prompt does not perform identically across models
⯈ One absence-detection task could not be reliably specified, so its classification results are withheld
⯈ Cloud pricing and model availability change; figures reflect the time of testing
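The prevalence caveat follows from simple arithmetic: when responsive emails are rare, even a small false-positive rate produces enough false hits to dilute precision. The sketch below illustrates the effect. The 0.5% false-positive rate is a hypothetical value chosen for illustration, not a figure from the benchmark, and the function name is likewise illustrative.

```python
def expected_precision(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Precision implied by recall, false-positive rate, and prevalence (all fractions)."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


RECALL = 0.99        # Test 1 recall floor quoted in this benchmark
ASSUMED_FPR = 0.005  # hypothetical: 0.5% of unresponsive emails wrongly flagged

# 6% matches Test 1; the lower prevalences are illustrative.
for prevalence in (0.06, 0.01, 0.001):
    p = expected_precision(RECALL, ASSUMED_FPR, prevalence)
    print(f"prevalence {prevalence:.1%}: precision {p:.1%}")
```

With recall and false-positive rate held fixed, precision falls as prevalence falls. This is why a validation pass on a representative sample matters before relying on the Test 1 precision figures for a very-low-prevalence matter.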

Reproduce Test 1 yourself

Don’t take our word for it. The reproduction kit contains the exact Test 1 prompt and the full 2,000-email corpus, so you can re-run the benchmark on your own AI provider, model, and hardware—then compare your figures against the published per-model results.

What’s inside

⯈ The Test 1 Aid4Mail prompt file, with the Responsive, Unresponsive, and INCONCLUSIVE category list
⯈ 2,000 emails in EML format: 1,880 real Podesta emails and 120 synthetic insider-threat emails, at 6% responsive prevalence
⯈ License and readme with full step-by-step instructions for configuring and running the test

What you need

⯈ Aid4Mail Investigator or Enterprise 6.3.0 or later (AI features are not available in the Converter edition)
⯈ A configured AI provider—a cloud API key (Google AI Studio recommended) or a local Ollama or LM Studio model
⯈ Roughly $1 or less in cloud cost on the recommended Gemini 3.1 Flash-Lite path, or zero with an offline model
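The cloud cost for the kit can be estimated by scaling the roughly $45 per 100,000 emails quoted for the cheapest cloud path. The sketch below assumes that cost scales linearly with email count and that the quoted rate applies to the recommended path; both are simplifications, since real charges depend on email length and current provider pricing.

```python
CLOUD_COST_PER_100K_USD = 45.0  # approximate figure quoted in this benchmark


def estimated_cloud_cost(email_count: int,
                         cost_per_100k: float = CLOUD_COST_PER_100K_USD) -> float:
    """Linear cost estimate in USD; assumes emails similar in size to the benchmark's."""
    return email_count * cost_per_100k / 100_000


print(f"Test 1 corpus (2,000 emails): ${estimated_cloud_cost(2_000):.2f}")
print(f"1,000,000 emails: ${estimated_cloud_cost(1_000_000):.2f}")
```

The Test 1 estimate lands just under $1, in line with the figure above.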

Download the reproduction kit

Windows installer (.exe) · 8,054 KB · use governed by the included license

When you publish your own results, identify the model, provider, tool version, prompt, material settings, and date of testing, as set out in the package license.
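To compare a re-run with the published per-model results, compute precision, recall, and F1 from your own counts. The sketch below uses made-up counts for illustration. It assumes any email not classified Responsive (including INCONCLUSIVE) counts as not flagged, which may differ from how the report scores that category. The class and field names are hypothetical.

```python
from dataclasses import dataclass

RESPONSIVE_IN_CORPUS = 120  # synthetic responsive emails in the Test 1 corpus


@dataclass
class ConfusionCounts:
    true_positives: int   # responsive emails classified Responsive
    false_positives: int  # unresponsive emails classified Responsive
    false_negatives: int  # responsive emails not classified Responsive

    @property
    def precision(self) -> float:
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0

    @property
    def f1(self) -> float:
        denom = 2 * self.true_positives + self.false_positives + self.false_negatives
        return 2 * self.true_positives / denom if denom else 0.0


# Illustrative counts only; replace with the counts from your own run.
example = ConfusionCounts(true_positives=118, false_positives=9, false_negatives=2)

# Every responsive email must be either found or missed.
assert example.true_positives + example.false_negatives == RESPONSIVE_IN_CORPUS

print(f"precision {example.precision:.1%}, recall {example.recall:.1%}, F1 {example.f1:.1%}")
```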

Guides for configuring and running the test

Read the full report

The complete benchmark—all six tests, per-model analysis, throughput and cost tables, the decision matrix, and the full methodology—is available as a free PDF. No registration required.

Download the report (PDF)

Notify me when the benchmark is updated

Models, pricing, and availability change quickly. Leave your email and we’ll send a single message when we re-run the benchmark. No other use, and you can opt out at any time.

Related research

The benchmark draws on two companion papers from Fookes Software.

Companion report · PDF

Keyword Search and TAR Performance Benchmarks

The peer-reviewed performance ranges for keyword search and Technology-Assisted Review used as baselines in this benchmark.

View report →

Methodology note · PDF

Podesta Corpus Benchmark Methodology Note

How subjective themes resist stable inter-model agreement on the Podesta corpus, with pairwise inter-rater agreement detail.

View note →

See how Aid4Mail puts these models to work in AI-powered email analysis.

See what AI classification does on your own data

Run a small validation pass on a representative sample with Aid4Mail’s free trial, then choose the model that fits your matter.

Start Free Trial Contact Sales
