11 AI models, tested on real forensic email classification
A reproducible, multi-test benchmark of 5 cloud and 6 offline AI models on the email-classification tasks that digital forensics and eDiscovery teams run every day—accuracy, multilingual handling, throughput, and cost.
What the benchmark found
Every model beat keyword search
94%+ F1 and 99%+ recall on Test 1, across all 11 models
Cloud or fully offline
From ~$45 per 100K emails, or zero marginal cost on-premises
Multilingual, proven
Perfect scores on the Korean multi-category tasks for the top models
Published: June 3, 2026
11 AI models benchmarked
6 distinct tests
94%+ F1, every model (Test 1)
99%+ recall, every model (Test 1)
Why this matters for forensic and eDiscovery review
Keyword search and manual review leave evidence on the table. The benchmark measured how far AI classification closes that gap.
Keyword and Boolean search
- Misses 60–80% of responsive documents in typical production use
- Fails on euphemism, code words, and paraphrase: the vocabulary mismatch that routinely breaks Boolean queries
- Often returns review sets where 70–90% of flagged documents are false positives
- In the Blair and Maron field study, experts reached just 20% recall while believing they had passed 75%
AI classification (this benchmark)
- Every retained model reached 99%+ recall on Test 1, a miss rate below 1% at the operating point tested
- Matched or exceeded the top of the published TAR 2.0 (CAL) range
- Top models held precision of 90–99%, so reviewers spend far less time rejecting irrelevant material
- Held up against responsive emails written specifically to defeat keyword search through paraphrase and indirection
The bottom line: At the performance levels documented here, AI classification reduces the risk of missed evidence by roughly an order of magnitude versus keyword search.
Three findings that should shape how you use AI classification
The full report runs to six tests and eleven models. These are the conclusions that matter most for case work.
AI beats keyword search—and matches top-tier TAR
Every one of the 11 retained models scored above 94% F1 on a realistic 2,000-email corpus, with the top five between 98.8% and 99.6%. All reached 99% or higher recall, meeting or exceeding the upper end of the published TAR 2.0 range.
You don’t need the cloud or a big budget
Offline models reached cloud-class accuracy at zero per-email cost and full data isolation. The cheapest cloud path classified 100,000 emails for roughly $45, while one offline model processed nearly 900,000 emails over a weekend at no marginal cost.
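The arithmetic behind those two figures can be sketched as follows. The $45 per 100,000 emails and the roughly 900,000 emails come from the text above; the 60-hour weekend window is an assumption made here for illustration, not a figure from the report, and the function names are hypothetical.

```python
# Figures quoted in the text above.
CLOUD_COST_PER_100K_USD = 45.0        # cheapest cloud path (approximate)
OFFLINE_EMAILS_PER_WEEKEND = 900_000  # "nearly 900,000", so an upper bound

# Assumption, not from the report: a weekend run lasts 60 hours
# (Friday 18:00 to Monday 06:00).
ASSUMED_WEEKEND_HOURS = 60.0


def cloud_cost_usd(email_count: int,
                   cost_per_100k: float = CLOUD_COST_PER_100K_USD) -> float:
    """Linear cost estimate for classifying `email_count` emails."""
    return email_count * cost_per_100k / 100_000


def emails_per_second(email_count: int, hours: float) -> float:
    """Average sustained throughput over a run of `hours` hours."""
    return email_count / (hours * 3600)


if __name__ == "__main__":
    print(f"Cost per email:       ${cloud_cost_usd(1):.5f}")
    print(f"2,000-email corpus:   ${cloud_cost_usd(2_000):.2f}")
    print(f"1,000,000 emails:     ${cloud_cost_usd(1_000_000):.2f}")
    rate = emails_per_second(OFFLINE_EMAILS_PER_WEEKEND, ASSUMED_WEEKEND_HOURS)
    print(f"Offline throughput:   {rate:.2f} emails/s (assumed 60 h window)")
```

Under these assumptions the cloud path works out to about $0.00045 per email, and the offline run to a little over four emails per second. Actual cloud cost depends on message length and provider pricing at the time of the run, so treat the linear estimate as a rough planning figure.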
No single model wins—fit the model to the matter
Accuracy, cost, speed, data residency, language, and hardware all pull in different directions. The report includes a decision matrix that maps each constraint to a recommended model, so you can match the choice to the case.
Where AI classification sits against keyword search and TAR
Published performance ranges for traditional methods, drawn from the TREC Legal Track and TAR literature, alongside the results from this benchmark.
| Method | Typical recall | Typical precision | Typical F1 |
|---|---|---|---|
| Keyword / Boolean search | 20–40% | 10–79% | ~25–40% |
| Human linear review | 49–54% | 18–20% | ~27–28% |
| TAR 1.0 (SPL / SAL) | 50–75% | 60–80% | ~55–75% |
| TAR 2.0 (CAL) | 75–96% | 80–96% | ~75–96% |
| Aid4Mail AI (this benchmark) | 99%+ | 90–99% | 94–99.6% |
Baseline ranges are synthesized from peer-reviewed eDiscovery literature in the companion report, Quantitative Performance Benchmarks for Keyword Search and Technology-Assisted Review. Benchmark figures cover the 11 retained models on Test 1 (2,000 emails, 6% prevalence).
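For readers who want to see how the three columns relate, here is a minimal sketch of the standard precision, recall, and F1 formulas applied to a corpus shaped like Test 1 (2,000 emails, 120 responsive). The confusion counts are hypothetical and chosen only for illustration; they are not results for any model in the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts, with "responsive" as the positive class."""
    tp: int  # responsive emails correctly flagged
    fp: int  # unresponsive emails wrongly flagged
    fn: int  # responsive emails missed
    tn: int  # unresponsive emails correctly passed over

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        responsive = self.tp + self.fn
        return self.tp / responsive if responsive else 0.0

    @property
    def f1(self) -> float:
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical outcome: 1 of 120 responsive emails missed,
    # 6 of 1,880 unresponsive emails flagged in error.
    run = Confusion(tp=119, fp=6, fn=1, tn=1874)
    print(f"precision={run.precision:.3f} "
          f"recall={run.recall:.3f} f1={run.f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it sits closer to the weaker of the two. That is why the keyword-search row shows a low F1 even where its precision range reaches high values.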
Choosing a model for your workflow
No single model is best for every matter. The report maps common constraints to a recommended model—a condensed view is below.
| Your constraint | Recommended model |
|---|---|
| Cloud, cost-sensitive, binary task | Gemini 3.1 Flash-Lite |
| Cloud, highest accuracy and coverage | Grok 4.2 Non-Reasoning |
| Cloud, multi-category or multilingual | Gemini 3.1 Flash-Lite (perfect on Tests 2, 3 and 4) |
| Cloud, premium analysis needs | Claude 4.7 Opus (low effort) |
| Air-gapped, multilingual accuracy-first | Qwen 3.6 27B (Dense) |
| Air-gapped, balanced multilingual | Qwen 3.6 35B (MoE) or Gemma 4 26B |
| Air-gapped, binary-task priority | Mistral Small 3.2 24B |
| Air-gapped, maximum English F1 (80 GB+ VRAM) | Llama 3.3 70B |
| Air-gapped, modest GPU (16 GB VRAM) | Ministral 3 14B |
Aid4Mail provides access to every model in the benchmark, and its configuration system lets you switch between them without changing anything else in your workflow.
How the benchmark was run—and what it does not claim
Benchmark results are range indicators, not guarantees. The report states its methods and its limits plainly.
How the benchmark was run
- 11 models retained from up to 40 evaluated—5 commercial cloud, 6 offline
- Six tests: four accuracy tests plus a large-scale Production Pilot and a throughput-and-cost test
- Real John Podesta corpus as the unresponsive background; synthetic and translated responsive content to probe discrimination
- English, French, Spanish, and Korean content to test multilingual performance
- Offline models served locally via Ollama on a single RTX 5090 workstation; cloud models via provider APIs
Honest limitations
- Responsive test content was synthetic, designed to exercise AI’s advantage over keyword search; real content may differ
- Test 1 prevalence was 6%; very-low-prevalence matters can lower observed precision for any method
- Prompt–model fit is real: the same prompt does not perform identically across models
- One absence-detection task could not be reliably specified, so its classification results are withheld
- Cloud pricing and model availability change; figures reflect the time of testing
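The prevalence caveat in the list above can be made concrete with a short calculation. The sketch below holds recall and false-positive rate fixed and varies only prevalence. The 99% recall echoes the Test 1 headline figure; the 0.5% false-positive rate is an assumed value picked for illustration, not a measured result.

```python
RECALL = 0.99        # in line with the Test 1 headline figure
ASSUMED_FPR = 0.005  # hypothetical: 0.5% of unresponsive emails flagged


def expected_precision(recall: float,
                       false_positive_rate: float,
                       prevalence: float) -> float:
    """Precision implied by a fixed recall and false-positive rate.

    precision = TP / (TP + FP), with TP and FP expressed as fractions
    of the whole corpus.
    """
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


if __name__ == "__main__":
    for prevalence in (0.06, 0.01, 0.001):
        p = expected_precision(RECALL, ASSUMED_FPR, prevalence)
        print(f"prevalence {prevalence:.1%}: expected precision {p:.1%}")
```

With these inputs, precision is about 93% at 6% prevalence, about 67% at 1%, and about 17% at 0.1%, even though the classifier's behavior on each individual email has not changed. The same effect applies to keyword search and TAR.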
Reproduce Test 1 yourself
Don’t take our word for it. The reproduction kit contains the exact Test 1 prompt and the full 2,000-email corpus, so you can re-run the benchmark on your own AI provider, model, and hardware—then compare your figures against the published per-model results.
What’s inside
- The Test 1 Aid4Mail prompt file, with the Responsive, Unresponsive, and INCONCLUSIVE category list
- 2,000 emails in EML format—1,880 real Podesta emails and 120 synthetic insider-threat emails, at 6% responsive prevalence
- License and readme with full step-by-step instructions for configuring and running the test
What you need
- Aid4Mail Investigator or Enterprise 6.3.0 or later (AI features are not available in the Converter edition)
- A configured AI provider: a cloud API key (Google AI Studio recommended) or a local Ollama or LM Studio model
- Roughly $1 or less in cloud cost on the recommended Gemini 3.1 Flash-Lite path, or zero with an offline model
When you publish your own results, identify the model, provider, tool version, prompt, material settings, and date of testing, as set out in the package license.
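One way to keep those details together is a small run record saved alongside your results. This is only a sketch: the field names, the `RunRecord` type, and the idea of fingerprinting the prompt with SHA-256 are assumptions made here, not a format defined by the package license, and the example values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of the details to state when publishing results."""
    model: str
    provider: str
    tool_version: str
    prompt_sha256: str       # fingerprint of the exact prompt text used
    material_settings: dict  # any settings that could affect the outcome
    test_date: str           # ISO 8601 date, e.g. "2026-06-03"


def prompt_fingerprint(prompt_text: str) -> str:
    """SHA-256 hex digest of the prompt, so readers can confirm it matches."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def to_json(record: RunRecord) -> str:
    """Serialize the record as indented JSON with stable key order."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = RunRecord(
        model="Gemini 3.1 Flash-Lite",
        provider="Google AI Studio",
        tool_version="Aid4Mail Investigator 6.3.0",
        prompt_sha256=prompt_fingerprint("<contents of the Test 1 prompt file>"),
        material_settings={"note": "list any non-default settings here"},
        test_date="2026-06-03",
    )
    print(to_json(record))
```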
Guides for configuring and running the test
Read the full report
The complete benchmark—all six tests, per-model analysis, throughput and cost tables, the decision matrix, and the full methodology—is available as a free PDF. No registration required.
Notify me when the benchmark is updated
Models, pricing, and availability change quickly. Leave your email and we’ll send a single message when we re-run the benchmark. No other use, and you can opt out at any time.
Related research
The benchmark draws on two companion papers from Fookes Software.
Keyword Search and TAR Performance Benchmarks
The peer-reviewed performance ranges for keyword search and Technology-Assisted Review used as baselines in this benchmark.
View report →
Methodology note · PDF
Podesta Corpus Benchmark Methodology Note
How subjective themes resist stable inter-model agreement on the Podesta corpus, with pairwise inter-rater agreement detail.
View note →
See how Aid4Mail puts these models to work in AI-powered email analysis.
See what AI classification does on your own data
Run a small validation pass on a representative sample with Aid4Mail’s free trial, then choose the model that fits your matter.