We Benchmarked 11 AI Models for Email Classification

AI classification is now a standard part of the eDiscovery and digital forensics conversation, but the claims around it are rarely backed by numbers you can check. So we ran the test ourselves. Over several months we benchmarked 11 AI models—5 commercial cloud models and 6 offline models—on realistic email-classification workloads, then measured them against the published performance ranges for keyword search and Technology-Assisted Review (TAR).

The 11 retained models were selected from up to 40 that we evaluated. We ran six separate tests covering classification accuracy, multi-category discrimination, Korean-language performance, throughput, and cost at production scale. The non-responsive background came from the real John Podesta email corpus; the responsive signal was synthetic content written specifically to defeat keyword detection.

Three findings stand out.

Every model beat keyword search

The headline result is consistency. Every one of the 11 retained models scored above 94% F1 on a realistic 2,000-email corpus, with the top five clustering between 98.8% and 99.6%. Just as important, every model reached 99% or higher recall, meaning miss rates of 1% or less at the operating point we tested.
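For readers who want to check figures like these against their own runs, here is a minimal sketch of how recall, precision, and F1 relate to raw classification counts. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for the 'responsive' class on a labeled test set."""
    tp: int  # responsive emails correctly flagged
    fp: int  # non-responsive emails wrongly flagged
    fn: int  # responsive emails missed


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    responsive = c.tp + c.fn
    return c.tp / responsive if responsive else 0.0


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def miss_rate(c: Confusion) -> float:
    # Share of responsive emails the classifier failed to flag.
    return 1.0 - recall(c)


# Hypothetical counts, NOT benchmark data: 200 responsive emails,
# 198 found, 2 missed, and 10 false alarms.
example = Confusion(tp=198, fp=10, fn=2)

if __name__ == "__main__":
    print(f"precision={precision(example):.4f}")
    print(f"recall={recall(example):.4f}")
    print(f"f1={f1(example):.4f}")
    print(f"miss rate={miss_rate(example):.4f}")
```

Because F1 blends precision and recall, a model can hold 99% recall and still land nearer the bottom of the F1 range if it flags more non-responsive emails.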

Set that against the published baselines. Keyword and Boolean search typically miss 60–80% of responsive documents in production use. CAL-based TAR 2.0, the best-evidenced production methodology, reaches roughly 75–96% recall. Every model in our benchmark met or exceeded the top of the TAR 2.0 range, and all of them cleared keyword search by a wide margin.

That matters because the responsive emails in the test were deliberately written with paraphrase, indirection, and euphemism in place of the trigger terms a Boolean query would catch—the exact vocabulary-mismatch tactics that break keyword search in real investigations. The AI models were not fooled.

Cloud or fully offline—accuracy without the bill

The second finding is that strong accuracy does not require a large budget or a cloud connection. Offline models running on a single workstation reached cloud-class accuracy at zero per-email cost and with full data isolation, which matters for air-gapped and data-residency-constrained matters.

On the cloud side, the cheapest path classified 100,000 emails for roughly $45. The most expensive cloud model cost more than 25 times that for comparable classification accuracy—a reminder that price and accuracy are only loosely correlated above a certain quality threshold. On the offline side, the fastest model processed nearly 900,000 emails over a single unattended weekend at no marginal cost once the hardware was in place.
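The arithmetic behind these figures is easy to reproduce. The sketch below scales the quoted cloud cost to other corpus sizes and converts the weekend figure into a rate. It assumes pricing scales linearly with email count, and it assumes a "weekend" of 60 hours (Friday evening to Monday morning); neither assumption comes from the report, so substitute your own values.

```python
# Figures quoted in this post (approximate).
CHEAPEST_COST_PER_100K_USD = 45.0   # "roughly $45" per 100,000 emails
PRICE_MULTIPLE_LOWER_BOUND = 25     # "more than 25 times" the cheapest path
WEEKEND_EMAILS = 900_000            # "nearly 900,000" emails

# ASSUMPTION (not from the report): an unattended weekend is 60 hours.
ASSUMED_WEEKEND_HOURS = 60


def cost_for(emails: int, cost_per_100k_usd: float) -> float:
    """Estimated cloud cost in USD, assuming linear pricing."""
    return emails / 100_000 * cost_per_100k_usd


def emails_per_second(total_emails: int, hours: float) -> float:
    """Average sustained throughput over the given number of hours."""
    return total_emails / (hours * 3600)


if __name__ == "__main__":
    per_email = CHEAPEST_COST_PER_100K_USD / 100_000
    print(f"cheapest path, per email: ${per_email:.5f}")

    expensive = CHEAPEST_COST_PER_100K_USD * PRICE_MULTIPLE_LOWER_BOUND
    print(f"most expensive model, per 100k: more than ${expensive:,.0f}")

    print(f"500,000 emails, cheapest path: ${cost_for(500_000, CHEAPEST_COST_PER_100K_USD):,.0f}")

    rate = emails_per_second(WEEKEND_EMAILS, ASSUMED_WEEKEND_HOURS)
    print(f"offline throughput: about {rate:.1f} emails/second")
```

Under those assumptions, the cheapest cloud path works out to a small fraction of a cent per email, and the fastest offline model sustains a few emails per second.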

No single model wins

The third finding is the one that resists a tidy headline: no single model is best for every task. Accuracy, cost, speed, data residency, language coverage, and available hardware all pull in different directions. A 500,000-email triage run on a routine regulatory matter has different requirements than a small, high-stakes matter under a strict data-residency obligation.

To make that practical rather than abstract, the full report includes a decision matrix that maps common constraints—cost-sensitive cloud triage, air-gapped multilingual work, modest-GPU deployments, premium analysis needs—to a recommended model for each. Because Aid4Mail provides access to every model in the benchmark and lets you switch between them without changing anything else in your workflow, the matrix translates directly into a configuration choice.

What this means for your matters

The benchmark is evidence that AI classification has matured into a defensible, measurable alternative to keyword culling—and a strong complement to TAR. But two caveats are worth stating plainly, because the report does not hide them. The responsive test content was synthetic, designed to exercise AI’s discrimination advantage, so real-world content may behave differently. And prompt quality matters enormously: one task we could not reliably specify produced wildly divergent results across models, which is why we withheld its accuracy figures entirely.

The takeaway for practitioners is not “AI replaces judgment.” It is that a well-specified prompt on a tractable task now produces reliable, defensible results across most current models—and that you should validate your prompt on a representative sample before relying on it for a decision of consequence.
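One way to make that validation step concrete is to measure recall on a hand-labeled sample and attach a confidence interval, so a small sample is not mistaken for certainty. The sketch below uses a Wilson score interval; this is our illustrative choice rather than a method prescribed in the report, and the function names and sample labels are hypothetical.

```python
import math
from typing import Sequence, Tuple


def sample_recall(truth: Sequence[bool], predicted: Sequence[bool]) -> Tuple[int, int]:
    """Return (found, total_responsive) for a hand-labeled sample."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    total = sum(1 for t in truth if t)
    found = sum(1 for t, p in zip(truth, predicted) if t and p)
    return found, total


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical sample: 100 responsive emails, one of them missed,
    # plus 50 non-responsive emails that were correctly left alone.
    truth = [True] * 100 + [False] * 50
    predicted = [True] * 99 + [False] + [False] * 50

    found, total = sample_recall(truth, predicted)
    low, high = wilson_interval(found, total)
    print(f"recall {found}/{total} = {found / total:.2%}")
    print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

The interval narrows as the number of responsive emails in the sample grows, which is a practical reason to make the validation sample large enough for the decision it supports.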

And because reproducibility matters in this field, we’ve published a downloadable kit containing the exact Test 1 prompt and the full 2,000-email corpus, so you can re-run the benchmark on your own provider and hardware and compare your figures against ours.

Read the full benchmark analysis

All six tests, per-model results, throughput and cost tables, and the complete decision matrix—free, no registration required.

For the methodology, the per-model breakdowns, and the side-by-side comparison against keyword search and TAR, see the AI Model Benchmark page, or read more about how Aid4Mail puts these models to work in AI-powered email analysis.

by Eric Fookes
