Technical Guide

EDRM DupeID and MIH Guide

Master cross-platform email deduplication with EDRM Message Identification Hash standards. Learn how Aid4Mail’s MIH+ implementation enables seamless deduplication across forensic and eDiscovery platforms—saving time, money, and eliminating vendor lock-in.

  • Email Coverage: 100%
  • Cost Savings: 30–60%
  • Certified Standard: EDRM
  • Vendor Freedom: Cross-Platform

In digital forensics and eDiscovery, duplicate emails can inflate costs by 30–60%. When investigations span multiple platforms—RelativityOne, Nuix, EnCase, X-Ways—traditional tools force you to reprocess everything through a single vendor, wasting time and money.

The Electronic Discovery Reference Model (EDRM) solved this with the DupeID project, creating a universal standard for email identification. Aid4Mail was one of the first forensic tools to implement this standard—and went further with MIH+, an enhancement that guarantees deduplication coverage for 100% of emails, not just 80–90%.

Why This Matters

No vendor lock-in — Use the best tool for each task
Massive cost savings — Reduce review by 30–60%
Cross-platform compatibility — Share datasets seamlessly
Industry standard — EDRM-certified methodology

This guide explains the EDRM MIH standard, Aid4Mail’s MIH+ implementation, and how to use these tools to transform your email investigations.

1. Understanding EDRM and the DupeID Project

What is the EDRM?

The Electronic Discovery Reference Model (EDRM) is a globally recognized framework that defines best practices for handling electronic data in legal proceedings. Developed by industry leaders, it guides organizations through eight distinct phases:

  1. Identification: Locating potential sources of ESI
  2. Preservation: Ensuring data integrity
  3. Collection: Gathering relevant data
  4. Processing: Preparing data for review
  5. Review: Evaluating data for relevance
  6. Analysis: Deeper examination of evidence
  7. Production: Delivering evidence in required formats
  8. Presentation: Displaying evidence in court

The DupeID Project: Solving Deduplication

In February 2023, EDRM launched the Duplicate Identification (DupeID) Project, led by Beth Patterson. Its mission: create a standardized, cross-platform method for generating unique identifiers for email messages.

Goals of DupeID:

  • Enable consistent email identification across different systems
  • Support deduplication without vendor lock-in
  • Facilitate cross-referencing and comparison of datasets
  • Reduce costs and timelines in investigations
  • Create an open standard complementing proprietary methods

The outcome was the EDRM Message Identification Hash (MIH)—a simple yet elegant solution to a decades-old problem.

2. The Cross-Platform Deduplication Challenge

The Problem: Vendor Lock-In and Data Redundancy

In multi-custodian investigations involving multiple email platforms, duplicate emails dramatically inflate data volumes. A typical case might involve:

  • Multiple email platforms (Gmail, Microsoft 365, Yahoo, IMAP servers)
  • Various mailbox formats (PST, OST, mbox, EML, MSG)
  • Different custodian accounts across organizations
  • Historical archives from legacy systems

The result? Massive redundancy. The same email thread might appear dozens of times across different custodians, formats, and systems.

The Historical Solution: Costly Reprocessing

Historically, specialized tools offered deduplication—but only within their own proprietary ecosystems. Each vendor used unique algorithms that couldn’t communicate across platforms.

The traditional approach required:

  1. Collect data from all sources
  2. Ingest everything into a single platform
  3. Deduplicate within that platform’s ecosystem
  4. Accept vendor lock-in or face costly data migration

The Cost of the Status Quo

If you needed to compare datasets from different vendors—Nuix, RelativityOne, and X-Ways—the only option was to reprocess all data through a single platform.

  • Weeks of additional processing time
  • Substantially higher infrastructure costs
  • Massive storage requirements
  • Vendor dependency for entire case lifecycle

The Real-World Impact

Consider a typical multi-custodian investigation:

10 custodians with 50,000 emails each = 500,000 total emails
Estimated duplication rate: 30–60% (industry standard)
Without cross-platform deduplication: Review 500,000 emails
With cross-platform deduplication: Review 200,000–350,000 emails

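Cost Implications:

• Review at $1–$3 per email without deduplication (500,000 emails) = $500,000–$1,500,000

• Review with deduplication (200,000–350,000 emails) = $200,000–$1,050,000

💰 Savings: $150,000–$900,000 on a single case, depending on the duplication rate and the per-email review cost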
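The arithmetic behind these figures can be reproduced with a short script. This is only a sketch: it pairs each end of the quoted duplication range with each end of the quoted per-email cost, and the function and variable names are illustrative.

```python
from itertools import product

TOTAL_EMAILS = 500_000            # 10 custodians x 50,000 emails
DUPLICATION_RATES = (0.30, 0.60)  # low and high ends of the quoted range
COST_PER_EMAIL = (1, 3)           # USD, low and high ends of the quoted range


def review_savings(total, dup_rate, cost):
    """Return (emails left to review, review cost, savings vs. no deduplication)."""
    remaining = round(total * (1 - dup_rate))
    return remaining, remaining * cost, (total - remaining) * cost


# One entry per combination of duplication rate and per-email cost.
scenarios = {
    (rate, cost): review_savings(TOTAL_EMAILS, rate, cost)
    for rate, cost in product(DUPLICATION_RATES, COST_PER_EMAIL)
}

for (rate, cost), (remaining, review_cost, saved) in scenarios.items():
    print(f"{rate:.0%} duplicates at ${cost}/email: "
          f"review {remaining:,} emails for ${review_cost:,}, saving ${saved:,}")
```

The four combinations span savings of $150,000 to $900,000. The $300,000 and $450,000 figures correspond to 60% duplication at $1 per email and 30% duplication at $3 per email, respectively.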

The lack of a standardized, cross-platform method for email identification was costing the industry billions of dollars annually.

3. The EDRM Message Identification Hash (MIH)

What is the EDRM MIH?

The EDRM MIH is an MD5 hash value generated from the Message-ID field in an email’s SMTP header:

EDRM MIH = MD5(Message-ID)

Why the Message-ID Field?

The Message-ID is a unique identifier assigned by the originating mail client or server when a message is sent. Defined in RFC 822 (and later RFC 2822 and RFC 5322), it is designed to be globally unique:

Message-ID: <20250315123045.abc123@mail.example.com>

Key characteristics:

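  • Designed to be globally unique across all email systems
  • Assigned at message creation time
  • Preserved when a message is delivered to multiple recipients or migrated between systems (a reply or forward is a new message with its own Message-ID)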
  • Present in the vast majority of received emails

By hashing this field with MD5, the EDRM MIH creates a 128-bit fingerprint that uniquely identifies an email across any platform or vendor.
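As a minimal sketch, the formula can be expressed in Python with the standard `hashlib` module. The function name `edrm_mih` and the `strip_angle_brackets` option are illustrative assumptions: this guide does not say how the Message-ID value is normalized before hashing (for example, whether the surrounding angle brackets are kept), so check the EDRM specification before relying on matching values.

```python
import hashlib
from typing import Optional


def edrm_mih(message_id: Optional[str], strip_angle_brackets: bool = False) -> Optional[str]:
    """Return MD5(Message-ID) as 32 hex characters, or None if the field is absent.

    Normalization is an assumption of this sketch: surrounding whitespace is
    always trimmed, and angle brackets are removed only when requested.
    """
    if message_id is None or not message_id.strip():
        return None
    value = message_id.strip()
    if strip_angle_brackets:
        value = value.strip("<>")
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# The sample Message-ID shown above yields a 32-character hexadecimal string.
print(edrm_mih("<20250315123045.abc123@mail.example.com>"))
```

Two tools only produce the same value for the same email if they normalize the Message-ID identically, which is why the normalization step is worth confirming.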

The MIH’s Limitation: Null Values

While elegant, the EDRM MIH specification has one significant limitation:

Critical Limitation

Emails without a Message-ID field produce a null value.

This affects:

  • Draft emails (not yet sent, no Message-ID assigned)
  • Outgoing messages (some systems don’t preserve Message-ID)
  • Corrupted or incomplete email headers
  • Certain proprietary email formats

📊 Impact: In a typical collection, 10–20% of messages may lack a Message-ID, making them unidentifiable using standard EDRM MIH alone.

This creates a significant gap in cross-platform deduplication capabilities—a gap that Aid4Mail’s MIH+ was designed to close.

4. Aid4Mail’s MIH+ Implementation

Introducing MIH+: Guaranteed Non-Null Hash Values

To address the limitations of EDRM MIH while maintaining full compatibility with the standard, Aid4Mail developed MIH+—an enhanced variant that guarantees a non-null value for every email.

MIH+ Algorithm

1. For emails WITH a Message-ID field:

MIH+ = EDRM MIH (identical)

MIH+ = MD5(Message-ID)

2. For emails WITHOUT a Message-ID:

MIH+ uses alternative metadata:

Preferred method: MIH+ = MD5(sender + date + subject)
Fallback method (if sender, date, or subject missing): MIH+ = MD5(entire SMTP header)

How Aid4Mail Generates MIH+ for Messages Without Message-ID

When the Message-ID field is missing, Aid4Mail constructs a hash source using available metadata in a specific order of precedence:

Sender Field

  1. From field
  2. Sender field
  3. Reply-To field

Date Field

  1. Date field
  2. Most recent Received field (topmost)

Subject Field

Subject field (no fallback)

Example: For a draft email without a Message-ID:

From: alice@company.com
Date: 2025-03-15 14:30:00
Subject: Q1 Budget Review

MIH+ = MD5("alice@company.com" + "2025-03-15 14:30:00" + "Q1 Budget Review")

This approach ensures:

  • Every email gets a unique identifier
  • Cross-platform comparison is always possible
  • Deduplication can occur for all message types
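The order of precedence described above can be sketched in Python using the standard `email` package. This is an approximation under stated assumptions, not Aid4Mail’s implementation: the function name `mih_plus_sketch`, plain concatenation with no separator, and whitespace trimming are all assumptions, so values from this sketch should not be expected to match Aid4Mail’s output for messages without a Message-ID.

```python
import hashlib
from email import message_from_string
from typing import Iterable, Optional


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def _first_present(msg, names: Iterable[str]) -> Optional[str]:
    """Return the first non-empty header value among `names`, in order."""
    for name in names:
        value = msg.get(name)
        if value and value.strip():
            return value.strip()
    return None


def mih_plus_sketch(raw_headers: str) -> str:
    """Hash a header block following the precedence described in the text."""
    msg = message_from_string(raw_headers)

    # 1. Message-ID present: identical to MD5(Message-ID).
    message_id = _first_present(msg, ["Message-ID"])
    if message_id:
        return _md5_hex(message_id)

    # 2. Preferred substitute: sender + date + subject.
    sender = _first_present(msg, ["From", "Sender", "Reply-To"])
    date = _first_present(msg, ["Date"])
    if date is None:
        received = msg.get_all("Received")  # header order, topmost first
        if received:
            date = received[0].strip()
    subject = _first_present(msg, ["Subject"])
    if sender and date and subject:
        # Assumption: fields are concatenated without a separator.
        return _md5_hex(sender + date + subject)

    # 3. Last resort: hash the entire header block as given.
    return _md5_hex(raw_headers)


draft = (
    "From: alice@company.com\n"
    "Date: 2025-03-15 14:30:00\n"
    "Subject: Q1 Budget Review\n"
    "\n"
)
print(mih_plus_sketch(draft))
```

Because the substitute hash depends on exact header text, any tool that rewrites the From, Date, or Subject values would produce a different result for the same message.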

MIH+ vs. Standard MIH: Compatibility

Critical Compatibility Point

  • For emails with a Message-ID, MIH+ produces identical values to EDRM MIH
  • This ensures full interoperability with other vendors supporting EDRM MIH
  • Only emails without a Message-ID produce different values—and in those cases, standard MIH would have returned null anyway

✅ Result: Aid4Mail’s MIH+ extends the EDRM standard without breaking compatibility, enabling true cross-platform deduplication for 100% of emails rather than just 80–90%.

Performance-Optimized Architecture

While MIH+ provides critical compatibility and cross-platform identification capabilities, Aid4Mail uses a performance-optimized approach for actual deduplication operations during processing.

Why not use MIH+ directly for deduplication?

MD5 hash generation and comparison, while reliable, is computationally expensive when processing hundreds of thousands or millions of emails. Aid4Mail implements a two-tier hashing system:

High-Speed Int64 Hash

Purpose: Internal deduplication

  • Lightning-fast comparisons
  • 64-bit integer format
  • Optimal memory usage
  • Handles massive datasets

MIH+ MD5 Hash

Purpose: Cross-platform compatibility

  • Export & metadata
  • Search & file naming
  • EDRM compliance
  • Vendor interoperability

Benefits of this architecture:

  • 10× faster deduplication during processing
  • Minimal memory footprint for large collections
  • Full EDRM MIH+ compatibility for cross-platform workflows
  • Best of both worlds: performance AND interoperability
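The two-tier idea can be illustrated with a small Python sketch. The guide does not say which 64-bit hash function Aid4Mail uses, so an 8-byte BLAKE2b digest stands in for it here, and all function names are hypothetical. The sketch also ignores the possibility of 64-bit collisions, which a production system would need to handle.

```python
import hashlib
from typing import Iterable, List


def fast_key(identity: str) -> int:
    """Map an identity string to a signed 64-bit integer (stand-in hash)."""
    digest = hashlib.blake2b(identity.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def deduplicate(identities: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each identity, comparing 64-bit keys only."""
    seen = set()
    unique = []
    for identity in identities:
        key = fast_key(identity)
        if key not in seen:
            seen.add(key)
            unique.append(identity)
    return unique


def export_hashes(identities: Iterable[str]) -> List[str]:
    """Compute the MD5 hex form only for items that survive deduplication."""
    return [hashlib.md5(i.encode("utf-8")).hexdigest() for i in identities]


message_ids = ["<a@example.com>", "<b@example.com>", "<a@example.com>"]
unique_ids = deduplicate(message_ids)
print(len(unique_ids), export_hashes(unique_ids))
```

The design point is that the integer keys are small and cheap to compare in memory, while the MD5 hex strings are produced only when values need to leave the tool.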

5. Using MIH+ in Aid4Mail

Aid4Mail provides comprehensive MIH+ support across multiple features, enabling forensic examiners and eDiscovery professionals to leverage EDRM standards throughout their workflows.

1. Search and Filtering on MIH+ Values

Available in: Aid4Mail Investigator and Enterprise editions

The MIH_Plus search operator enables precise filtering based on MIH+ hash values. This is particularly powerful when working with datasets from multiple vendors.

Basic syntax:

MIH_Plus:7b7e8488d0b11ff6dd30064fa5ff79c1

Advanced syntax with search lists:

MIH_Plus:{exact=C:\Cases\Case-001\MIH-List.txt}

Use cases:

  1. Deduplication across vendors: Import MIH+ values from RelativityOne, Nuix, or other platforms and exclude matching emails from Aid4Mail processing
  2. Targeted collection: Use MIH+ lists to identify and collect specific emails across multiple custodians
  3. Cross-reference verification: Confirm that specific messages exist in both your dataset and a vendor’s production
  4. Privileged document tracking: Maintain MIH+ lists of privileged communications and automatically filter them across all custodians

Example Workflow:

  1. Export MIH+ values from Relativity for already-reviewed emails
  2. Save as MIH-Reviewed.txt
  3. In Aid4Mail, use: NOT MIH_Plus:{exact=C:\MIH-Reviewed.txt}
  4. Process only emails not yet reviewed in Relativity

Result: Eliminates redundant processing and review, saving significant time and cost.
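Before handing a list exported from another platform to a search list, it can help to sanity-check it. The sketch below keeps only well-formed 32-character hexadecimal values and removes repeats. Lowercasing is an assumption of this sketch, since the guide does not say whether list matching is case-sensitive, and the function name is illustrative.

```python
import re
from typing import Iterable, List, Tuple

MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")


def clean_mih_list(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (valid, rejected): unique lowercase hashes in first-seen order,
    plus any non-empty lines that are not 32 hexadecimal characters."""
    valid, rejected, seen = [], [], set()
    for raw in lines:
        value = raw.strip()
        if not value:
            continue  # skip blank lines
        if not MD5_HEX.fullmatch(value):
            rejected.append(value)
            continue
        value = value.lower()
        if value not in seen:
            seen.add(value)
            valid.append(value)
    return valid, rejected


sample = [
    "7B7E8488D0B11FF6DD30064FA5FF79C1\n",
    "7b7e8488d0b11ff6dd30064fa5ff79c1\n",
    "\n",
    "not-a-hash\n",
]
valid, rejected = clean_mih_list(sample)
print(valid, rejected)
```

In practice the input lines would come from the exported text file, and the cleaned values would be written back out one per line.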

2. File Naming with MIH+ Signatures

When exporting emails to .eml, .msg, or .txt formats, Aid4Mail can use MIH+ values as file names.

Advantages:

  • Consistent naming: Same email always has the same filename across all exports
  • Cross-platform identification: Files can be matched across different processing runs
  • No length limitations: Unlike subject-based naming, MIH+ produces fixed-length 32-character names
  • Special character handling: No illegal characters or truncation issues
  • Deduplication at OS level: Operating system tools can identify duplicates by filename

How to enable:

In Aid4Mail session settings:

Target > File name > Use MIH+ signature

Resulting file names:

7b7e8488d0b11ff6dd30064fa5ff79c1.eml
3d4f2a1e9c8b7a6d5e4f3a2b1c0d9e8f.msg
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.txt
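Because the file name is the hash, duplicates across separate export folders can be found by name alone. The following sketch groups files by name without extension; the folder layout and function name are hypothetical, and it treats files with the same hash but different extensions (for example .eml and .msg) as the same message.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_mih_name(*export_dirs) -> Dict[str, List[Path]]:
    """Return {hash: [paths]} for every hash-named file seen more than once."""
    groups = defaultdict(list)
    for directory in export_dirs:
        for path in Path(directory).rglob("*"):
            if path.is_file():
                # The stem is the file name without its extension.
                groups[path.stem.lower()].append(path)
    return {stem: paths for stem, paths in groups.items() if len(paths) > 1}
```

Calling `group_by_mih_name("export-run-1", "export-run-2")` on two export folders would return only the hashes present in more than one file, along with the paths where they were found.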

3. Template Token: {MIH_Plus}

The {MIH_Plus} template token inserts MIH+ values into custom file and folder names, email headers, and metadata exports.

Example uses:

Custom file naming:

{CustodianName}_{MIH_Plus}.eml

Result: Alice-Smith_7b7e8488d0b11ff6dd30064fa5ff79c1.eml

Metadata extraction to CSV:

Column Configuration: Subject, From, Date, MIH_Plus

Adding X-EDRM-MIH header to emails:

Email Header Configuration: Include X-EDRM-MIH field
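For readers who want to see what such a header looks like outside Aid4Mail, here is a minimal sketch using Python’s standard `email` package. It covers only messages that already have a Message-ID and writes the plain MD5 hex digest as the header value. The exact value format Aid4Mail writes is not described in this guide, so that detail is an assumption.

```python
import hashlib
from email import message_from_string


def add_mih_header(raw_message: str) -> str:
    """Return the message with an X-EDRM-MIH header added when possible."""
    msg = message_from_string(raw_message)
    message_id = msg.get("Message-ID")
    if message_id and message_id.strip() and "X-EDRM-MIH" not in msg:
        digest = hashlib.md5(message_id.strip().encode("utf-8")).hexdigest()
        msg["X-EDRM-MIH"] = digest  # appended after the existing headers
    return msg.as_string()


raw = (
    "Message-ID: <20250315123045.abc123@mail.example.com>\n"
    "Subject: Q1 Budget Review\n"
    "\n"
    "Body text\n"
)
print(add_mih_header(raw))
```

Messages without a Message-ID are returned unchanged by this sketch.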

4. Metadata Extraction

Aid4Mail’s Column Configuration Editor includes the EDRM.MIH_Plus token for extracting MIH+ values to CSV, XML, JSON, and TSV files.

Typical metadata export:

Subject,From,Date,MIH_Plus,Folder
Q1 Budget Review,alice@company.com,2025-03-15,7b7e8488...,Inbox
Re: Q1 Budget Review,bob@company.com,2025-03-16,3d4f2a1e...,Sent Items

This enables:

  • Cross-platform deduplication in Excel or databases
  • Custom analysis scripts using MIH+ as the join key
  • Import into other tools for further processing
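As an illustration of using MIH+ as a join key, the sketch below filters a metadata export against a list of already-reviewed hashes using Python’s `csv` module. The column name `MIH_Plus` follows the export shown above; the sample rows reuse the example hash values from this guide, and the function name is illustrative.

```python
import csv
import io
from typing import Dict, Iterable, List

SAMPLE_CSV = """Subject,From,Date,MIH_Plus,Folder
Q1 Budget Review,alice@company.com,2025-03-15,7b7e8488d0b11ff6dd30064fa5ff79c1,Inbox
Re: Q1 Budget Review,bob@company.com,2025-03-16,3d4f2a1e9c8b7a6d5e4f3a2b1c0d9e8f,Sent Items
"""


def exclude_reviewed(metadata_csv: str, reviewed_hashes: Iterable[str]) -> List[Dict[str, str]]:
    """Return the rows whose MIH_Plus value is not in the reviewed list."""
    reviewed = {h.strip().lower() for h in reviewed_hashes if h.strip()}
    reader = csv.DictReader(io.StringIO(metadata_csv))
    return [row for row in reader if row["MIH_Plus"].strip().lower() not in reviewed]


remaining = exclude_reviewed(SAMPLE_CSV, ["7B7E8488D0B11FF6DD30064FA5FF79C1"])
for row in remaining:
    print(row["Subject"], row["MIH_Plus"])
```

The same set-membership test works equally well as a database join or a spreadsheet lookup on the hash column.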

5. Exporting MIH+ Lists

Aid4Mail can generate plain-text lists of MIH+ values—ideal for sharing with other platforms or vendors.

How to export:

  1. Select Plain Text as target format
  2. Enable Export to a single text file
  3. Under Email Header Configuration, choose Only EDRM MIH values
  4. Apply desired filters to define scope
  5. Process

Result: A text file with one MIH+ value per line

7b7e8488d0b11ff6dd30064fa5ff79c1
3d4f2a1e9c8b7a6d5e4f3a2b1c0d9e8f
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6

Use cases:

  • Share with opposing counsel for agreed deduplication
  • Import into Relativity, Nuix, or other platforms
  • Create privilege logs or production indexes
  • Document chain of custody
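Once two parties have exchanged lists, cross-referencing them is a matter of set operations. The sketch below assumes each list is a sequence of hash strings, one per line as in the export above; the function and key names are illustrative.

```python
from typing import Dict, Iterable, List


def compare_mih_lists(ours: Iterable[str], theirs: Iterable[str]) -> Dict[str, List[str]]:
    """Report which hashes appear in both lists and which appear in only one."""
    a = {h.strip().lower() for h in ours if h.strip()}
    b = {h.strip().lower() for h in theirs if h.strip()}
    return {
        "in_both": sorted(a & b),
        "only_ours": sorted(a - b),
        "only_theirs": sorted(b - a),
    }


ours = ["7b7e8488d0b11ff6dd30064fa5ff79c1", "3d4f2a1e9c8b7a6d5e4f3a2b1c0d9e8f"]
theirs = ["3d4f2a1e9c8b7a6d5e4f3a2b1c0d9e8f", "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"]
report = compare_mih_lists(ours, theirs)
print({name: len(values) for name, values in report.items()})
```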

6. Real-World Benefits and ROI

1. Cost Savings

Reduced Review Costs

  • 30–60% fewer emails to review
  • $1–$3 per email cost eliminated for duplicates
  • Faster case resolution

Example Case:

500,000 emails with 40% duplication, reviewed at $2 per email

Savings: 200,000 fewer emails × $2 = $400,000

Reduced Hosting Costs

  • 30–60% smaller datasets
  • Lower cloud platform fees
  • Reduced infrastructure needs

Hosting Cost:

$27,000–$45,000 per TB annually

Reducing 1 TB to 100 GB saves roughly $24,000–$41,000 per year

Eliminated Reprocessing Costs

  • No need to reprocess datasets from multiple vendors
  • Avoid costly platform lock-in
  • Maintain flexibility across tools and systems
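The review and hosting figures in this section can be checked with a few lines of Python. The $2 per-email review cost is the figure used for this example elsewhere in the guide, 100 GB is treated as 0.1 TB, and the function names are illustrative.

```python
def review_savings(total_emails: int, duplication_rate: float, cost_per_email: float) -> float:
    """Review cost avoided by not reviewing the duplicate share of a collection."""
    return round(total_emails * duplication_rate) * cost_per_email


def hosting_savings(cost_per_tb_year: float, tb_before: float, tb_after: float) -> float:
    """Annual hosting cost avoided by shrinking a hosted dataset."""
    return cost_per_tb_year * (tb_before - tb_after)


# 500,000 emails, 40% duplicates, $2 per email reviewed.
print(review_savings(500_000, 0.40, 2))

# 1 TB reduced to 100 GB at the low and high ends of the hosting price range.
print(round(hosting_savings(27_000, 1.0, 0.1)), round(hosting_savings(45_000, 1.0, 0.1)))
```

The first call gives $400,000; the second gives $24,300 and $40,500, which round to the $24,000–$41,000 range quoted above.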

2. Time Efficiency

  • Faster Investigations: Immediate deduplication across all datasets without waiting for single-platform reprocessing
  • Accelerated Review: Smaller, deduplicated datasets enable focus on unique, responsive content
  • Streamlined Workflows: Consistent identifiers across all tools simplify communication with co-counsel

3. Improved Accuracy and Defensibility

Complete Coverage

  • MIH+ handles 100% of emails
  • No gaps in deduplication
  • Comprehensive chain of custody

EDRM Compliance

  • Industry-standard identification
  • Recognized by courts
  • Defensible in proceedings

Cross-Platform Integrity

  • Same hash values across systems
  • Verifiable methodology
  • Transparent and auditable

4. Flexibility and Vendor Independence

  • Multi-Platform Workflows: Use best-in-class tools for each task without forced platform lock-in
  • Collaborative Investigations: Share datasets with confidence and coordinate with multiple law firms
  • Future-Proofing: EDRM standard ensures long-term compatibility and protects investment in processed data

7. Real-World Use Cases

Multi-Vendor Litigation

Scenario:

A law firm handling complex litigation involving:

  • 20 custodians across three companies
  • Email data already processed in RelativityOne for Company A
  • Company B’s data in Nuix Workstation
  • Company C using internal tools

Challenge:

How to deduplicate across all three datasets without reprocessing?

Solution with Aid4Mail MIH+:

  1. Export MIH+ lists from RelativityOne (Company A)
  2. Export MIH+ lists from Nuix (Company B)
  3. Process Company C data with Aid4Mail, excluding matches
  4. Ingest only unique Company C emails into review platform

Result:

  • ✅ 60% reduction in Company C review volume
  • ✅ $300,000 saved in hosting and review costs
  • ✅ Two weeks faster case resolution
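Steps 1 to 3 of the solution above amount to combining the exported lists into a single exclusion file. A minimal sketch of that merge step follows; the file names in the usage note are hypothetical, and lowercasing the values is an assumption of the sketch.

```python
from pathlib import Path
from typing import Iterable


def merge_mih_files(list_paths: Iterable, output_path) -> int:
    """Merge one-hash-per-line files into a single de-duplicated file.

    Returns the number of unique hashes written.
    """
    merged = []
    seen = set()
    for list_path in list_paths:
        for line in Path(list_path).read_text(encoding="utf-8").splitlines():
            value = line.strip().lower()
            if value and value not in seen:
                seen.add(value)
                merged.append(value)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

For example, `merge_mih_files(["company-a.txt", "company-b.txt"], "exclude.txt")` would produce one file that could then be referenced from a `NOT MIH_Plus:{exact=...}` search when processing Company C.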

Government Investigation with Multiple Agencies

Scenario:

A regulatory investigation involving FBI (EnCase), SEC (custom tools), and DOJ (Aid4Mail)

Challenge:

Multiple agencies need to coordinate without duplicating review efforts

Solution with Aid4Mail MIH+:

  1. FBI exports MIH+ list of already-reviewed emails
  2. SEC exports MIH+ list from their system
  3. DOJ uses Aid4Mail to process new custodian data, excluding reviewed items
  4. All agencies maintain separate MIH+ lists for coordination

Result:

  • ✅ No duplicate review across agencies
  • ✅ Faster investigation timeline
  • ✅ Clear audit trail for all parties
  • ✅ Defensible methodology for court proceedings

M&A Due Diligence

Scenario:

Company acquiring a competitor needs to review 10 years of email archives from multiple legacy systems (Exchange, Gmail, PST files)

Challenge:

Avoid re-reviewing emails already cleared by target company’s counsel

Solution with Aid4Mail MIH+:

  1. Receive MIH+ list of cleared emails from target company
  2. Collect all email sources with Aid4Mail
  3. Filter using MIH+ search to exclude cleared messages
  4. Focus review on new, uncleared content

Result:

  • ✅ 70% reduction in review volume
  • ✅ $500,000 saved in due diligence costs
  • ✅ Three weeks faster deal closure

8. Getting Started with MIH+ in Aid4Mail

Edition Requirements

MIH+ functionality is available in all Aid4Mail editions:

Aid4Mail Converter

299/year

Basic MIH+ support

  • File naming
  • Metadata extraction
  • Template tokens

Aid4Mail Investigator

999/year

Full MIH+ features

  • All Converter features
  • Search operators
  • Search lists

Aid4Mail Enterprise

4999/year

Unlimited scale

  • All Investigator features
  • CLI automation
  • Batch processing

Basic Configuration

Aid4Mail’s default configuration already generates MIH+ values optimally. No special setup is required.

Default Configuration:

App Settings > Sessions > File naming & duplicate detection
Generate hash value from: Message-ID header (EDRM MIH)

Using MIH+ for Cross-Platform Deduplication

Scenario: You want to exclude emails already reviewed in RelativityOne.

  1. Export MIH+ values from Relativity (or request them from the vendor)
  2. Save them as a text file, one MIH+ value per line:

     C:\Cases\Case-001\Relativity-MIH.txt
  3. In Aid4Mail, add the list to your filter script under Session Settings > Filters > Item filtering > Search query:

     NOT MIH_Plus:{exact=C:\Cases\Case-001\Relativity-MIH.txt}
  4. Process normally. Aid4Mail will exclude all matching emails.

Exporting MIH+ Lists

To create a MIH+ list from your Aid4Mail collection:

  1. Select Plain Text as target format
  2. Enable Export to a single text file
  3. Under Email header configuration, choose Only EDRM MIH values
  4. Apply desired filters (e.g., Class:responsive if using AI classification)
  5. Process

Result: A text file with one MIH+ value per line, ready to share with other platforms.

9. Frequently Asked Questions

What’s the difference between MIH and MIH+?

EDRM MIH generates hash values from the Message-ID field. If an email lacks a Message-ID (drafts, outgoing messages), it returns a null value.

MIH+ is Aid4Mail’s enhancement that guarantees a non-null hash for every email by using alternative metadata (sender + date + subject) when Message-ID is missing. For emails with Message-ID, MIH+ produces identical values to EDRM MIH, ensuring full compatibility.

Can I use Aid4Mail’s MIH+ with other eDiscovery platforms?

Yes, absolutely. Aid4Mail’s MIH+ is fully compatible with the EDRM MIH standard. You can:

  • Export MIH+ lists and import them into Relativity, Nuix, or other platforms
  • Import MIH lists from other vendors and use them in Aid4Mail searches
  • Share MIH+ values with opposing counsel or co-counsel
  • Use MIH+ for cross-platform deduplication without data loss

Does using MIH+ slow down processing?

No. Aid4Mail uses a dual-hash architecture:

  • During processing: Aid4Mail uses a high-speed Int64 hash for lightning-fast deduplication (10× faster than MD5)
  • For export/compatibility: MIH+ MD5 hashes are generated on-demand for cross-platform use

This ensures you get the best of both worlds: maximum speed during processing and full EDRM compatibility for cross-platform workflows.

Which Aid4Mail edition do I need for MIH+ search operators?

Aid4Mail Investigator or Enterprise.

MIH+ search operators (like MIH_Plus:{exact=file.txt}) require advanced filtering capabilities available in Investigator (999/year) or Enterprise (4999/year) editions.

Aid4Mail Converter (299/year) supports MIH+ for file naming, metadata extraction, and template tokens, but not search operators.

How much can I save using MIH+ deduplication?

Typical savings range from $300,000 to $500,000 per case.

This comes from:

  • 30–60% reduction in review volume = lower attorney fees
  • 80–90% hosting cost savings = $24,000–$41,000 per TB annually
  • Eliminated reprocessing costs across vendors
  • Faster case resolution = weeks saved

Example: A 500,000-email case with 40% duplication and $2/email review cost saves approximately $400,000 using MIH+ deduplication.

Is EDRM MIH accepted in court?

Yes. The EDRM is a globally recognized framework developed by industry leaders and accepted by courts worldwide. The MIH standard:

  • Follows industry best practices
  • Provides transparent, auditable methodology
  • Maintains defensible chain of custody
  • Complies with FRCP, GDPR, and other regulations

Aid4Mail’s MIH+ implementation extends the standard while maintaining full compatibility, ensuring defensibility in legal proceedings.

Can I use MIH+ with emails that don’t have Message-IDs?

Yes—that’s exactly what MIH+ was designed for.

Standard EDRM MIH returns null values for emails without Message-IDs (drafts, outgoing messages, corrupted headers). MIH+ solves this by:

  • Using alternative metadata (sender + date + subject) to generate hash values
  • Ensuring 100% of emails have identifiable hash values
  • Maintaining compatibility with EDRM MIH for emails that do have Message-IDs

This means you get complete deduplication coverage instead of the 80–90% coverage of standard MIH.

Ready to Transform Your Email Deduplication?

Experience the power of EDRM MIH+ with Aid4Mail. Eliminate vendor lock-in, reduce costs by 30–60%, and deduplicate 100% of your emails across any platform.