Published in issue #001 of The Strategist, Strategic Discovery's quarterly e-newsletter

Deconstructing Deduplication

By Amy Hirschkron

With the soaring costs of processing and reviewing electronic data, removing duplicative data from a collection is clearly of great value. Deduplication not only considerably lowers the cost of processing and review, it also significantly shortens review time and reduces the possibility of producing identical documents with contradictory review calls.

“But what exactly constitutes a duplicate file or email?” you ask. The Federal Rules of Evidence offer the following: “[a] ‘duplicate’ is a counterpart produced by the same impression as the original, or from the same matrix,…or by mechanical or electronic re-recording, or by chemical reproduction, or by other equivalent techniques which accurately reproduces the original.” F.R.E. 1001(4). Crystal clear now, right? Or maybe not.

Are two virtually identical emails, one with several extra spaces in the body, to be considered duplicates? Another example of such “functional” duplicates is an email sent back and forth several times between two parties. Each copy of the email contains portions of the conversation, but only the final email in the chain contains the entire thread. Is it too much of a stretch to conclude that all prior emails making up the thread are functional or “near” duplicates of the final email and could therefore be removed from the dataset? If the contents of two files are the same, but the names of the files are not, are those two files duplicates? The F.R.E. provide a guideline, but you must make these calls. How you define a duplicate in a specific case will lead you to the selection of the appropriate deduplication method.

There are three methods typically used for literal deduplication (deduping, as it is affectionately called): 1) comparing hash values, 2) comparing file metadata, and 3) comparing hash values of selected metadata fields or text, such as the body of an email.

A hash value is an electronic fingerprint: a fixed-length value that a hashing algorithm calculates from the contents of a file. There are numerous hashing algorithms. The two most commonly used for electronic discovery deduplication are MD5 and SHA-1. The MD5 algorithm calculates 128-bit values and the SHA-1 algorithm calculates 160-bit values. The chance that two different documents accidentally resolve to the same hash value is infinitesimal, so by comparing hash values you can identify documents with identical contents.
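As a rough illustration of the fingerprint idea, the sketch below uses Python's standard hashlib module to hash three short byte strings. The sample text and the helper name content_hashes are made up for this example; a real processing tool would hash the full contents of each file.

```python
import hashlib

def content_hashes(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of a document's raw bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

# Hypothetical sample contents, not taken from any real collection.
original = content_hashes(b"Quarterly forecast attached.")
exact_copy = content_hashes(b"Quarterly forecast attached.")
extra_space = content_hashes(b"Quarterly forecast  attached.")  # one added space

print(original == exact_copy)    # identical bytes give identical hashes
print(original == extra_space)   # a single extra space changes both hashes
print(len(original["md5"]) * 4, len(original["sha1"]) * 4)  # digest sizes in bits
```

Each hex character encodes four bits, which is why the 32-character MD5 digest and the 40-character SHA-1 digest correspond to 128 and 160 bits.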

Two non-email files with identical content but different filenames or timestamps will have the same hash value. In cases where it is not desirable to remove such files from the review set, it would be more appropriate to compare standalone files based on metadata.
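The difference between the two comparisons can be sketched in a few lines of Python. The file names, contents, and the choice of name plus size as the metadata key are assumptions made for illustration only; which metadata fields to compare is a decision for each case.

```python
import hashlib
from collections import defaultdict

# Hypothetical standalone files: (filename, raw contents).
files = [
    ("budget_v1.xls", b"Q1 totals: 100"),
    ("budget_final.xls", b"Q1 totals: 100"),   # same contents, different name
    ("notes.txt", b"Call opposing counsel"),
]

def group_by(records, key):
    """Group filenames by whatever key function defines a 'duplicate'."""
    groups = defaultdict(list)
    for name, data in records:
        groups[key(name, data)].append(name)
    return list(groups.values())

# Method 1: the content hash alone ignores the filename.
by_hash = group_by(files, lambda name, data: hashlib.sha1(data).hexdigest())

# Method 2 (assumed key): filename plus size, so renamed copies stay separate.
by_metadata = group_by(files, lambda name, data: (name, len(data)))

print(by_hash)      # the two budget files fall into one group
print(by_metadata)  # every file is its own group
```

Under the hash comparison only one of the two budget files would survive; under the metadata comparison both remain in the review set.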

Metadata is data about data. It documents data attributes such as name, size, and type; it records data structures such as length, fields, and columns; and it details data properties such as where data is located, how it is associated, and who owns it.

Deduplication based on comparison of metadata is particularly effective for email, where minute differences in formatting not apparent to the user can cause changes in hash values. Deduping based on metadata will have varying results, depending on which metadata fields are compared. For instance, if the number of attachments is not one of the fields compared, two emails with identical content, one with an attachment and one without, sent by the same party to two different people, may be considered duplicates.
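The attachment example can be reproduced with a small sketch of the third method, hashing selected fields. Everything here is hypothetical: the field names, the sample emails, and the decision to collapse runs of whitespace before hashing are assumptions for illustration, not a description of how any particular tool works.

```python
import hashlib

def email_key(email: dict, fields: tuple) -> str:
    """Hash only the chosen fields, collapsing runs of whitespace first."""
    parts = [" ".join(str(email[f]).split()) for f in fields]
    # A control character separates fields so adjacent values cannot blur together.
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()

# Two hypothetical emails: same sender, subject, and text, but one has an
# extra space in the body and carries an attachment.
with_attachment = {"sender": "bob@example.com", "subject": "Contract",
                   "body": "See  the draft.", "attachments": 1}
without_attachment = {"sender": "bob@example.com", "subject": "Contract",
                      "body": "See the draft.", "attachments": 0}

narrow = ("sender", "subject", "body")
wide = narrow + ("attachments",)

print(email_key(with_attachment, narrow) == email_key(without_attachment, narrow))
print(email_key(with_attachment, wide) == email_key(without_attachment, wide))
```

With the narrow field list the two emails are treated as duplicates; adding the attachment count to the compared fields keeps them apart.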

After you have decided that you would like to dedupe and have defined what constitutes a duplicate in your case, you must decide whether you would like to dedupe globally. Global deduplication refers to removing duplicates across all custodians, instead of only removing duplicates found within the data of a single custodian. Global deduping removes more duplicates, but imposes two major restrictions.

First, once a custodian's data is included in global deduping, it will be impractical to later decide to exclude that custodian's data from review. Assume that a set of data, including data collected from custodian Bob, is globally deduplicated. Duplicate documents in the data of other custodians may be removed and not reviewed with the expectation that the “survivor” duplicate, residing in Bob's data, will be reviewed. If a decision is made after global deduplication to remove Bob's data, no copy of that particular document will be reviewed.
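The Bob scenario can be traced in a short Python sketch. The custodian names, the placeholder hash labels, and the rule that the first copy encountered becomes the survivor are all assumptions for illustration; actual tools may pick survivors differently.

```python
def dedupe(docs, global_scope: bool):
    """Keep the first copy of each hash, either across all custodians
    (global) or separately within each custodian's data."""
    seen = set()
    survivors = []
    for custodian, doc_hash in docs:
        key = doc_hash if global_scope else (custodian, doc_hash)
        if key not in seen:
            seen.add(key)
            survivors.append((custodian, doc_hash))
    return survivors

# Hypothetical collection: (custodian, hash of the document).
docs = [("Bob", "h1"), ("Bob", "h1"), ("Alice", "h1"), ("Alice", "h2")]

per_custodian = dedupe(docs, global_scope=False)
globally = dedupe(docs, global_scope=True)

# Later decision to drop Bob from the review after global deduping:
without_bob = [d for d in globally if d[0] != "Bob"]

print(per_custodian)  # Alice keeps her own copy of h1
print(globally)       # only Bob's copy of h1 survives
print(without_bob)    # h1 is no longer in the review set at all
```

Global deduping leaves two documents to review instead of three, but once Bob is removed the document with hash h1 disappears even though Alice also held a copy.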

Second, when only one copy of a document that existed in multiple locations in a collection is produced, it may be difficult to say which custodian originally owned it. This poses challenges in productions which are organized by custodian. If custodian information is not being identified in the production and custodians will not be removed from the dataset, global deduplication may be appropriate.

Deduplication can save enormous amounts of money and time. You must define what a dupe is in your case (functional v. exact, global v. custodian level), and that definition will determine the method of deduplication that should be used. No approach is inherently better than another; what matters is what works best given the requirements of a particular case.

Amy Hirschkron is a consultant with Strategic Discovery, Inc.
E-mail: ahirschkron@strategicdiscovery.com

© 2008 Strategic Discovery, Inc.