Published in issue #001 of The
Strategist, Strategic Discovery's quarterly e-newsletter
Deconstructing Deduplication
By Amy Hirschkron
With
the soaring costs of processing and reviewing electronic data, removing
duplicative data from a collection is clearly of great value.
Deduplication not only considerably lowers the cost of processing and
review, it also significantly shortens review time and reduces the
possibility of producing identical documents with contradictory review
calls.
“But
what exactly constitutes a duplicate file or email?” you ask. The
Federal Rules of Evidence offer the following: “[a]
‘duplicate’ is a counterpart produced by the same
impression as the original, or from the same matrix,…or by
mechanical or electronic re-recording, or by chemical reproduction,
or by other equivalent techniques which accurately reproduces the
original.” F.R.E. 1001(4). Crystal clear now, right? Or maybe
not.
Are
two virtually identical emails, one with several extra spaces in the
body, to be considered duplicates? Another such example of
“functional” duplicates would be an email sent back and
forth several times between two parties. Each copy of the email
contains portions of the conversation, but only the final email in the
chain contains the entire thread. Is it too much of a stretch to
conclude that all prior emails making up the thread are functional or
“near” duplicates of the final email and could therefore be
removed from the dataset? If the contents of two files are the same,
but the names of the files are not the same, are those two files
duplicates? The F.R.E. provide a guideline, but you must make these
calls. How you define a duplicate in a specific case will lead you to
the selection of the appropriate deduplication method.
There
are three methods typically used for literal deduplication (deduping,
as it is affectionately called): 1) comparing hash values, 2) comparing
file metadata, and 3) comparing hash values of selected metadata fields
or text, such as the body of an email.
A
hash value is an electronic fingerprint that a hashing algorithm
calculates from the contents of a file. There are numerous hashing
algorithms. The two most commonly used for electronic discovery
deduplication are MD5 and SHA-1. MD5 produces 128-bit values and SHA-1
produces 160-bit values. The chance that different documents
resolve to the same hash values is infinitesimal, so by comparing hash
values you can identify documents with identical contents.
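As a rough illustration of hash-based deduping, the sketch below uses
Python's standard hashlib module. The file names and contents are
invented for the example, and keeping the first copy seen is only one
possible way to pick the surviving document.

```python
import hashlib


def file_hashes(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of a file's contents."""
    return {
        "md5": hashlib.md5(data).hexdigest(),    # 128 bits = 32 hex characters
        "sha1": hashlib.sha1(data).hexdigest(),  # 160 bits = 40 hex characters
    }


def dedupe_by_hash(files: dict) -> dict:
    """Keep the first file seen for each distinct SHA-1 digest.

    `files` maps a (hypothetical) file name to its contents as bytes.
    Returns a mapping of digest -> name of the surviving file.
    """
    survivors = {}
    for name, data in files.items():
        digest = hashlib.sha1(data).hexdigest()
        survivors.setdefault(digest, name)  # later copies are treated as duplicates
    return survivors


# Hypothetical collection: two files share identical contents.
collection = {
    "memo.doc": b"Quarterly results attached.",
    "memo_copy.doc": b"Quarterly results attached.",
    "notes.txt": b"Call Bob on Tuesday.",
}

hashes = file_hashes(collection["memo.doc"])
survivors = dedupe_by_hash(collection)
print(sorted(survivors.values()))
```

Only the file contents feed the hash here, so the file name plays no
part in deciding what counts as a duplicate.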
Two
non-email files with identical content but different filenames or
timestamps will have the same hash value. In cases where it is not
desirable to remove such files from the review set, it would be more
appropriate to compare standalone files based on metadata.
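A minimal sketch of a metadata comparison for standalone files follows.
The choice of name, size, and modification time as the compared fields
is an assumption made for illustration, as are the FileRecord type and
the sample records; a real case may call for different fields.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    """Hypothetical record of one collected file."""
    name: str
    size: int
    modified: str  # timestamp kept as text for simplicity
    sha1: str      # content hash, carried along but not compared here


def metadata_key(record: FileRecord) -> tuple:
    """Build the comparison key from the chosen metadata fields."""
    return (record.name.lower(), record.size, record.modified)


def dedupe_by_metadata(records: list) -> list:
    """Keep the first record seen for each distinct metadata key."""
    seen = set()
    kept = []
    for record in records:
        key = metadata_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


records = [
    FileRecord("budget.xls", 2048, "2005-03-01T09:00", "abc123"),
    # Same contents (same hash) under a different name: kept for review.
    FileRecord("budget_final.xls", 2048, "2005-03-02T17:30", "abc123"),
    # Same name, size, and timestamp as the first record: removed.
    FileRecord("budget.xls", 2048, "2005-03-01T09:00", "abc123"),
]

kept = dedupe_by_metadata(records)
print([r.name for r in kept])
```

A hash-only comparison would have reduced these three records to one;
the metadata key leaves the renamed copy in the review set.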
Metadata
is data about data. It documents data attributes such as name, size,
and type; it records data structures such as length, fields, and
columns; and it details data properties such as where data is located,
how it is associated, and who owns it.
Deduplication
based on comparison of metadata is particularly effective for email,
where minute differences in formatting not apparent to the user can
cause changes in hash values. Deduping based on metadata will have
varying results, depending on the metadata that is compared. For
instance, if the number of attachments for email is not one of the
fields that is compared, two emails with identical content, one with an
attachment and one without, sent by the same party to two different
people, may be considered duplicates.
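The effect of field selection can be sketched as follows. The field
names, the whitespace normalization, and the sample emails are all
assumptions made for illustration; the sketch hashes the selected
fields so that each email reduces to a single comparable value.

```python
import hashlib


def email_key(email: dict, fields: list) -> str:
    """Hash the selected fields of an email into one comparison value."""
    parts = []
    for field in fields:
        value = str(email.get(field, ""))
        # Collapse runs of whitespace so stray spaces do not change the key.
        parts.append(" ".join(value.split()))
    # Join with a separator character that is unlikely to appear in the fields.
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Two hypothetical emails from the same sender to two different people.
email_a = {
    "sender": "amy@example.com",
    "to": "bob@example.com",
    "subject": "Budget",
    "body": "Please review.",
    "attachment_count": 1,
}
email_b = {
    "sender": "amy@example.com",
    "to": "carol@example.com",
    "subject": "Budget",
    "body": "Please  review. ",  # extra spaces only
    "attachment_count": 0,
}

basic_fields = ["sender", "subject", "body"]
strict_fields = basic_fields + ["attachment_count"]

same_without_count = email_key(email_a, basic_fields) == email_key(email_b, basic_fields)
same_with_count = email_key(email_a, strict_fields) == email_key(email_b, strict_fields)
print(same_without_count, same_with_count)
```

With the basic fields the two emails are treated as duplicates even
though only one carries an attachment; adding the attachment count to
the compared fields keeps them apart.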
After
you have decided that you would like to dedupe and have defined what
constitutes a duplicate in your case, you must decide whether you would
like to globally dedupe. Global deduplication refers to removing
duplicates across all custodians, instead of only removing duplicates
found within the data of a single custodian. Global deduping removes
more duplicates, but imposes two major restrictions.
First,
once a custodian's data is included in global deduping, it will be
impractical to later decide to exclude that custodian's data from
review. Assume that a set of data, including data collected from
custodian Bob, is globally deduplicated. Duplicate documents in the
data of other custodians may be removed and not reviewed with the
expectation that the “survivor” duplicate, residing in
Bob's data, will be reviewed. If a decision is made after global
deduplication to remove Bob's data, no copy of that particular document
will be reviewed.
Second,
when only one copy of a document that existed in multiple locations in
a collection is produced, it may be difficult to say which custodian
originally owned it. This poses challenges in productions which are
organized by custodian. If custodian information is not being
identified in the production and custodians will not be removed from
the dataset, global deduplication may be appropriate.
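The difference between the two scopes, and the first restriction
described above, can be sketched in a few lines. The custodian Alice,
the placeholder hash values, and the first-seen survivor rule are
assumptions made for illustration.

```python
def dedupe(documents: list, global_scope: bool) -> list:
    """Remove duplicates from (custodian, doc_hash) pairs in processing order.

    With global_scope=True a hash is kept once across all custodians;
    otherwise it is kept once per custodian.
    """
    seen = set()
    kept = []
    for custodian, doc_hash in documents:
        key = doc_hash if global_scope else (custodian, doc_hash)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, doc_hash))
    return kept


# Hypothetical collection: "h1" exists twice in Bob's data and once in Alice's.
docs = [("Bob", "h1"), ("Bob", "h1"), ("Alice", "h1"), ("Alice", "h2")]

global_kept = dedupe(docs, global_scope=True)
custodian_kept = dedupe(docs, global_scope=False)

# Dropping Bob after global deduping also drops the only surviving copy of "h1".
after_removing_bob = [doc for doc in global_kept if doc[0] != "Bob"]
still_reviewed = {doc_hash for _, doc_hash in after_removing_bob}

print(len(global_kept), len(custodian_kept), "h1" in still_reviewed)
```

Global deduping leaves two documents to review instead of three, but
once Bob's data is removed no copy of "h1" remains in the review set.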
Deduplication
can save enormous amounts of money and time. You must define what a
dupe is in your case (functional v. exact, global v. custodian
level), and that will determine the method of deduplication that should
be used. No approach is inherently better than another; what matters
is what works best given the requirements of a particular case.
Amy Hirschkron is a
consultant with Strategic Discovery, Inc.
E-mail:
ahirschkron@strategicdiscovery.com