Streetlight Effect in E-Discovery

01Aug2013

By Craig Ball

In the wee hours, a beat cop sees a drunken lawyer crawling round under a streetlight searching for something. The cop asks, “What’s this now?” The lawyer looks up and says, “I’ve lost my keys.” They both search for a while, until the cop asks, “Are you sure you lost them here?” “No, I lost them in the park,” the tipsy lawyer explains, “but the light’s better over here.”

I told that groaner in court, trying to explain why opposing counsel’s insistence that we blindly supply keywords to be run against the e-mail archive of a Fortune 50 insurance company wasn’t a reasonable or cost-effective approach to e-discovery. The “Streetlight Effect,” described by David H. Freedman in his 2010 book Wrong, is a species of observational bias where people tend to look for things in the easiest ways. It neatly describes how lawyers approach electronic discovery. We look for responsive ESI only where and how it’s easiest, with little consideration of whether our approaches are calculated to find it.

Easy is wonderful when it works, but looking where it’s easy when failure is assured is something no sober-minded counsel should accept and no sensible judge should allow.

Consider The Myth of the Enterprise Search. Counsel inside and outside companies, and lawyers on both sides of the docket, believe that companies can run keyword searches against their myriad silos of data: mail systems, archives, local drives, network shares, portable devices, removable media, and databases. They imagine that finding responsive ESI hinges on the ability to incant magic keywords like Harry Potter. Documentum Relevantus!

Though data repositories may share common networks, they rarely share common search capabilities or syntax. Repositories that offer keyword search may not support Boolean constructs (queries using “AND,” “OR,” and “NOT”), proximity searches (Word1 near Word2), stemming (finding “adjuster,” “adjusting,” “adjusted,” and “adjustable”) or fielded searches (restricted to just addressees, subjects, dates, or message bodies). Searching databases entails specialized query languages or user privileges. Moreover, different tools extract text and index such extractions in quite different ways, with the upshot that a document found on one system may not be found on another using the same query.
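To see why stemming support matters, here is a minimal sketch (not any vendor’s actual engine, and the crude suffix-stripping stemmer is purely illustrative): a literal index matches only exact tokens, so a search for “adjuster” misses a document that says “adjusting” unless both query and index are stemmed.

```python
def tokenize(text):
    # Naive whitespace tokenizer for the sketch.
    return text.lower().split()

docs = {
    1: "the adjuster inspected the roof",
    2: "we are adjusting the claim total",
}

def literal_search(term):
    # Matches only the exact token as typed.
    return {doc_id for doc_id, text in docs.items() if term in tokenize(text)}

def stem(token):
    # Crude suffix-stripping stand-in for a real stemmer.
    for suffix in ("ing", "er", "ed", "able"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

def stemmed_search(term):
    # Stems both the query and the indexed tokens before comparing.
    root = stem(term)
    return {
        doc_id
        for doc_id, text in docs.items()
        if root in (stem(t) for t in tokenize(text))
    }

print(literal_search("adjuster"))   # finds only doc 1
print(stemmed_search("adjuster"))   # stems to "adjust"; finds both docs
```

The same query thus returns different result sets depending solely on a capability of the engine, not on the contents of the collection.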

But the Streetlight Effect is nowhere more insidious than when litigants use keyword searches against archives, e-mail collections, and other sources of indexed ESI.

That Fortune 50 company—call it All City Indemnity—collected a gargantuan volume of e-mail messages and attachments in a process called “message journaling.” Journaling copies every message traversing the system into an archive where the messages are indexed for search. Keyword searches only look at the index, not the messages or attachments; so, if you don’t find it in the index, you won’t find it at all.
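The mechanics can be sketched in a few lines of Python (a toy inverted index, not the architecture of any real archiving product): queries consult only the index, never the original messages, so anything the indexer failed to extract is simply invisible to search.

```python
from collections import defaultdict

def build_index(messages, extract_text):
    # Map each token to the set of message IDs containing it.
    index = defaultdict(set)
    for msg_id, message in messages.items():
        for token in extract_text(message).lower().split():
            index[token].add(msg_id)
    return index

# Suppose the indexer cannot pull text from a scanned attachment:
messages = {
    "m1": {"body": "roof damage estimate attached", "attachment_text": ""},
    "m2": {"body": "see attached", "attachment_text": ""},  # scan: no text extracted
}

index = build_index(messages, lambda m: m["body"] + " " + m["attachment_text"])

def search(term):
    # The search never revisits the messages themselves.
    return index.get(term.lower(), set())

print(search("roof"))      # {'m1'}
print(search("estimate"))  # {'m1'} — m2's scanned estimate is never found
```

If the word exists only in an attachment whose text was never extracted, no keyword search of the index will ever surface it.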

All City gets sued every day. When a request for production arrives, they run keyword searches against their massive mail archive using a tool we’ll call Truthiness. Hundreds of big companies use Truthiness, or software just like it, and blithely expect their systems will find all documents containing the keywords.

They’re wrong…or in denial.

If requesting parties don’t force opponents like All City to face facts, All City and its ilk will keep pretending their tools work better than they do, and requesting parties will keep getting incomplete productions. To force the epiphany, consider an interrogatory like this:

For  each  electronic  system  or  index  that  will  be  searched  to  respond  to discovery, please state:

a. The rules employed by the system to tokenize data so as to make it searchable;
b. The stop words used when documents, communications, or ESI were added to the system or index;
c. The number and nature of documents or communications in the system or index which are not searchable as a consequence of the system or index being unable to extract their full text or metadata; and
d. Any limitation in the system or index, or in the search syntax to be employed, tending to limit or impair the effectiveness of keyword, Boolean, or proximity search  in identifying documents or communications that a reasonable person would understand to be responsive to the search.

A court will permit “discovery about discovery” like this when a party demonstrates why an inadequate index is a genuine problem. So let’s explore the rationale behind each inquiry:

Tokenization Rules – When machines search collections of documents for keywords, they rarely search the documents for matches; instead, they consult an index of words extracted from the documents. Machines cannot read, so the characters in the documents are identified as “words” because their appearance meets certain rules in a process called “tokenization.” Tokenization rules aren’t uniform across systems or software. Many indices simply don’t index short words (e.g., acronyms). None index single letters or numbers.

Tokenization rules also govern such things as the handling of punctuated terms (as in a compound word like “wind-driven”), case (will a search for “roof” also find “Roof”?), diacriticals (will a search for “Rene” also find “René”?) and numbers (will a search for “Clause 4.3” work?). Most people simply assume these searches will work. Yet, in many search tools and archives, they don’t work as expected, or don’t work at all, unless steps are taken to ensure that they do.
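Two hypothetical tokenizers make the point concrete. Neither represents a real product; they simply show how different rules for token length, hyphenation, accents, and numbers decide, before any search is run, what a keyword query can ever find.

```python
import re
import unicodedata

def tokenizer_a(text):
    # Splits on anything non-alphanumeric; drops tokens under 3 characters.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text) if len(t) >= 3]

def tokenizer_b(text):
    # Keeps hyphenated terms whole and strips accents, but drops pure numbers.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return [t.lower() for t in re.findall(r"[A-Za-z][\w-]*", text)]

sample = "René reviewed the wind-driven claim under Clause 4.3"

print(tokenizer_a(sample))  # splits "wind-driven"; mangles "René"; drops "4" and "3"
print(tokenizer_b(sample))  # keeps "wind-driven"; folds "René" to "rene"; drops "4.3"
```

A search for “wind-driven” succeeds only on the second index; a search for “4.3” succeeds on neither. Same document, same query, different answers, purely because of tokenization.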

Stop Words – Some common “stop words” or “noise words” are simply excluded from an index when it’s compiled. Searches for stop words fail because the words never appear in the index. Stop words aren’t always trivial omissions. If “all” and “city” are stop words, a search for “All City” will fail to turn up documents containing the company’s own name! Words like side, down, part, problem, necessary, general, goods, needing, opening, possible, well, years, and state are examples of common stop words. Computer systems typically employ dozens or hundreds of stop words when they compile indices.

Because users aren’t warned that searches containing stop words fail, they mistakenly assume that there are no responsive documents when there may be thousands. A search for “All City” would miss millions of documents at All City Indemnity (though it’s folly to search a company’s files for the company’s name).
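A sketch of stop-word filtering at indexing time (the stop-word list here is invented for illustration) shows how the failure is silent: the words are discarded as the index is built, so a later query containing only stop words returns nothing, with no warning that they were never indexed.

```python
# Hypothetical stop-word list; real systems ship dozens or hundreds.
STOP_WORDS = {"all", "city", "the", "a", "and", "of", "well", "state"}

def index_tokens(text):
    # Stop words are dropped before they ever reach the index.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

doc = "All City Indemnity denied the claim"
indexed = index_tokens(doc)

print(indexed)           # ['indemnity', 'denied', 'claim']
print("all" in indexed)  # False — the company's own name is unfindable
```

Nothing in the search interface distinguishes “no responsive documents exist” from “your query terms were never indexed.”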

Non-searchable Documents – A great many documents are not amenable to text search without special handling. Common examples of non-searchable documents are faxes and scans, as well as TIFF images and some Adobe PDF documents. While no system will be flawless in this regard, it’s important to determine how much of a collection isn’t text searchable, what’s not searchable, and whether the portions of the collection that aren’t searchable are of particular importance to the case. If All City’s adjusters attached scanned receipts and bids to e-mail messages, the attachments aren’t keyword searchable absent optical character recognition (OCR).

Other documents may be inherently text searchable but not made a part of the index because they’re password protected (i.e., encrypted) or otherwise encoded or compressed in ways that frustrate indexing of their contents. Important documents are often password protected.
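Quantifying the problem can be as simple as a triage pass over the collection. This hypothetical sketch (the field names are invented) flags items whose text layer is empty, such as scans needing OCR, or that could not be opened because they are encrypted, so counsel can state what share of the collection keyword search will never reach.

```python
# Hypothetical extracted-text records for three collected documents.
collection = [
    {"name": "policy.docx", "text": "coverage terms apply", "encrypted": False},
    {"name": "receipt.tif", "text": "", "encrypted": False},   # scan: no text layer
    {"name": "bid.pdf", "text": None, "encrypted": True},      # password protected
]

def triage(items):
    # An item is unsearchable if it is encrypted or yielded no text.
    unsearchable = [
        item["name"]
        for item in items
        if item["encrypted"] or not (item["text"] or "").strip()
    ]
    return unsearchable, len(unsearchable) / len(items)

names, share = triage(collection)
print(names)                                                    # ['receipt.tif', 'bid.pdf']
print(f"{share:.0%} of the collection is not text-searchable")  # 67%
```

Knowing that figure, and which documents make it up, is what interrogatory (c) above is designed to elicit.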

Other Limitations – If a party or counsel knows that the systems or searches used in
e-discovery will fail to perform as expected, they should be obliged to affirmatively disclose such shortcomings. If a party or counsel is uncertain whether systems or searches work as expected, they should be obliged to find out by, e.g., running tests to be reasonably certain.

No system is perfect, and perfect isn’t the e-discovery standard. Often, we must adapt to the limitations of systems or software. But you have to know what a system can’t do before you can find ways to work around its limitations or set expectations consistent with actual capabilities, not magical thinking and unfounded expectations.

About the Author

Craig Ball of Austin is a Board Certified trial lawyer, certified computer forensic examiner, and electronic evidence expert. He’s dedicated his globetrotting career to teaching the bench and bar about forensic technology and trial tactics. After decades trying lawsuits, Craig now limits his practice to service as a court-appointed special master and consultant in computer forensics and electronic discovery and to publishing and lecturing on computer forensics, emerging technologies, digital persuasion, and electronic discovery. Craig writes the award-winning Ball in Your Court column on electronic discovery for Law Technology News and is the author of numerous articles on e-discovery and computer forensics, many available at www.craigball.com. Craig Ball has consulted or served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and well-known cases in the U.S.

Craig Ball © 2012