Best Practices in Predictive Coding: When are Pre-Culling and Keyword Searching Defensible?



By Jim Eidelman


Predictive coding is an effective e-discovery tool for ranking large sets of documents. However, it is commonly performed in a manner that may be severely under-inclusive and, therefore, raises concerns about its defensibility.

In the use of predictive coding, it is a common practice for the producing party to run keyword searches first and then sample and rank the resulting documents.  The documents that don’t hit on the searches are culled out before reaching the predictive coding process.

The reasons for doing it this way are:

  • Keyword searching is an accepted standard in e-discovery.
  • The client can avoid the per-document cost for the predictive coding software.
  • It reduces the number of documents that need to be reviewed and produced, which reduces time, cost, and risk.

However, this approach ignores the “dirty little secret” of e-discovery search—that keyword searches leave behind a large set of responsive/relevant documents.

Rank ALL (or Most) Documents, Not Just the Hits

The landmark e-discovery study about keyword searching was published in 1985 by Blair and Maron. David C. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 COMMC’NS OF THE ACM 289, 295 (1985). In that study, the attorneys were confident that their searches had found more than 75% of the responsive documents. But they were wrong. In fact, the searches had found only 20% of the relevant documents.

This was the only major study on the subject for many years. Despite its findings, keyword searching became the accepted practice, largely because it was the best approach available. More recently, however, studies by TREC and others have confirmed that Blair and Maron were right. TREC’s 2008 study found that keyword searching returned an average of just 24% of responsive documents. In its 2007 study, the result was 22%. Other studies have found even lower recall.
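The gap between perceived and actual performance here is a question of recall: the fraction of all responsive documents a search actually retrieved. A minimal sketch of that arithmetic, using a hypothetical collection size (the 75% and 20% figures come from the Blair and Maron study; the document counts are invented for illustration):

```python
def recall(found_responsive: int, total_responsive: int) -> float:
    """Fraction of all responsive documents a search actually retrieved."""
    return found_responsive / total_responsive

# Hypothetical collection containing 5,000 truly responsive documents.
total_responsive = 5000

believed = recall(3750, total_responsive)  # attorneys' estimate: 75%
actual = recall(1000, total_responsive)    # measured result: 20%

print(f"Believed recall: {believed:.0%}")  # prints "Believed recall: 75%"
print(f"Actual recall:   {actual:.0%}")    # prints "Actual recall:   20%"
```

The point is that recall can only be measured against the full set of responsive documents, which is exactly what a keyword-hits-only workflow never examines.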

While sampling the leave-behinds (the documents that hit no search term) and iteratively adding terms discovered during the review can solve part of the problem, plenty of documents will still be omitted from the review and production.

This is the reason that going through the predictive coding process against ALL documents, rather than just the keyword search hits, is generally the most defensible practice.  Predictive coding based on sampling ALL the documents will find documents that the keyword searches miss.

How to use Keyword Searching with Predictive Coding

Does this mean that you don’t need keyword searching and other pre-culling techniques?  Of course not. Here are some of the ways we use keyword searching in conjunction with predictive coding:

  • Junk removal. We always analyze the documents for “junk” that can obviously be removed.  For example, in the Enron collection, it certainly makes sense to cull out the fantasy football documents first.  In a securities case, there are usually huge numbers of email market letters and irrelevant stock recommendations that can defensibly be removed.
  • Boosting “richness.” In most cases, the Predictive Ranking software won’t work as well if the ratio of relevant documents to irrelevant documents (“richness percent”) is too low, meaning that the relevant documents are “sparse.”  It is legitimate and defensible to use keyword searching to boost the “richness” to a reasonable ratio before starting the predictive coding exercise.  Note, however, that it may make sense later to have the software rank the documents that didn’t hit on the keyword searches, as if it were a later rolling upload, to see if the software finds additional relevant documents.
  • Targeted searches. Often certain terms (or combinations of terms) will serve as a “rifle shot” to find important documents.  For example, in a patent case, a search for a technical term, such as a chemical name, may be important, especially when paired with the name of the inventor or the opposing party.  Targeted searching can be used at the beginning for sampling to find seed documents.  And, of course, it should be used throughout the review to find “rifle shot” documents and documents on sparsely populated issues that do not lend themselves to predictive coding.
  • Metadata. Many predictive coding applications only analyze text and not metadata.  Obviously, there are many times when searching metadata is a key to finding relevant documents or filtering out irrelevant ones.  For example, in finding privileged communications, it helps to look in the TO and FROM fields to see if there are attorneys and clients in them.  Similarly, date filters are critical in filtering out documents that may contain “relevant” terms but are irrelevant to the issues in the case because of the time frame.
  • Sampling and discrepancy analysis. While predictive coding applications typically include sampling methodologies, it is nevertheless a good idea to do additional sampling outside of the application, which can be done by searching for documents likely to be relevant. In particular, the discrepancy analysis, which compares software predictions with actual coding, will find documents that were given a low rank by the software.  Once that happens, you can go search for similar documents using “More Like This” and keyword searching.
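The “richness” check described above can be approximated by coding a simple random sample of the collection. This is a minimal sketch under assumed parameters (a 400-document sample and a 5% threshold are illustrative choices, not a standard), not any particular vendor’s tool:

```python
import random

def estimate_richness(documents, is_relevant, sample_size=400, seed=42):
    """Estimate the fraction of relevant documents by coding a random sample.

    `is_relevant` stands in for a human reviewer's coding decision.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(documents, min(sample_size, len(documents)))
    relevant = sum(1 for doc in sample if is_relevant(doc))
    return relevant / len(sample)

# Hypothetical collection: about 2% of documents mention the deal at issue.
docs = [{"mentions_deal": i % 50 == 0} for i in range(100_000)]
richness = estimate_richness(docs, lambda d: d["mentions_deal"])

# If richness is too low, a keyword search can concentrate the training set
# before the predictive coding exercise, as suggested above.
if richness < 0.05:
    print(f"Richness roughly {richness:.1%}: consider keyword boosting first.")
```

A reproducible seed and a documented sample size also make the measurement easier to defend later, since the same sample can be re-drawn and re-checked.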

The Bottom Line

If you run predictive coding with ALL the documents, and not just those that hit on keyword searches, you will find more relevant documents, so the process is more defensible.  But keyword searching and searching metadata are still critical tools in the e-discovery toolbox.

About the Author

A veteran trial lawyer and pioneer in legal technology, Jim Eidelman is the senior consultant on Catalyst’s Search and Analytics Consulting team. With more than three decades of hands-on experience providing technology advice to law firms and legal departments, he helps clients develop strategies to overcome their toughest e-discovery challenges.