Train, Don’t Cull, Using Keywords

01 Jun 2013

By Craig Ball

I’ve been thinking about how we implement technology-assisted review tools and particularly how to hang onto the on-again/off-again benefits of keyword search while steering clear of its ugliness. The rusty flivver that is my brain got a kick start from many insightful comments made at the recent Carmel Valley E-discovery Retreat in Monterey, California. As is often the case when the subject is technology-assisted review (by whatever name you prefer, dear reader: predictive coding, CAR, automated document classification, Francis), some of those kicks came from lawyer Maura Grossman and computer scientist Gordon Cormack.  So if you like where I go with this, credit them. If not, blame me for misunderstanding.

Maura and Gordon are the power couple of predictive coding, thanks to their thoughtful papers and presentations transmogrifying the metrics of NIST TREC into coherent observations concerning the efficacy of automated document classification. While they’re spinning straw into gold, I’m still studying it all; but from where I stand, they make a lot of sense.

Maura expressed the view that technology-assisted review tools shouldn’t be run against subset collections culled by keywords, but should instead be applied to the larger collection of ESI (i.e., the collection/sources against which keyword search might ordinarily have been deployed). The gist was, “use the tools against as much information as possible, and don’t hamstring the effort by putting old tools out in front of new ones.” (I’m not quoting here, but relating what I gleaned from the comment).

At the same Monterey conference, Judge Andrew Peck reminded us of the perils of GIGO (Garbage In: Garbage Out) when computers are mismanaged. The devil is very much in the details of any search effort but never more so than when one deploys predictive coding in e-discovery. Methodology matters.

If technology-assisted review were the automobile, we’d still be at the stage where drivers asked, “Where do I hook up my mules?” Our “mule” is keyword search.

When you position keyword search in front of predictive coding, that is, when you use keyword search to create the collection that predictive coding “sees,” the view doesn’t change much from the old ways. You’re still looking at the ass end of a mule. Breathe deep the funky fragrance of keyword search. Put axiomatically, no search technology can find a responsive document that’s not in the collection searched, and keyword search leaves most of the responsive documents out of the collection.

Keyword search can be very precise but at the expense of recall. It can achieve splendid recall scores but with abysmal precision. How, then, do we avail ourselves of the sometimes laser-like precision of keyword search without those awful recall in-laws coming to visit? Time and again research proves that keyword search performs far less effectively than we hope or expect. It misses 30-80% of the truly responsive documents and sucks in scads of non-responsive junk, hiding what it finds in a blizzard of blather.
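
To make that trade-off concrete, here is a minimal arithmetic sketch in Python. The numbers are invented for illustration (they are not drawn from any study): precision is the share of retrieved documents that are actually responsive, and recall is the share of all responsive documents that the search retrieved.

```python
# Hypothetical illustration: a keyword search returns 10,000 documents from a
# collection that contains 5,000 truly responsive documents. The counts below
# are invented for the arithmetic, not taken from any benchmark.

retrieved = 10_000            # documents the keyword search returned
responsive_total = 5_000      # truly responsive documents in the whole collection
responsive_retrieved = 2_000  # responsive documents among those returned

precision = responsive_retrieved / retrieved        # 0.20 -> only 20% of hits are responsive
recall = responsive_retrieved / responsive_total    # 0.40 -> 60% of responsive docs were missed

print(f"precision: {precision:.0%}, recall: {recall:.0%}")
```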

To be clear, that’s an established metric based on everyone else in the world. It doesn’t apply to YOU. YOU have the unique ability to frame fantastically precise and effective keyword searches like no one else. Likewise, all the findings about the laughably poor performance of human reviewers apply only to other reviewers, not to YOU. Tragically, not everyone has the immense good sense to employ YOU; so, let’s take YOU and what YOU can do out of the equation until human cloning is commonplace. Okay?

For all their shortcomings, mules are handy. When your Model-T gets stuck in the mud, a mule team can pull you out. Likewise, keyword search is a useful tool to pull us out of the sampling swamp and generate training sets. Using keywords, you’re more likely to rapidly identify some responsive documents than using random sampling alone. These, in turn, increase the likelihood that predictive coding tools will find other responsive documents in the broader collection of ESI sources. Good stuff in; good stuff out.
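
Commercial TAR tools keep their internals proprietary, but the idea can be illustrated with a toy sketch using a generic text classifier; the function and variable names below are mine, not any product’s. Keyword hits that a human has confirmed responsive become positive training examples, a reviewed sample supplies negatives, and the model then scores the entire collection rather than a keyword-culled subset.

```python
# A minimal sketch (not any vendor's actual workflow) of "train, don't cull":
# keyword hits seed the training set, but the classifier still scores the
# ENTIRE collection, so documents the keywords missed can still be found.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def score_collection(documents, keyword_hits_reviewed, nonresponsive_sample):
    """documents: texts of the full collection.
    keyword_hits_reviewed: keyword hits a human confirmed responsive (positives).
    nonresponsive_sample: reviewed documents confirmed non-responsive (negatives)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_all = vectorizer.fit_transform(documents)        # features for every document

    seed_texts = keyword_hits_reviewed + nonresponsive_sample
    seed_labels = [1] * len(keyword_hits_reviewed) + [0] * len(nonresponsive_sample)
    X_seed = vectorizer.transform(seed_texts)

    model = LogisticRegression(max_iter=1000).fit(X_seed, seed_labels)
    return model.predict_proba(X_all)[:, 1]            # responsiveness score per document
```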

With that in mind, I made the following diagram to depict how I think keyword search should be incorporated into TAR and how it shouldn’t. (George Socha is so much better at this sort of thing, so forgive my crude effort.)

I hope you’ll agree that the interposition of keyword search to cull the collection before it’s exposed to an automated document classification tool is wrong. But, in fairness, doing it the right way could come at a cost, depending upon how you approach the assembly and processing of potentially responsive ESI. If you have to pay significantly more to let the tool “see” significantly more data, then quality will be sacrificed on the altar of savings. How it shakes out in your case hinges on how you handle keyword search and what you’re charged for ingestion and hosting. Currently, many use keyword search via entirely separate tools and workflows to reduce the volume of information collected, processed, and hosted. Garbage in.

Another important caution in using keywords to train automated classification tools is the need to elevate precision over recall in framing searches, so that you don’t end up training your predictive classification tool to replicate the shortcomings of keyword search. If only 20 percent of the documents returned by keyword search are responsive, you don’t want to train the tool to find more documents like the 80 percent that are junk. So when, in the illustration above, I depict keyword search as a means to train technology-assisted review tools, please don’t read the line leading from keyword search to TAR as suggesting the usual guesswork approach to keyword search, or that you’ll simply dump keyword results into the tool. That’s like routing the exhaust pipe into the passenger compartment. The searches required need to be narrow, precise, surgical. They must jettison recall to secure precision and may even benefit from a soupçon of human review.
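
A toy sketch of that caution, with hypothetical helper names of my own invention: before keyword hits become training examples, a reviewer samples and confirms them, so only documents verified as responsive are promoted to positive seeds and the classifier never learns to imitate keyword junk.

```python
# Hypothetical illustration: promote keyword hits to training seeds only after
# a human reviewer confirms them, rather than dumping raw hits into the tool.
import random

def build_positive_seeds(keyword_hits, reviewer_says_responsive, sample_size=200):
    """keyword_hits: documents returned by narrow, surgical searches.
    reviewer_says_responsive: callable standing in for human review of one document."""
    sample = random.sample(keyword_hits, min(sample_size, len(keyword_hits)))
    confirmed = [doc for doc in sample if reviewer_says_responsive(doc)]
    # Only confirmed documents feed the classifier; unreviewed hits never
    # enter the training set wholesale.
    return confirmed
```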

For the promise of predictive coding to be fulfilled, workflows and pricing must strike a better balance between quality and cost. Yes, a technology that costs less when introduced at nearly any stage of the review process, and is no worse than the alternatives, is great, and arguably superior on that basis alone. But if that is all we seek when quality is also within easy reach, we do a disservice to justice. The societal and psychic benefits of a more trusted and accurate outcome to disputes cannot be overvalued. “Perfect” is not the standard, but neither is “screw it.”

About the Author

Craig Ball of Austin is a Board Certified trial lawyer, certified computer forensic examiner, and electronic evidence expert. He’s dedicated his globetrotting career to teaching the bench and bar about forensic technology and trial tactics. After decades trying lawsuits, Craig now limits his practice to service as a court-appointed special master and consultant in computer forensics and electronic discovery and to publishing and lecturing on computer forensics, emerging technologies, digital persuasion, and electronic discovery. Craig writes the award-winning Ball in Your Court column on electronic discovery for Law Technology News and is the author of numerous articles on e-discovery and computer forensics, many available at www.craigball.com. Craig Ball has consulted or served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and well-known cases in the U.S.

Craig Ball © 2012
