Two weeks ago I said I would write a blog revealing the secrets of search experts. I am referring to the few technophiles, lawyers, and scientists in the e-discovery world who specialize in the search for relevant electronic evidence in large chaotic collections of ESI such as email. I promised the exposé would include a secret deeply hidden in shadows, one only half-known by a few. Before I can get to the dark secret, I must lay bare a few other search secrets that are not so hidden.
A Secret of Search Already Known to Many
The first secret of search here exposed is the same kind of secret as those revealed in Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers. You probably have heard it already, especially if you have read Judge Peck’s famous wake-up call opinion in William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Co., 256 F.R.D. 134, 136 (S.D.N.Y. 2009). He repeated it again recently in his article Predictive Coding: Reading the Judicial Tea Leaves, (Law Tech. News, Oct. 17, 2011), that I wrote about in Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t. Despite these writings and many CLEs on the subjects, most of your less informed colleagues in the law still don’t know these things, much less litigants or the public at large. It would seem that Jason R. Baron’s dark vision of a future where no one can find anything is still a very real possibility.
The wake-up call on search has a long way to go before it is a shot heard round the world. I am reminded of that on almost a daily basis as I interact, usually indirectly, with opposing counsel in employment cases around the country. They often insist on antiquated search methods. So bear with me while I begin by repeating what you may have already heard before. I promise that the exposé of these more common secrets will also set the stage for revealing the seventh step of incompetence causality that I mentioned in last week’s blog, Tell Me Why?, and the one deep dark search secret that you probably have not heard before. Yes, the one is related to the other.
The First Secret: Keyword Search Is Remarkably Ineffective at Recall
First of all, and let me put this in very plain vernacular so that it will sink in, keyword search sucks. It does not work, that is, unless you consider a method that misses 80% of relevant evidence to be a successful method. Keyword search alone only catches about 20% of the relevant evidence in a large, complex data set, such as an email collection. Yes, it works on Google, it works on Lexis and Westlaw, but it sucks in the legal world of evidence gathering. It only provides reliable recall when used as part of a multimodal process that employs other search methods and quality controls, such as iterative testing, sampling, and adjustments. It fails miserably when used in the Go Fish context of blind guessing, which is the negotiated method still used by most lawyers today. I have written about this many times before and will not repeat it here again. See, e.g., Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search.
Keyword Search Still Has a Place in Best Practices
Keyword search still has a place at the table of Twenty-First Century search, but only when used as part of a multimodal search package with other search tools, and only when the multimodal search is used properly with iterative processes, real-time adjustments, testing, sampling, expert input and supervision, and other quality control procedures. For one very sophisticated example of what I mean, consider the following description by Recommind, Inc. of their patented Predictive Coding process that is embedded in their software review tool, Axcelerate. Their software uses highly advanced AI-guided search processes, but keywords are still one of the many search tools used in that process:
The Predictive Coding starts with a person knowledgeable about the matter, typically a lawyer, developing an understanding of the corpus while identifying a small number of documents that are representative of the category(ies) to be reviewed and coded (i.e., relevance, responsiveness, privilege, issue-relation). This case manager uses sophisticated search and analytical tools, including keyword, Boolean and concept search, concept grouping and more than 40 other automatically populated filters collectively referred to as Predictive Analytics™, to identify probative documents for each category to be reviewed and coded. The case manager then drops each small seed set of documents into its relevant category and starts the “training” process, whereby the system uses each seed set to identify and prioritize all substantively similar documents over the complete corpus. The case manager and review team (if any) then review and code all “computer suggested” documents to ensure their proper categorization and further calibrate the system. This iterative step is repeated … (emphasis added)
The final step in the process employs Predictive Sampling™ methodology to ensure the accuracy and completeness of the Predictive Coding process (i.e., precision and recall) within an acceptable error rate …
Sklar, Howard, Using Built-In Sampling to Overcome Defensibility Concerns with Computer-Expedited Review, Recommind DESI IV Position Paper. Here is the diagram that Recommind now uses to describe their overall process, which they were kind enough to give me permission to use:
Note that keyword search, including Boolean refinements, is used as part of the seed set generation step, which they call the first Predictive Analytics step in their multimodal process. By the way, as I will explain when I reveal the second search secret in a minute, that 95%-99% accuracy statement you see in their chart should be taken with a very large grain of salt. Still, aside from the dubious percentages claimed in this chart, the actual search methods and processes used are good. If you like videos and images to help this all sink in, check out Recommind’s pretty cool YouTube video that has a good, albeit over-simplistic, explanation of predictive coding:
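The iterative seed-set workflow quoted above can be sketched in miniature. This is my own toy illustration of the general train-review-retrain pattern, not Recommind’s actual algorithm: the word-overlap scoring, function names, and sample documents are all invented for the example.

```python
# Toy sketch of one round of an iterative "seed set" workflow, in the
# spirit of the predictive coding process quoted above. The scoring is a
# simple word-overlap similarity; real systems use far richer models.

def score(doc, seed_docs):
    """Rank a document by word overlap with the current seed set."""
    doc_words = set(doc.lower().split())
    seed_words = set()
    for s in seed_docs:
        seed_words |= set(s.lower().split())
    if not seed_words:
        return 0.0
    return len(doc_words & seed_words) / len(doc_words | seed_words)

def predictive_coding_round(corpus, seed_docs, batch_size=2):
    """Suggest the top-scoring unreviewed documents for human review."""
    unreviewed = [d for d in corpus if d not in seed_docs]
    ranked = sorted(unreviewed, key=lambda d: score(d, seed_docs), reverse=True)
    return ranked[:batch_size]

corpus = [
    "merger agreement draft attached for review",
    "lunch menu for friday",
    "revised merger agreement with new indemnity clause",
    "fantasy football picks",
]
seeds = ["merger agreement terms"]

# One training round: a human reviewer would code these suggestions, and
# the documents coded relevant would join the seed set for the next pass.
suggestions = predictive_coding_round(corpus, seeds)
```

In a real iterative process this loop repeats, with reviewer corrections recalibrating the model each round, until sampling shows acceptable precision and recall.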
Proof of the Inadequacies of Keyword Search When Not Used as Part of a Multimodal Process
Want scientific proof of the incompetence of keyword search alone when not used as part of a modern multimodal process? Look at the landmark research on Boolean search by information scientists David Blair and M.E. Maron in 1985. The study involved a 40,000-document case (350,000 pages). The lawyers, who were experts in keyword search, estimated that the Boolean searches they ran uncovered 75% of the relevant documents. In fact, they had only found 20%. Blair, David C., & Maron, M. E., An evaluation of retrieval effectiveness for a full-text document-retrieval system; Communications of the ACM Volume 28, Issue 3 (March 1985).
Delusion is a wonderful thing, is it not? We are confident our search terms uncovered 75% of the relevant evidence. Really? No one likes the fool who points out that the emperor is naked, especially the emperor and his tailors, who frequently pay all of the bills. Yet here I must go, where angels fear to tread. I must point out what science says.
Please join me in this Quixotic quest. Spread the word. Somebody has to do it. We must all continue to tell the unpopular truth, lest Baron’s dark vision of a future world come true. A world of injustice where relevant evidence is lost in ESI skyscrapers of junk, where cases are decided on false testimony and whim. We don’t want that world. We have worked way too hard over centuries to build our systems of justice to let a few billion terabytes of ESI destroy them. But destroy them they will, if we are complacent. Baron’s dystopian nightmares are real.
Want more recent scientific proof of the Emperor’s old clothes? See the research conducted by the National Institute of Standards and Technology TREC Legal Track. It has again confirmed that keyword search alone still finds only about 20%, on average, of relevant ESI in the search of a large data set. In the 2009 batch tests of negotiated keyword terms, the results were much worse. Hedlin, Tomlinson, Baron, Oard, 2009 TREC Legal Track Overview, TREC legal track at §3.10.9. The Boolean searches had a mean precision of 39%, but recall averaged less than 4%. Yes! You read that right. The negotiated keywords missed 96% of the relevant documents. Oopsie. I wonder how many times lawyers have done this in practice and never known it? We are confident our search terms uncovered 75% of the relevant evidence.
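The precision and recall figures reported above come from two simple ratios. A minimal sketch, with hypothetical counts chosen only to echo the 39% precision, sub-4% recall pattern:

```python
# Precision and recall, the two measures the TREC results above report.
# The counts below are hypothetical, chosen only to illustrate the math.

def precision(retrieved_relevant, retrieved_total):
    """Fraction of retrieved documents that are actually relevant."""
    return retrieved_relevant / retrieved_total

def recall(retrieved_relevant, relevant_total):
    """Fraction of all relevant documents that were retrieved."""
    return retrieved_relevant / relevant_total

# Suppose a keyword search pulls back 1,000 documents, 390 of which prove
# relevant, out of 10,000 relevant documents in the whole collection:
p = precision(390, 1000)   # 0.39 -- looks respectable on its face
r = recall(390, 10000)     # 0.039 -- about 96% of the evidence was missed
```

The point of the pairing is that a search can look good on precision while silently failing on recall, which is exactly the failure mode the batch tests exposed.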
Please note this awful 4% recall came out of what they called the batch tasks, where there were no subject matter experts, testing, or appeals. These safeguards were present only in the interactive tasks. The batch tasks are thus like my Go Fish scenario, where people simply guess keywords in the blind, and never test, sample, refine and iterate.
The same research also shows that alternative multimodal methods do much better. They still use some keyword-based search tools, but also use predictive coding and other artificial intelligence algorithms with seed-set iteration and sampling methodologies. I wrote about these new methods in Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t and before that in The Information Explosion and a Great Article by Grossman and Cormack on Legal Search.
Want still more recent proof? The final report on the 2010 TREC tests has not been completed, but many participants’ reports are final. I have done some deep digging and read most of them, and the draft summary report, in order to try to bring to you the latest evidence on search. See http://trec-legal.umiacs.umd.edu/. The 2010 tests once again confirm our little secret on the absurd ineffectiveness of keyword search alone. The confirmation comes inadvertently from the tests done by a fine team of information science graduate students from the Indian Statistical Institute, Kolkata, in West Bengal, India. They participated in the 2010 TREC Legal Interactive task in Topic 301 and Topic 302. (Yes, science is very international, including information science and TREC Legal Track.) They performed what proved to be an interesting (to me) experiment, although for reasons other than what they intended.
The Indian Statistical Institute had an AI predictive software coding tool using clustering techniques that they wanted to test. But the software could not handle the high volumes of email involved in the 2010 test: 685,592 items. So they had no choice but to cull down the amount of email somewhat before they could use their software. For that reason they decided to use keywords to cull down the corpus (a term information scientists love to use) before running their AI clustering software. Here is their own description of the process:
We attempted to apply DFR-BM25 ranking model on the TREC legal corpus. We chose Terrier 3.0, as this toolkit has most of the IR methods implemented within. But as we received the TREC legal data set, we realized that it would be difficult to manage such a large volume of data. So we decided to reduce the corpus size by Boolean retrieval. We chose Lemur 4.11, as it supports various useful Boolean query operators which would suit our purpose. On the set obtained from Boolean retrieval we decided to apply ranked retrieval techniques. … The use of Boolean retrieval has the disadvantage that it will limit further search to the documents retrieved at this stage and have an adverse effect on our recall performance. But it would scale down the huge corpus size considerably (see Table 1) and enable us to perform our experiments on a smaller set which would reduce processing time.
That use of keyword Boolean as an upfront filter turns out to have been a mistake, at least insofar as any quest for good recall was concerned. Who knows, maybe they thought their keywords would be better than the lawyer-derived keywords in the famous Blair Maron study. I see this kind of mistake made by opposing counsel all of the time. We are confident our search terms will uncover 75% of the relevant evidence. They think their keywords are so good that they could not possibly miss 80% of all relevant documents in the corpus. They have an almost superstitious belief in the magical power of keywords, and think that their Boolean is better than your Boolean. Hogwash! All keyword search sucks, no matter who you are or how many lawsuits you’ve won or Google sites you’ve found.
The computer algorithms used in the 1985 Blair Maron test are essentially the same as those used today for keyword search. Keyword search is pretty simple index matching stuff. Antiquated software, really. It works fine in academic settings with artificially controlled data sets or organized databases, but it does not survive contact with the real world, where words and symbols are chaotic and vague, just like the people who create them. In real world email collections the meaning of documents is hidden in subtle, and not-so-subtle, word and phrase variations, misspellings, abbreviations, slang, obscurity, etc. In reality, when large data sets are involved, no human is smart enough to guess the right keywords.
Getting back to the 2010 TREC study, in topic 301 the use of Boolean retrieval allowed the scientists from India to reduce the initial corpus from 685,592 to 2,715. Then they ran their sophisticated software on the whittled down corpus. The final metrics must have been disappointing. The TREC judges found that their precision in topic 301 was pretty good. It was 87% (meaning 87% of the items retrieved were determined to be relevant after an appeal process). But their recall was simply terrible, only 3% (meaning their method failed to retrieve an estimated 97% of the relevant documents in the original 685,592 collection). Random guessing might have done as well in the recall department, maybe even in the F1 measure (the harmonic mean of precision and recall).
In their other interactive task, Topic 302, the results were comparable. They attained a precision rate of 69% and a recall rate of 9%. Again, this means they found only 9% of the relevant documents, leaving the other 91% on the table.
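The F1 measure mentioned above combines the two numbers into one score. Because it is a harmonic mean, high precision cannot mask terrible recall, which is why the Indian team’s results score so poorly despite decent precision:

```python
# The F1 measure: the harmonic mean of precision and recall. It punishes
# imbalance, so 87% precision cannot rescue 3% recall.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The Topic 301 and Topic 302 results reported above:
topic_301 = f1(0.87, 0.03)   # roughly 0.058
topic_302 = f1(0.69, 0.09)   # roughly 0.159
```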
The Second Search Secret (Known Only to a Few): The Gold Standard to Measure Review is Really Made Out of Lead
The so-called gold standard used to judge recall and precision rates in information science studies is human review. This brings up an even more important secret of search, a subtle secret known only to a few. Experiments in TREC conducted well before the legal track even began showed that we humans are very poor at making relevancy determinations in large data sets. This is a very inconvenient truth because it puts all precision and recall measurements in doubt. It means that the recall and precision measures we use are more like rough estimates than calculations. It may be that the measurements can be improved by expensive remedial, three-pass expert human reviews, and other methods, but even that has yet to be proven. But see Cormack, Grossman, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? (2011) (Humans can agree and create a gold standard if relevance is defined clearly enough to reviewers and if objective mistakes by reviewers, as opposed to subjective disagreements, are identified and corrected).
This secret of human inadequacy and resulting measurement vagaries in large data-set reviews has been known in the information science world since at least 2000. I understand from inquiring of Doug Oard, a well-known information scientist and one of the TREC Legal Track founders, that the problem of the “fuzziness” of relevance judgments remains an important and ongoing discussion among scientists. Apparently the “fuzziness” issue is far less of a problem when simply trying to compare one system with another, and determine which one is better, than it is when trying to report a correct (“absolute”) value for some quantity such as recall or precision. I corresponded with Doug Oard on this issue and he advised me that:
The Legal Track of TREC has generated quite a lot of attention to the problem of absolute evaluation simply because the law, properly, has a need for that information. But the law also has a need for relative evaluation (which can help to answer questions like “Did you use the best available approach under the circumstances”), and “fuzziness” is well-known to have only limited effects on such relative comparisons.
So even though our measurements are too fuzzy to ever really say with any assurance that there is 95%-99% accuracy, it can tell us how one method compares with another. For instance, we can know that keyword search sucks when compared with multimodal. We just cannot know exactly how well either of them do.
The fuzziness of recall measurements may explain the wide divergences in measurements of search effectiveness. For instance, it could explain how the 2009 batch tests of keywords only measured a remarkably low 4% recall rate. 2009 TREC Legal Track Overview, TREC legal track at §3.10.9. It may have been better than that, more in line with the usual 20% recall rates that other experiments have shown, but we do not really know because the gold standard measurements can fluctuate wildly. Again this is all because average one-pass human review is known to be unreliable.
The fuzziness issue is one of several important topics addressed in an interesting paper written this year by a young information scientist, William Webber, entitled Re-examining the Effectiveness of Manual Review. Webber is an Australian now doing his post-doctoral work with Professor Oard. His paper arose out of an e-discovery search conference held this year in China, of all places, the SIGIR 2011 Information Retrieval for E-Discovery (SIRE) Workshop, July 28, 2011, Beijing, China. You may have heard about this event from some of its other attendees, including Jason R. Baron, Patrick Oot, Jonathan Redgrave, Conor Crowley, Bill Butterfield, Doug Oard, and David Lewis. Anyway, Webber in his China paper explains:
It is well known that human assessors frequently disagree on the relevance of a document to a topic. Voorhees found that experienced TREC assessors, albeit working from only sentence-length topic descriptions, had an average overlap (size of intersection divided by size of union) of between 40% and 50% on the documents they judged to be relevant. Voorhees concludes that 65% recall at 65% precision is the best retrieval effectiveness achievable, given the inherent uncertainty in human judgments of relevance. Bailey, et al., survey other studies giving similar levels of inter-assessor agreement.
Can anyone validly claim absolute recall or precision rates higher than 65% in large data set reviews when the determinations are made by single pass human review? Apparently not. Maybe double or triple pass review can create a true gold standard. I know that is what TREC is now striving for, using sampling and an appeals process in the experiments since 2009. But has that been proven? I don’t think so. At least that is my impression after reading Webber’s work.
Webber’s China paper goes on to explain the well-known study by Roitblat, Kershaw, and Oot, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.
For their study, the authors revisit the outcome of an earlier, in-house manual review. The original review surveyed a corpus of 2.3 million documents in response to a regulatory request, and produced 176,440 as responsive to the request; the process took four months and cost almost $14 million. Roitblat, et al., had two automated systems and two manual review teams review the documents again for relevance to the original request. The automated systems worked on the entire corpus; the manual review teams looked at a sample of 5,000 documents.
Roitblat, et al., (Table 1) found that the overlap between the relevance sets of the two manual teams was only 28%, even lower than the 40% to 50% observed in Voorhees  for TREC AdHoc assessors. The overlap between the new and the original productions was also low, 16% for each of the manual teams, and 21% and 23% for the automatic systems. …
The effectiveness scores calculated on the original production seemingly show that the automated systems are as reliable as the manual reviewers. However, as Roitblat, et al., note, the original production is a questionable gold standard, since it likely is subject to the same variability in human assessment that the study itself demonstrates. Instead, the claim Roitblat, et al., make for automated review is a more cautious one, namely, that two manual reviews are no more likely to produce results consistent with each other than an automated review is with either of them.
Given the remarkably low level of agreement observed by Roitblat, et al., their conclusion might seem a less than reassuring one; an attorney might ask not which of these methods is superior, but is either of these methods acceptable? More importantly, the study does not address the attorney’s fundamental question: does automated or does manual review result in a production that more reliably meets the overseeing attorney’s conception of relevance?
Think about that. Lawyers are on average even worse than non-lawyers at making relevancy determinations. We only agree 28% of the time, compared to earlier non-lawyer tests noted by Voorhees showing 40% agreement rates. The 40% agreement rates showed that the best retrieval effectiveness achievable, given the inherent uncertainty in human judgments of relevance, was only 65% recall and 65% precision. See Ellen M. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36:5 Information Processing & Management 697, 701 (2000). I wondered what the even lower 28% agreement rate found in the Roitblat, et al., study meant. In private correspondence with Webber to prepare this essay, he advised me that a 28% agreement rate produces a mean precision and recall rate of 44%.
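Webber’s 44% figure can be reproduced from the overlap definition itself. If two assessors flag equal-sized relevant sets with Jaccard overlap J, and one set is treated as the gold standard for scoring the other, then precision and recall both work out to 2J / (1 + J). This derivation is my own reconstruction, offered as a plausibility check rather than as Webber’s actual method:

```python
# Voorhees' overlap is Jaccard similarity: |A & B| / |A | B|. Assuming two
# equal-sized relevant sets with overlap J, scoring one against the other
# as "gold" yields precision = recall = 2J / (1 + J). My reconstruction,
# not taken from Webber's paper.

def overlap(a, b):
    """Jaccard overlap between two sets of document IDs."""
    return len(a & b) / len(a | b)

def agreement_to_recall(j):
    """Precision/recall implied by overlap j, assuming equal-sized sets."""
    return 2 * j / (1 + j)

roitblat = agreement_to_recall(0.28)   # about 0.44, matching Webber's figure
voorhees = agreement_to_recall(0.48)   # about 0.65, matching Voorhees' ceiling
```

That both published numbers fall out of the same simple formula is at least consistent with the overlap-based reading of these studies.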
It seems to me as if Webber and Voorhees are saying that, on average, the best that lawyers can ever do using the so-called gold standard of human review for measurement is something like 44% to 65% recall. Any measurements higher than that are suspect because the gold standard itself is suspect. I think Webber, Voorhees, and others are saying that the human relevancy determinations lens we are using to study these processes is too fuzzy, too out of focus, to give us any real confidence in exactly what we are seeing, but the fuzzy lens does allow us to compare one method against another.
The Triple Pass Solution
Although I do not understand the math on the fuzziness issue, I understand it in an intuitive way from over thirty years of arguing with other attorneys and judges over relevancy. I also know from the thousands of vague requests for production I have read and tried to respond to. In the law we use a kind of triple pass quality control method based on disagreements of experts. The triple-pass method has evolved in the common law tradition over the past few centuries. We never simply rely on one tired lawyer’s opinion. One lawyer expresses their view on relevance. Then another lawyer, opposing counsel, uses their independent judgment to either agree or disagree, and, if they disagree, to object. A third expert, a judge, then hears argument from both sides and makes a final determination. Without such triple expert input and review, the determination of the legal relevance of evidence in legal proceedings would also be unreliable.
TREC has been trying to use such a triple pass method since 2009 to buttress the accuracy of its findings. The first reviewers make their determinations, then the participants make theirs. If the participants disagree, then the participants can ask for a ruling from the subject matter expert who had been guiding the participants with up to ten hours of consults. The first review team has no such appeal rights and far less guidance. Also, the first pass reviewers cannot present their side of the arguments to the judge. Not surprisingly under these conditions, if and when the participants appeal, the reports show that the expert judges usually rule with the participants. They have, after all, had ongoing ex parte communications with them and don’t hear from the other side. Not exactly the same triple play as in the real world of American justice, but it is far better than the flawed single human review that Voorhees initially studied. Moreover, it is improving each year as TREC’s experiments are refined. To get closer to real world practice would require a lot more money for the experiments.
In my view, the inherent fuzziness (or not) of human relevance capacities is a significant problem that needs a lot of further study. Think of the implications on our current legal practice. (Hint – This has something to do with the seventh insight into trial lawyer resistance, as I will explain in Part II of Secrets of Search.)
Not Too Fuzzy To Allow Valid Comparisons
Although the measures are fuzzy, they are not too fuzzy to make comparisons between reviews. So, for instance, you can compare two human reviews and use the differences to show just how vague and inaccurate human review really is. This would be a comparison to establish the fuzziness of the gold standard you use to make recall, precision, and other measurements.
The study by Roitblat, et al., sponsored by the Electronic Discovery Institute (EDI) did just that. It proved the incredible inconsistencies of single pass human review in large data sets. This study examined a real world event where Verizon paid $14 million for contract reviewers to review 2.3 million documents in four months. (This is, by the way, a cost of $6.09 per document for review and logging only, a pretty good price for those days.) A second review by other reviewers commissioned by the study only agreed with 16% of the first determinations. Yes, there was only a 16% agreement rate. Incredible. Does that not suggest likely error rates of 84%?!
Surely this study by EDI is the death-blow to large-scale human reviews that are not in some way computer assisted to at least cull out documents before review. Why should anyone spend $14 million for such a poor quality product after seeing this study? (Yet, I’m told they still do this in the world of mergers and acquisitions and second reviews.) This is especially true when you consider that machine-assisted review is much faster and less expensive. Further, as the studies also show, the computer-assisted review is at least as reliable as most of the human reviewers (But maybe not all, as will be explained; that is yet another search secret).
With these limitations of human review and measurements in mind, consider the paper by Maura R. Grossman and Gordon V. Cormack, which analyzed the 2009 TREC legal track studies on this issue. Technology-Assisted Review in E-Discovery can be More Effective and More Efficient than Exhaustive Manual Review. Richmond Journal of Law and Technology, 17(3):11:1–48, 2011. I have written about their paper before: The Information Explosion and a Great Article by Grossman and Cormack on Legal Search. Grossman and Cormack found that:
[T]he levels of performance achieved by two technology-assisted processes exceed those that would have been achieved by the official TREC assessors – law students and lawyers employed by professional review companies – had they conducted a manual review of the entire document collection.
Id. at 4. This was good research and a great paper, as I’ve noted before, but the gold standard was again just human reviewers and so subject to the vagaries of fuzzy measurement when it comes to calculating absolute values. TREC is working on this issue with their appeals process, but due to economic constraints, it still differs from actual practice in several ways. The first reviewers have relatively limited upfront instruction and training on the relevance issues, only limited contact with subject matter experts during the review, no testing or sampling feedback, and no appeal rights.
Also, the human review in TREC 2009 did not meet minimum ethical standards of supervision established by most state Bar associations that have considered the propriety of delegated review to contract lawyers. Most Bar associations require direct supervision of contract lawyers by counsel of record, and, in my opinion, that requires direct, ongoing contact to supervise. Aside from the supervision issue, the statistics were skewed by a one-sided appeals process where the judge only heard one side of a relevancy argument from the party they trained in relevance. It reminds me of a secret for getting an “A” in law school from some professors: just tell them what you think they want to hear, not what you really think.
For that reason, the win observed by Grossman and Cormack may not say as much about technology as it does about methodologies. Also, the paper focuses on the two technology-assisted processes that were better. What about the other technology-assisted processes that were not better?
Aside from these methodology concerns, as Webber points out, none of the studies so far, by TREC or anyone else, have addressed the key issue of concern to lawyers:
… which is not how much different review methods agreed or disagreed with each other (as in the study by Roitblat, et al.), nor even how close automated or manual review methods turn out to have come to the topic authority’s gold standard (as in the study by Grossman and Cormack). Rather, it is this: Which method can a supervising attorney, actively involved in the process of production, most reliably employ to achieve their overriding goal, to create a production consistent with their conception of relevance? (emphasis added)
Let’s Spend the Money Necessary to Turn Lead Into Gold
How can the studies and scientific research ever give us an answer to Webber’s question, to our question, if the measuring device, the gold standard, is too fuzzy to make absolute measurements, just comparative ones? It seems to me the solution is a series of multimillion dollar scientific experiments, instead of the shoe-string-budget type projects we have had so far. We need experiments where the three-pass gold standard developed by the law is employed, where time-consuming quality controls are applied to both automated and manual reviews, and to various types of combined multimodal methods. We need to transform our lead standard into a bona fide gold standard. Yes, that means expensive relevancy determinations made by three-expert, triple-pass, statistically checked, state of the art reviews. But we have to go for the gold. We need absolute measurements we can trust and bank on to do justice.
These kinds of scientific experiments will be expensive, but I think we should do them. Gold for gold. It is worth it. After all, billions of dollars in fees are spent each year on e-discovery review. Trillions of dollars more ride on the outcome of litigation. What if a method we employ does not work as well as we think, and a key privileged document is overlooked? It could be game over. What if we end up looking at way too many of the wrong documents? How much money is lost already each year doing that? What if the 50% recall measurement you made is rejected by the court as too low, when in fact it was a 95% recall rate? What if the 95% recall measurement is really just 50%? With these constraints on measurements, how much recall should be considered legally sufficient? Should these measurements be used at all? Or should we just use methods that compare well with others, that use best practices, and not try to quantify precision and recall?
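Even setting aside the fuzzy gold standard, a recall figure estimated from a sample carries its own statistical uncertainty. One standard way to quantify it is a confidence interval for a proportion; the Wilson score interval below is a common choice. The sample numbers are hypothetical, chosen to show how wide even a 400-document sample leaves the estimate:

```python
import math

# A Wilson score confidence interval for a proportion, a standard way to
# express the uncertainty of a sampled recall estimate. Sample counts
# below are hypothetical.

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval (z=1.96) for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - margin, center + margin

# Suppose 200 of a 400-document sample of known-relevant items turn out to
# have been retrieved: a 50% point estimate, but a roughly 45%-55% range.
low, high = wilson_interval(200, 400)
```

And that interval assumes the sample labels themselves are correct, which the human-review studies above call into question.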
We need to really know what we are doing. We cannot just be alchemists playing with quicksilver. We need real science to verify exactly how accurate our methods are, not just compare them. We need to know more than comparative values. We need absolute measures. Heisenberg be damned, we need certainty in the law, or at least a lot more of it than we have now. Sure, we know that computer-assisted review is faster, cheaper, and at least as good as average human review. But what recall rates do any of them really achieve? Sure, we know that keyword search sucks, that multimodal is comparatively much better. But how much better? Is the true rate of recall for keywords 20% or 4%, or is it 44%? What is the true rate of recall for our top multimodal search techniques today, the ones like Recommind’s that use keywords, Predictive Coding, and a variety of other tools and methods? Is it 97%, or is it 44% or less? We need hard numbers, not just comparisons.
Law and IT alone cannot give us the answers. The e-discovery team also needs scientists. We need to know what kind of recall rates and precision rates we are capable of measuring with a confidence level in the 90s, not just 44% to 65%. Is plus or minus 44% recall really the best anyone can hope for? Is the confidence level such that a measure of 44% recall might actually be much higher, might actually be 98%? And vice versa? Are we just kidding ourselves with all of the recall measures we now have? Apparently so. All we can tell for sure right now is which method is better than another. That is not enough for the law. We need much more certainty than that.
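To make the sampling side of this uncertainty concrete, here is a minimal, hypothetical Python sketch (not from the original post, and not anyone’s actual protocol) computing a 95% Wilson score confidence interval for a recall estimate. The sample size and hit count are invented for illustration. It shows only how wide the error bars are from sampling alone; it says nothing about the deeper problem of an unreliable gold standard, which no sample size can fix.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical scenario: a reviewer samples 50 documents known to be
# relevant and finds that the production retrieved 22 of them,
# a measured recall of 44%.
lo, hi = wilson_interval(22, 50)
print(f"measured recall 44%, 95% CI roughly {lo:.0%} to {hi:.0%}")
```

Even this tidy 44% figure carries an interval roughly 27 points wide, from about 31% to about 58%, so a court shown "44% recall" from a 50-document sample is really being shown a broad range, before gold-standard error is even considered.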
The secret is now out and we have to address it. We have to talk about it. We have to perform experiments and subject them to peer review. I personally think the law’s triple-pass methods with the latest quality control techniques will produce significantly higher rates of agreement, maybe even in the 90s, but who actually knows until we pay for the experiments?
I think the research that TREC and EDI have done to date is a good start, but not the final word by any means. We need many more open scientific experiments. The testing must be improved and several more groups should join in. Our major information science universities worldwide should join in. So should the National Science Foundation and other charitable organizations. So, too, should the big companies that can afford to finance pure research. How about Google? IBM? Microsoft? EMC? HP? Xerox? How about your company? Every e-discovery company should have some skin in this game.
The budgets of the testing organizations need to be ramped up for all of these experiments. We need gold to make measurements with a true gold standard, to give us real answers, not just qualified comparisons. I will make a donation and participate in fundraisers for that kind of scientific research. Will you? Will your company or firm join in?
There is still more to the insights contained in Webber’s research in Re-examining the Effectiveness of Manual Review. But this week’s essay is already too long, so that, my friends, will have to wait again for next week. Webber’s work and the discussion so far set the stage for an even deeper and darker secret of search, the one that ties into the seventh insight into lawyer resistance to e-discovery. That will come at the conclusion of next week’s blog, Secrets of Search – Part Two.
Originally posted on e-Discovery Team® blog by Ralph Losey – a blog on the team approach to electronic discovery combining the talents of Law, IT, and Science. The views expressed are his own, and not necessarily those of his law firm, clients, or University. Copyright Ralph Losey 2006-2012.
He also leads an online e-Discovery Team Training course that has become very successful over the last couple of years.