Predictive Coding, when correctly employed, can significantly reduce legal review costs with generally more accurate results than other traditional legal review processes. However, the benefits associated with predictive coding are often undercut by the over-collection and over-inclusion of Electronically Stored Information (ESI) into the predictive coding process. This is problematic for two reasons.
The first reason is obvious, the more data introduced into the process, the higher the cost and burden. Some practitioners believe it is necessary to over-collect and subsequently over-include ESI to allow the predictive coding process to sort everything out. Many service providers charge by volume, so there can be economic incentives that conflict with what is best for the end-client. In some cases, the significant cost savings realized through predictive coding are erased by eDiscovery costs associated with overly aggressive ESI inclusion on the front end.
The second reason why ESI over-inclusion is detrimental is less obvious, and in fact counter intuitive to many. Some discovery practitioners believe as much data as possible needs to be put through the predictive coding process in order to “better train” the machine learning algorithms. However this is contrary to what is actually true. The predictive coding process is much more effective when the initial set of data has a higher richness (also referred to as “prevalence”) ratio. In other words, the higher the rate of responsive data in the initial data set, the better. It has always been understood that document culling is very important to successful, economical document review, and that includes predictive coding.
Robert Keeling, a senior partner at Sidley Austin and the co-chair of the firm’s eDiscovery Task Force, is a widely recognized legal expert in the areas of predictive coding and technology assisted review. At Legal Tech New York earlier this year, he presented at an Emerging Technology Session: “Predictive Coding: Deconstructing the Secret Sauce,” where he and his colleagues reported on a comprehensive study of various technical parameters that affect the outcome of a predictive coding effort. According to Robert, the study revealed many important findings, one of them being that a data set with a relatively high richness ratio prior to being ingested into the predictive coding process was an important success factor.
To be sure, the volume of ESI is growing exponentially and will only continue to do so. The costs associated with collecting, processing, reviewing, and producing documents in litigation are the source of considerable pain for litigants. The only way to reduce that pain to its minimum is to use all tools available in all appropriate circumstances within the bounds of reasonableness and proportionality to control the volumes of data that enter the discovery pipeline, including predictive coding.
Ideally, an effective early case assessment (ECA) capability can enable counsel to set reasonable discovery limits and ultimately process, host, review and produce less ESI. Counsel can further use ECA to gather key information, develop a litigation budget, and better manage litigation deadlines. ECA also can foster cooperation and proportionality in discovery by informing the parties early in the process about where relevant ESI is located and what ESI is significant to the case. And with such benefits also comes a much more improved predictive coding process.
X1 Distributed Discovery (X1DD) uniquely fulfills this requirement with its ability to perform pre-collection early case assessment, instead of ECA after the costly, time consuming and disruptive collection phase, thereby providing a game-changing new approach to the traditional eDiscovery model. X1DD enables enterprises to quickly and easily search across thousands of distributed endpoints from a central location. This allows organizations to easily perform unified complex searches across content, metadata, or both and obtain full results in minutes, enabling true pre-collection ECA with live keyword analysis and distributed processing and collection in parallel at the custodian level. To be sure, this dramatically shortens the identification/collection process by weeks if not months, curtails processing and review costs from not over-collecting data, and provides confidence to the legal team with a highly transparent, consistent and systemized process. And now we know of another key benefit of an effective ECA process: much more accurate predictive coding.