Pyrrho – why it’s time to start learning about predictive coding

In Pyrrho, the High Court approved the use of “predictive coding” (to facilitate the technology/computer assisted review of documents) in the disclosure process.

While judicial approval was not actually required – and Pyrrho is not the first case in which the technology has been used for e-disclosure in England and Wales – the judgment will serve as a useful guide and precedent to litigants considering the use of this technology in future cases.

But what exactly is predictive coding and why did the Court approve its use?

What’s in a (key)word?

Most e-disclosure is currently conducted by searching digital documents for keywords and then manually reviewing those that contain them. This is inherently time-intensive (and hence expensive).

A particular challenge of this approach is to strike a balance between excluding potentially relevant documents (using too few keywords or keywords that are too narrow) and wasting time and/or money reviewing too many irrelevant documents (using too many keywords or keywords that are too wide).
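To make that trade-off concrete, here is a minimal sketch in Python. The documents, their relevance labels and the keyword lists are all invented for illustration and have nothing to do with the Pyrrho data set. It measures two things for a given keyword list: precision (the share of retrieved documents that are actually relevant) and recall (the share of relevant documents that were retrieved).

```python
import re

def matches(text, keywords):
    """Binary keyword test: True if any keyword appears as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(k in words for k in keywords)

def precision_recall(docs, keywords):
    """docs is a list of (text, is_relevant) pairs labelled by a human reviewer."""
    hits = [rel for text, rel in docs if matches(text, keywords)]
    total_relevant = sum(1 for _, rel in docs if rel)
    precision = sum(hits) / len(hits) if hits else 0.0
    recall = sum(hits) / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical, hand-labelled documents (invented for this example).
DOCS = [
    ("payment to the director was approved by the board", True),
    ("the director authorised a transfer of funds", True),
    ("invoice for the transfer attached", True),
    ("board lunch menu for friday", False),
    ("payment reminder for the office gym membership", False),
    ("car park closed on friday", False),
]

# A narrow list retrieves only relevant documents but misses one of them.
narrow = precision_recall(DOCS, {"director"})
# A wide list finds every relevant document but drags in irrelevant ones.
wide = precision_recall(DOCS, {"director", "payment", "transfer", "board"})

print("narrow (precision, recall):", narrow)
print("wide   (precision, recall):", wide)
```

On this toy data the narrow list leaves a relevant document unreviewed, while the wide list sends two irrelevant documents to manual review. In a real exercise the labels are not known in advance, which is precisely why the balance is hard to strike.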

Predictive coding

Surfacing potentially relevant documents based on keywords results in a binary output. Either the document contains one or more keywords – or it does not. Actual relevance is not determined until the resultant batch of documents has been manually reviewed.

By contrast, predictive coding seeks to determine the likely relevance of each document, thus automating much, if not most, of the review process itself.

The central part of the process is known as “machine learning” and essentially involves 3 steps:

  1. The parties identify a “seed set”

A batch of documents is selected to form a “seed set”. The parties are free to agree selection criteria (which may include some use of keywords as identifiers) but the process would work equally well with a randomly generated seed set.

There are no prescribed parameters around what percentage of the total data set should be included. In Pyrrho, the parties estimated that the agreed seed set would comprise 1,600 to 1,800 documents (from over 3 million in total).

  2. An experienced lawyer uses the “seed set” to train the software

The system analyses the characteristics of the documents in the seed set and proceeds to present them to the lawyer conducting the training. The lawyer is asked to indicate whether each document is relevant or not.

With each such decision, the system builds an increasingly accurate model of which characteristics of a document in the seed set lead the lawyer to categorise it as relevant.

  3. The system applies what it has learned to the entire data set

Having built a model which predicts the likely relevance of any given document in the seed set based on the lawyer’s decisions, the system applies this model to each document in the whole document set. It is then able to rank all of the documents in order of likely relevance.
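The three steps above can be sketched in a few lines of Python. This is an illustration only: the judgment and this article do not describe the algorithm inside the software used in Pyrrho, so a simple word-count (naive Bayes) model is swapped in here as an assumption, and the seed set, the lawyer's decisions and the unreviewed documents are all invented.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(seed_set):
    """Step 2: learn word counts per class from the lawyer's decisions.

    seed_set is a list of (text, is_relevant) pairs.
    """
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, label in seed_set:
        counts[label].update(tokens(text))
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab

def score(model, text):
    """Log-odds that a document is relevant (higher = more likely relevant)."""
    counts, n_docs, vocab = model
    totals = {c: sum(counts[c].values()) for c in (True, False)}
    # Prior log-odds, with add-one smoothing.
    log_odds = math.log((n_docs[True] + 1) / (n_docs[False] + 1))
    for word in tokens(text):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_rel = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_irr = (counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_rel / p_irr)
    return log_odds

def rank(model, documents):
    """Step 3: order the whole document set by likely relevance."""
    return sorted(documents, key=lambda d: score(model, d), reverse=True)

# Step 1: a hypothetical seed set with the lawyer's relevance decisions.
SEED_SET = [
    ("director approved the payment", True),
    ("transfer of funds to the director", True),
    ("friday lunch menu", False),
    ("car park closed friday", False),
]
# Hypothetical documents that nobody has reviewed yet.
UNREVIEWED = [
    "lunch on friday is cancelled",
    "funds transfer approved",
    "meeting room booking",
]

model = train(SEED_SET)
ranked = rank(model, UNREVIEWED)
for doc in ranked:
    print(round(score(model, doc), 2), doc)
```

Unlike a keyword search, the output is a graded score for each document rather than a yes/no match, so the review team can work down the ranking from the top. A document sharing no words with the seed set scores zero here, meaning the model has no evidence either way.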

Why did the High Court approve the use of predictive coding in Pyrrho?

In recognition of the relative novelty of using predictive coding for e-disclosure in this jurisdiction, Master Matthews set out no fewer than ten factors which, in his view, favoured its use in this case.

These included:

  • Evidence from other jurisdictions appears to show that predictive coding has been “useful” in appropriate cases and (at least) as accurate as manual review (whether or not combined with keyword searches).
  • Using technology to apply the approach of a single lawyer (ie the “trainer”) should produce a more consistent result than having a number (perhaps hundreds) of more junior fee-earners seeking to apply the relevant criteria independently.
  • The cost of a full manual review of each document would have been “unreasonable” given the volume of documents (over three million) and the existence of a suitable lower-cost automated alternative.
  • The parties had agreed on the use of the software, and also how to use it, subject only to the Court’s approval.

It was also noted that there were “no factors of any weight pointing in the opposite direction”. As ever, the question of “whether it would be right for approval to be given in other cases will, of course, depend upon the particular circumstances obtaining in them” (see para 34 of the judgment).

Challenges ahead?

In this case, the parties had agreed to use predictive coding and the Court saw clear benefits in their doing so. In a case where the parties disagree on whether predictive coding should be used and/or on the exact mechanics of its use, the Court will have a tougher challenge to face.

Machine learning: beyond e-disclosure

Take the machine learning process used in Pyrrho, which involved three million documents. In 2013, LexisNexis was faced with the task of creating and applying a new taxonomy to all eight million documents held in its database of legal documents. While the project took two years to complete, we now have an AI-based system capable of performing the exercise in just hours. You can read about it here.

Over to you…

The next time you encounter a problem or bottleneck in your business, ask yourself, “Could this be solved or improved with AI?”

If you think the answer might be yes, our platform innovation and product development teams are always up for a challenge.

Previous (free) workshops have ranged from helping to break down big problems into manageable technology chunks, all the way to inspiring proofs of concept for entirely new solutions to challenges that may be shared by your own business or even the entire industry!

If you would like to discuss anything in this article or would like to find out more about a workshop with our platform innovation team, please contact Alex Smith.

