Introduction to Legal Analytics, Part 2: The Black Swan Problem

Before the advent of legal analytics, causal inference was the most powerful tool available to researchers. Our job is to separate causal inference from analytics in order to reach a deeper understanding of our data, as we discussed in last week’s blog post. This week, we look at what it takes to create a powerful predictive model. The first step is to recognize the common pitfalls.

The Black Swan Problem

Modern technology is developing rapidly, and the amount of raw data in the world grows every day. With so much data, ordinary assumptions about how the world works are not enough. Conventional logic can be imprecise, and models built on past experience may be on shaky ground. The Black Swan Problem is a cautionary tale about exactly this risk.

The Black Swan Problem states that even if we repeatedly observe only white swans in nature, this does not mean that all swans are white. Essentially, it is a mistake to presuppose that a sequence of events in the future will occur as it always has in the past. There are “unknown unknowns” within any system, and though they are by definition unpredictable, they must be allowed for.
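One way to make this concrete is Laplace's rule of succession, a classical estimate that is not part of the original discussion and is used here only as an illustration. Assuming a uniform prior over the unknown share of non-white swans, the estimated chance that the next swan breaks the pattern shrinks as observations accumulate but never reaches zero. The function name below is hypothetical.

```python
from fractions import Fraction


def prob_next_is_exception(n_observed: int, n_exceptions: int = 0) -> Fraction:
    """Estimate the chance that the next observation is an exception.

    Uses Laplace's rule of succession, which assumes a uniform prior
    over the true exception rate (an assumption made for this sketch).
    """
    if n_observed < 0 or not 0 <= n_exceptions <= n_observed:
        raise ValueError("need 0 <= n_exceptions <= n_observed")
    return Fraction(n_exceptions + 1, n_observed + 2)


# After 1,000 white swans and no black ones, the estimate is small but not zero.
after_thousand = prob_next_is_exception(1000)
```

However many white swans are counted, the estimate stays above zero, which is the point of the cautionary tale: a long record of agreement lowers the odds of an exception without ruling it out.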

The Problem of Induction

David Hume was one of the first philosophers to recognize the limits of inductive reasoning, the weakness that the Black Swan Problem illustrates. He questioned the natural human tendency to put faith in events that we assume from past experience to be causally related. We cannot take for granted that nature is uniform, and it cannot be predicted so easily. A “Black Swan” is a rare outlier, and its power comes from this rarity. Our minds are wired to search for causation in all things. If analytics is to do more than model data that already exists, we have to dispense with the notion of a data environment that the human mind can master unaided.


One of our biggest challenges is anticipating and adjusting to a Black Swan’s impact on a system. To build a superior predictive model, we must A) consider the potential for outliers to occur, and B) do our best to incorporate that potential into our model, so that C) we can respond effectively to unforeseeable shifts – both gradual and dramatic – within a system.
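As a rough illustration of step A, the sketch below flags extreme values in a small data set using a modified z-score based on the median. This is a simple statistical stand-in chosen for this example, not the approach this series goes on to describe. The function name, the case durations, and the conventional 0.6745 scaling and 3.5 cutoff are all assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Mark values whose modified z-score exceeds the threshold.

    The score measures distance from the median in units of the median
    absolute deviation (MAD), so one extreme value cannot hide itself
    by inflating the average.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, so flag nothing.
        return [False] * len(values)
    return [abs(0.6745 * (v - center) / mad) > threshold for v in values]


# Hypothetical case durations in days (illustrative numbers, not real data).
durations = [30, 32, 29, 31, 33, 30, 400]
flags = flag_outliers(durations)

# Keep the rare cases in a separate list instead of deleting them.
typical = [d for d, is_rare in zip(durations, flags) if not is_rare]
rare = [d for d, is_rare in zip(durations, flags) if is_rare]
```

Keeping the flagged cases in their own list, rather than discarding them, is a nod to steps B and C: the rare event stays visible to whatever model is built next.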

We accomplish these goals using a method called Machine Learning.

This article is part of our 7-part Intro series. The others can be found here.