Introduction To Legal Analytics, Part 3: Machine Learning

Machine Learning is a form of artificial intelligence in which a computer learns from data without being explicitly programmed. It is used to develop statistical models, and to build those models we must first collect data to feed into the program. As we’ve explained before, we want to collect data on our firm’s cases and clients, personnel and communication, resources, and risk management and uncertainty. Any record of fact we have should be included in the data collection process; the more, the better. Once we have collected our data, it is time to interpret it. This is where machine learning comes in.

Machine Learning is not unlike the way we humans evolved to learn from patterns in our environment and to adapt our behaviors to those patterns. At its core, machine learning is the synthetic equivalent of eating a poisonous berry, getting sick, and avoiding that berry in the future.

But what if the berry was just rotten? What if something else in the environment caused our sickness? We may see patterns where none exist, and we may miss patterns that do exist. Human learning is fallible, and if we are not careful, our blind spots can hinder machine learning as well. We may mistake a loose correlation for a meaningful pattern, or assign causation where none exists. To combat these errors, we need a better understanding of two core concepts of machine learning: training data and generalization.

Training Data

The first step in constructing a machine learning model is the collection and selection of training data. Training data is a starting set of data representative of the kind of data the model will ultimately encounter in practice. A firm’s specific needs dictate what constitutes a good training set. If our mission is to improve the efficiency of a human resources system, for example, we don’t want training data drawn primarily from the accounting department. A good training set needs to be diverse and comprehensive within set parameters, but it also should not push the model toward decision rules that are overly specific to quirks of the sample. In other words, we need to begin building our machine learning program with a data set that supports generalization.
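
As a toy illustration of this selection step, the sketch below uses plain Python with hypothetical field names (a firm's real records would replace them). It draws a training set that keeps every department represented rather than over-sampling one, then holds out the remainder for later evaluation:

```python
import random

# Hypothetical case records; in practice these would come from the
# firm's own data collection (department, hours billed, outcome, ...).
records = [
    {"dept": dept, "hours": random.Random(i).randint(10, 200)}
    for i, dept in enumerate(
        ["hr", "hr", "accounting", "accounting", "litigation", "litigation"] * 10
    )
]

def stratified_split(records, train_fraction=0.8, seed=0):
    """Split records into train/holdout while keeping every
    department represented in the training set."""
    rng = random.Random(seed)
    by_dept = {}
    for r in records:
        by_dept.setdefault(r["dept"], []).append(r)
    train, holdout = [], []
    for dept_records in by_dept.values():
        rng.shuffle(dept_records)
        cut = int(len(dept_records) * train_fraction)
        train.extend(dept_records[:cut])
        holdout.extend(dept_records[cut:])
    return train, holdout

train, holdout = stratified_split(records)
print(len(train), len(holdout))    # 48 12
print({r["dept"] for r in train})  # all three departments are represented
```

Stratifying by department is one simple guard against the skew described above; an HR model trained mostly on accounting records would never see the split it needs.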

Generalization

Generalization is the ability of a machine learning program to perform accurately on new, unseen examples and tasks after its initial training. A general model is constructed from our training data. After the training period, the program is fed additional relevant data. The goal is for the model built from the original training data to respond to this new data with high accuracy, and to continue building on that initial success as more new data arrives.
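
One common way to estimate generalization is to compare a model's performance on the data it trained on against its performance on data it has never seen. The sketch below uses hypothetical data and a deliberately simple threshold "model" (not anything from a real firm's system) to make the comparison concrete:

```python
# A deliberately simple "model": predict that a matter settles when
# estimated hours fall below a threshold learned from training data.
# (Hypothetical data; real firm records would replace these tuples.)
train = [(hours, hours < 100) for hours in range(0, 200, 5)]   # seen in training
unseen = [(hours, hours < 100) for hours in range(2, 200, 7)]  # new data

def accuracy(threshold, data):
    """Fraction of records where the threshold rule matches the label."""
    return sum((hours < threshold) == label for hours, label in data) / len(data)

def fit_threshold(data):
    """'Train' by trying each observed value as a threshold and
    keeping the one that is most accurate on the training data."""
    candidates = sorted({hours for hours, _ in data})
    return max(candidates, key=lambda t: accuracy(t, data))

threshold = fit_threshold(train)
print(accuracy(threshold, train))   # performance on seen data
print(accuracy(threshold, unseen))  # generalization: performance on new data
```

When the two accuracy figures stay close, the model has generalized; a large gap between them is the warning sign that the model has merely memorized its training set.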

Problems can occur at this stage. If our model does not generalize well, or if our training data is inadequate, our machine learning program can be rendered ineffective. Next week, we will discuss some of the issues that arise when data is not properly collected, training data is too narrow, or a model develops runaway decision rules that are overly specific.

This article is part of our 7-part Intro series. The others can be found here.