We’ve gone over the basics of data collection, how to pick out data for a training set, and how to begin the machine learning process. There are further steps to take, though.
In order to refine our model, we need to look closely at the decision trees our training data produced. Last week, we briefly mentioned decision trees. A decision tree is a visual representation of all the potential outcomes of a series of decisions. Each decision is represented as a “node”, and possible outcomes branch off of each node like tree limbs, creating successive levels of specificity. A machine learning program uses these decision trees to build a complex probabilistic model. Danger arises, however, when these trees become too narrowly tailored to the training data.
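To make this concrete, here is a minimal sketch of fitting a decision tree and inspecting its structure. It assumes scikit-learn is available and uses a small synthetic dataset; the dataset and parameter values are illustrative, not from the article.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a collected training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Each internal node tests one feature; branches split off toward the leaves,
# adding a level of specificity at every depth.
print("depth:", tree.get_depth())
print("nodes:", tree.tree_.node_count)
```

The deeper the tree, the more finely it carves up the training data, which is exactly where the danger mentioned above begins.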
A model built on training data should never be overly complex; that would defeat the purpose. A model’s efficacy is determined not by its performance on the training data alone, but by its ability to perform well on unseen data. Overfitting undermines exactly this goal. Overfitting is what happens when a decision tree grows so large that the model begins “memorizing” the training data: outliers are retained rather than excluded, and the underlying patterns are obscured.
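A quick way to see overfitting in action, assuming scikit-learn: grow an unrestricted tree on deliberately noisy synthetic data and compare training accuracy against accuracy on held-out data. The noise level and dataset are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, giving the tree outliers to "memorize".
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With no size limit, the tree keeps splitting until it fits the
# training set essentially perfectly, noise and all.
big_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", big_tree.score(X_train, y_train))
print("test accuracy:", big_tree.score(X_test, y_test))
```

The gap between the two scores is the overfitting: perfect recall of the training data, noticeably worse performance on data the model has never seen.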
Pruning and Cross-Validation
There are several ways to fix a model that is overfit. Pruning is one such method. A powerful tool in our arsenal, pruning is a technique that reduces the size of decision trees by removing sections that do not add significant predictive power. If a node is not providing much additional information to the model, then we “prune” that “branch” from the “tree”. The goal of pruning is to shrink a decision tree without reducing the predictive accuracy of the model.
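As a sketch of pruning in practice, scikit-learn (assumed here) exposes cost-complexity pruning through the `ccp_alpha` parameter: branches whose predictive contribution does not justify their complexity are removed. The alpha value below is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so the unpruned tree grows large.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2,
                           random_state=0)

# Fully grown tree: every branch kept, noise included.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# ccp_alpha > 0 turns on cost-complexity pruning, trimming branches
# that add little predictive power.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print("full tree nodes:  ", full.tree_.node_count)
print("pruned tree nodes:", pruned.tree_.node_count)
```

The pruned tree is substantially smaller, yet retains the splits that carry real signal, which is precisely the goal stated above.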
Another way to strengthen our model is through cross-validation. This process begins after data collection. Instead of forming only one training set, the total pool of data is split into two sets: one is used for training purposes, while the second is held back as a kind of control, or testing set. This split is called partitioning. Once partitioned, the training set is fed through the model, and then the testing set is run through it to establish validity. To increase the model’s robustness and reliability, multiple rounds of cross-validation can be performed, with different slices of the data serving as the training and testing sets each time.
This article is part of our 7-part Intro series. The others can be found here.