So far, we’ve thrown a lot of abstract words at you. We’ve supplied a couple of small examples here and there. Today is the day we bring it all together. This Intro series has defined and explored a lot of ideas related to machine learning. Now we will conceptualize an example machine learning program.
First, we need to define just a few more terms to get going:
– A data set includes individual kernels of data called instances. An instance is one data point in a data set.
– A task is a specific problem we want our machine learning program to solve.
– A binary classification task is a task where each instance is assigned one of only two possible classes. This is the simplest kind of classification task, and it simplifies the construction of a training data set. Binary classification works just fine for certain tasks, such as spam e-mail filters (only two classes: spam or non-spam). This approach has obvious limits, though. Most data environments are more complex. Machine learning programs frequently use multi-class classification when the data calls for more than two classes.
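To make these terms concrete, here is a minimal Python sketch of a binary classification task. The four training e-mails, the names `train` and `classify`, and the word-counting rule are all made up for illustration. Real spam filters are far more sophisticated, so treat this as a toy under those assumptions.

```python
from collections import Counter

# Hypothetical data set: each instance is (e-mail text, label).
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to friday", "non-spam"),
    ("please review the attached brief", "non-spam"),
]

def train(instances):
    """Count how often each word appears under each of the two labels."""
    counts = {"spam": Counter(), "non-spam": Counter()}
    for text, label in instances:
        counts[label].update(text.split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap more with the text."""
    words = text.split()
    spam_hits = sum(counts["spam"][w] for w in words)
    other_hits = sum(counts["non-spam"][w] for w in words)
    # Ties fall back to "non-spam" so that ordinary mail is not discarded.
    return "spam" if spam_hits > other_hits else "non-spam"

model = train(training_data)
print(classify(model, "claim your free prize"))
print(classify(model, "brief for the friday meeting"))
```

Every instance gets exactly one of two labels, which is what makes the task binary. A multi-class version would simply keep one counter per class and pick the best-scoring one.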
Multi-class classification is where the power of machine learning becomes quickly evident. If our data environment has enough complexity, we want to use a little bit of unsupervised learning to begin looking for interesting patterns and data clusters. Our goal is to locate patterns and clusters that do not necessarily have ironclad definitions. We don’t always know right away what we are looking for in our data, and unsupervised learning can help develop potential ideas. This differs from supervised learning, in which we may already know what kinds of patterns we want our algorithm to find.
There are many techniques for creating data clusters. The mathematical nuances of all the different types and varieties are beyond the scope of this blog (but NOT beyond the scope of LexSemble). It is perhaps enough to know that, even if we do not yet know how we plan to use our data, clustering algorithms can provide the first glimpses into how our data might be organized. Okay, on to an example.
Say we want to classify all of the books in our library into separate genres with as much accuracy as possible. Certainly, we don’t need a computer to quantify and categorize a library of a few dozen books. But what if our library has tens of thousands? One type of clustering algorithm might sort the books based on length. Certain genres tend to be longer, while others tend to be shorter.
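As a rough illustration of clustering by length, the sketch below runs a bare-bones k-means loop over a handful of word counts. The word counts, the function names, and the choice of k-means itself are assumptions made for this example. It is one of many clustering techniques, not necessarily the one a production system would use.

```python
def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups

def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with a basic k-means loop."""
    ordered = sorted(values)
    # Spread the starting centers evenly across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group (unchanged if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(values, centers)

# Hypothetical word counts for six books in the library.
lengths = [900, 30000, 60000, 75000, 250000, 220000]
centers, groups = kmeans_1d(lengths, k=2)
print(centers)
print(groups)
```

Notice that nobody told the program what "short" or "long" means. The two groups emerge from the data, which is the point of unsupervised learning.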
But we aren’t limited to only one data categorization. The only limit is our imagination. Another algorithm in our machine learning program might cluster books based on the average length of a sentence, or the average length of each word in each sentence. Historical fiction, hard sci-fi, biographies, and non-fiction works on economics and law tend to be verbose and perspicacious (e.g., by using words like “verbose” and “perspicacious”). By contrast, young adult, romance, and other genre fiction tend toward simpler language easily understood by a wider audience. These are just a few ways that the books in our library can be placed into helpful clusters. The key point here is that we don’t have to make an end-value judgment of the data right away.
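Average sentence length and average word length are easy to measure. The following sketch shows one way to compute both. The function name `text_features` and the simple punctuation-based sentence splitting are assumptions for illustration, and real text (abbreviations, dialogue, and so on) would need more careful handling.

```python
import re

def text_features(text):
    """Return (average words per sentence, average letters per word)."""
    # Treat runs of ., ! or ? as sentence boundaries; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_sentence_len = len(words) / len(sentences) if sentences else 0.0
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return avg_sentence_len, avg_word_len

print(text_features("The cat sat. It ran!"))
```

Each book then becomes a pair of numbers, and those pairs can be handed to a clustering algorithm in the same way as the book lengths.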
An analysis of the actual language used in the books could further subdivide our library. Certain words appear most frequently in legal texts (words like “jurisprudence”, “per curiam”, “amicus brief”, “remand”, and so on). Other words might only be expected in gothic fiction (“vampire”, “supernatural”, “witch”). The program could easily search through each text to find these patterns. Allowing the program to find and define numerous classes and clusters will help paint a detailed picture of our entire datascape.
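A keyword search like this might look something like the sketch below. The keyword sets are borrowed from the examples above but trimmed to single words (so “amicus” stands in for “amicus brief”), and the names `GENRE_KEYWORDS`, `genre_scores`, and `likely_genre` are hypothetical. A hand-written word list is a simplification. A trained program would learn which words matter from the data.

```python
import re
from collections import Counter

# Hypothetical single-word keyword lists for two genres.
GENRE_KEYWORDS = {
    "legal": {"jurisprudence", "remand", "amicus"},
    "gothic": {"vampire", "supernatural", "witch"},
}

def genre_scores(text):
    """Count how many times each genre's keywords appear in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        genre: sum(words[w] for w in keywords)
        for genre, keywords in GENRE_KEYWORDS.items()
    }

def likely_genre(text):
    """Return the best-scoring genre, or None if no keyword appears."""
    scores = genre_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

sample = "The witch feared the vampire. The vampire fled."
print(genre_scores(sample))
print(likely_genre(sample))
```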
The endgame of machine learning is what you make of it. Like any tool, how you use a machine learning program determines its efficacy. This is why it’s so important to have as large a data set as possible. This is also why training data and cross-validation are critical first steps. Sufficiently trained algorithms can confront all kinds of problems, see patterns where a human may not, and streamline the data environment of any firm, large or small.
For more on the complexities of spam filter creation, and to see more in-depth analysis of many concepts covered in this Intro series, check out Peter Flach’s “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”; Cambridge University Press, 2012. Be sure to also check out our own training service for professionals.
This article is part of our 7-part Intro series. The others can be found here.