Machine Learning, Part 2: Classification

Reading Time: 10 minutes

I’m working my way through a Coursera specialization on Machine Learning. The specialization includes several courses, the first of which provides a high-level overview of ML. I finished that one before I began talking about the coursework on my blog because I didn’t want to identify myself as a student of machine learning until I had actually gone through with something.

Going forward, I’ll share a post on each of the in-depth classes in the specialization. The first in-depth class was called Regression, and you can find my post about that course right here. The second class is called Classification, and this post will cover key concepts from the course as well as my observations and any supplementary readings that help me understand the material.

**Publishing unfinished. Will update as course continues.

Module 1: Linear Classifiers & Logistic Regression

A linear classifier classifies items as one thing or another; we can model it by plotting all of our data and drawing a line that separates the points classified as one category from the points classified as the other.

There are other. more complex classifiers that classify groups with more features and into more categories, and we will get to those in later modules.

A linear classifier will take a training set of pre-classified data and learn which of the features most correlate to one category versus the other. Then the model can score new data based on those features: high scores are the most likely to be in one category and the lowest scores are the most likely to be in the other category.

A logistic regression model squeezes all of these scores into a range from 0 to 1: the lowest scores are close to zero, and the highest ones are close to 1. These scores represent the probability that a data point belongs in one category versus the other.

Screen Shot 2016-03-21 at 8.58.09 PM
Screen Shot from the course, complete with floating professor.

We can use linear classifiers to model data with more than two categories: we just have to do it multiple times. For each of our categories, we must train a one-versus-all classifier; it models the probability that each data point belongs to one category versus all of the others. So for a dataset with five categories, we would have to do this five separate times.

Module 2: Learning Linear Classifiers

When we define a linear classifier with a logistic regression function, how do we define the best fit? A logistic regression function outputs, for a given data point, the probability p  that that point falls into one of the categories (with 1-p representing the probability that the data point falls into the other category). So the best fit function will maximize the value of p for data points that go into one category (represented by Y=1),  and it will minimize the value of p for data points that go into the other category (represented by Y=-1).

We can calculate the data likelihood of the function: that is, a measure of its accuracy. By taking the derivative of this, we get a function that, when minimized, leads us to the maximum data likelihood. We can get there by gradient ascent—taking small steps in the weights of all the coefficients in the function until the we arrive at the “top of the hill” of possible data likelihoods, where the derivative is zero (or sufficiently close to it). This handy presentation discusses the details of data likelihood and its derivative so that you or I can perform those calculations ourselves. Additionally, this article helped me to understand the intuition behind the math.

When we try to maximize the data likelihood, we want to watch out for overfitting. By converging at extremely large coefficients, a logistic regression function can reach a point where it models test data perfectly. However, an overfit function will generalize poorly to any data set other than the test data set (high variance, low bias. See how it all fits together?). It will also predict points with too much confidence. So a point that is very close to the decision boundary will get a p value of, say, .95, even though only 60% of points this distance from the decision boundary end up being in the category that this point is in. In this case, a more accurate p value would be lower than that. Here is what an overfit decision boundary might look like next to smoother, more realistic ones:

High Bias/Low Variance——————————————————-Low Bias/High Variance

Columbia Data Science - Overfitting

We can correct for this with logistic regression functions using the same tools that we used to correct for it with linear regression functions: using an L2 penalty (represented by the sum of the squares of the coefficients) and using an L1 penalty (represented by the sum of the absolute values of the coefficients).

Like in linear regression, L2 regularization in logistic regression will push all coefficients toward zero. Also like in linear regression, L1 regularization in logistic regression will push some coefficients to zero, so the terms attached to the coefficients that have the least consistent effects on the data will get removed. For a review of these concepts, see this post.

Modules 3 & 4: Decision Trees

A decision tree classifies data by sorting data points into groups based on one of their features, then sorting each of those subgroups on another feature, recursively down the line until one of the following conditions is met:

  1. There are no more features to sort on
  2. All of the data in all of the groups has the same outcome variable
  3. The tree has reached the maximum depth (complexity) that the researchers chose
  4. The nodes at the bottom have fewer data points than a minimum node size that the researchers chose
  5. An additional split reduces the error of the predictions by less than a minimum amount that the researchers chose (not enough benefit for added complexity)

A decision tree looks like this:

Thank you to Eren Golge for the image.
Thank you to Eren Golge for the image.

A decision tree has made an error if a data point at one of the leaf nodes (the last nodes, where there is no more tree below them) gets classified wrong—for example, in the above tree, a sort-of-yellow orange would get misclassified as a grapefruit.

Training a decision tree can get complicated, and there are lots of ways to do it. But for now, let’s stick with the greedy model of decision tree classification. This is how it works: we split all of our data on each feature that we might want to use in the decision tree. Then we calculate the error (misclassifications) that each split would produce. So suppose we have 16 items, each with color and shape, and classification C. If we split them on color, we get a group of 9 and another group of 7. In the group of 9, 6 have one value for C, and 3 have a different value for C. In the group of 7, the split is 3 and 4. So At least 3 are misclassified in each group, resulting in a total error of 6/16, or about 0.37. If we split on shape, we have a group of 10 and a group of 6. In the group of 10, 9 have one value for C , and 1 has a different value. In the group of 6, they all have the same value for C. So the error is only 1/16, about 0.06—much lower than the other split. So we split on shape first. We keep doing this recursively down the tree until one of the stopping conditions (see above) is met.

The interesting thing about decision trees is that they sort of model the way human beings make decisions. We pick our most important criterion, then think about our second most important criterion, et cetera until we have a decision. This makes the tree a more intuitive learning model than some of the others.

Module 5: Boosting

In 1988, computer scientists Leslie Valiant and Michael Kearns posed a question: can a set of weak learners be combined to form a stronger learner? The question was answered one year later when Robert Schapire published a paper on boosting. He and Yoav Feund came up with the first boosting algorithms—strategies that train a series of weak learners on a set of data and then give each learner a “vote,” with different weights based on each learner’s accuracy, as to what the prediction should be for each piece of data in a dataset. The most famous boosting algorithm, called AdaBoost, uses decision trees as its base, but the concept of boosting can be used with a wide variety of machine learning techniques.

Module 6: Precision and Recall

Up until now, we have been evaluating our models largely by accuracy—that is, number of classifications that are correct divided by the total number of data points. However, some kinds of inaccuracy matter more than others. In many situations, a false positive and a false negative prediction have different consequences associated with them. For these situations, we differentiate between kinds of inaccuracy with precision and recall.

Recall describes the total number of positive examples that the model guessed correctly divided by the total number of positive examples. For example, if a database contains 100 things relevant to our search and our search algorithm finds 86 of them, our recall would be 86%.

Precision describes how many of the examples predicted to be positive that the model guessed correctly. For example, if our search algorithm finds 100 things and only 57 of them are actually relevant to our search, our precision would be 57%.

When a false positive is costlier than a false negative, we want to optimize for precision. When a false negative is costlier than a false positive, we want to optimize for recall. This quick, helpful read from Creighton University contains some helpful diagrams for making precision and recall easier to understand.

Module 7: Gradient Ascent vs. Stochastic Gradient Ascent

I wouldn’t do nearly as good a job of explaining the differences between gradient ascent and stochastic gradient ascent as Sebastian Rachka does in this article, so I recommend it for reading up on this topic.

Why would we apply stochastic gradient ascent? Well, this type of learning from our data takes much less time per update than gradient ascent, and it works well if we want to change our predictions based on data points getting added to our dataset one by one in real time (this is called online learning). It’s also helpful for scaling to large datasets, like the sum of the video data collected by YouTube, for example.

This covers all of the modules in the Classification course of this course series. Next, I believe, is recommenders.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.