The KNN (K-Nearest-Neighbor) Classifier
Introduction
We often need a program to classify things, that is, to assign each item to one of several groups (called “classes”).
For example, suppose a small town has 2 middle schools, “West Middle School” and “East Middle School”. Students are assigned to one of these 2 schools based on their home addresses. Given a new home address, we can create a program that tells us which school it is assigned to.
There are many ways to build a classifier. The KNN (K-Nearest-Neighbor) classifier is one of the most commonly used classifiers because it is simple and fast.
The basic idea is that we assume we already know the classes of some data points. Given a new data point, we find the K (for example, 3) known data points that are “nearest” (closest) to it. These points are called the K nearest neighbors.
Once we have found these K data points, we simply take a majority vote among them: whichever class most of them belong to is the class we predict for the new data point.
In the school-assignment example, given a new home, we can go through all the homes whose schools we already know and find the 3 homes closest to the new home. If 2 or more of these 3 known homes are assigned to the West school, then the new home is most likely assigned to that school as well.
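This distance-and-vote logic can be written out in a few lines of ordinary code. The sketch below is only an illustration of the idea; the coordinates and the helper name are made up, and this is not the actual block implementation.

import math
from collections import Counter

def knn_predict(known_points, new_point, k=3):
    # Each known point is (x, y, label); the new point is (x, y).
    distances = []
    for x, y, label in known_points:
        distances.append((math.dist((x, y), new_point), label))
    # Keep the K closest points, then take a majority vote on their labels.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical example: 4 known homes and 1 new home.
known = [(-100, 50, "West"), (-80, 120, "West"), (90, 60, "East"), (120, -30, "East")]
print(knn_predict(known, (-90, 80), k=3))   # -> "West"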
Preparing the Known Data Table
Since the KNN classifier relies on data points that we already know about, we need to put these data points into a table first.
Each row in the table will represent one data point. If you have 1000 known data points, then this table should have 1000 rows.
The first column of the table has to be called “label”, and it should contain the label (i.e. the “class”) of each data point. Note that the label can take 2 or more different values.
The remaining columns should contain the properties of each data point. All values have to be numbers, since they will be used to calculate the distance between points.
Here is an example table called “known data” for the school-assignment problem. It has 16 rows, representing 16 homes. Each home has a label of “West” or “East”, plus the X and Y positions of that home.
Here is a map view of these homes. Blue homes are for the “West” school, and red homes are for the “East” school.
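If we think of this table as plain data, it is simply a list of rows where the first value is the label and the remaining values are numbers. The rows below are made-up examples for illustration only; the real “known data” table has 16 homes.

columns = ["label", "X", "Y"]
known_data = [
    # label     X     Y
    ["West", -150,  120],
    ["West", -120,   60],
    ["East",  100,   80],
    ["East",  140,  -40],
]
# Every column after "label" must be numeric, since distances are computed from them.
for row in known_data:
    assert all(isinstance(value, (int, float)) for value in row[1:])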
Create the KNN Classifier
To create a new KNN classifier, you need to use this block in the AI category (premium only).
3 parameters are required:
- Table for Known Data: You need to select the name of the table with known data points.
- K: The number of neighbors to search for. Note that K should be an odd number, since the final prediction is a majority vote among the K neighbors. If the K neighbors split evenly between two classes, there is no clear winner.
- Classifier Name: You can create more than one classifier in each project. Therefore, you have to name each classifier, so that you can refer to it later.
After running this block, the classifier is ready for making predictions.
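Conceptually, this step just stores the known data table and the K value under the chosen name, so that several classifiers can live in one project. Here is a rough sketch of that idea in Python; the function and variable names are invented for illustration and are not the block’s real API.

classifiers = {}

def create_knn_classifier(known_table, k, name):
    # An even K could produce a tied vote, which is why an odd K is recommended.
    if k % 2 == 0:
        raise ValueError("K should be an odd number to avoid tied votes")
    classifiers[name] = {"known": known_table, "k": k}

# Hypothetical known data: rows of [label, X, Y].
known_data = [["West", -150, 120], ["West", -120, 60], ["East", 100, 80]]
create_knn_classifier(known_data, 3, "school classifier")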
Preparing the New Data Table
To use the classifier to make a prediction on some new data points, we need to put these data points into another table. This table should have at least one row and the same columns as the known data table: “label” as the first column, and the other columns containing the properties of each point. The “label” column will be filled in by the classifier.
For example, here is a new data table with the X and Y positions of 2 new homes.
Classify the New Data Points
To classify the new data points, you can run this block:
- New Table Name: this is the name of the table that contains the new data points to be classified. Note that it has to be a different table from the known data table. If you select the known data table, this block may overwrite the “label” data in it with wrong values!
- Classifier Name: this is the name of the classifier you have already created using the “create KNN number classifier” block.
- With Neighbors or Not: the last input controls whether to output the list of neighbors found for each new data point, which is useful for understanding how the prediction is made. If you choose “Yes”, a new column “neighbors” will be added to the new data table automatically. For each row, the value in this column will be a comma-separated list of K numbers, which are the row numbers of the K neighbors in the known data table.
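As a rough mental model, this block walks through the new table, finds the K nearest known rows for each new row, fills in the “label” column with the majority vote, and (if requested) records the neighbors’ row numbers. The sketch below only illustrates that behavior; the data and helper name are made up.

import math
from collections import Counter

known_data = [["West", -150, 120], ["West", -120, 60], ["West", -100, 140],
              ["East", 100, 80], ["East", 140, -40]]   # rows of [label, X, Y]
new_data = [["", -130, 100], ["", 120, 20]]            # "label" to be filled in

def classify_table(known, new, k, with_neighbors=True):
    for row in new:
        # (distance, row number in the known table, label) for every known row
        dists = sorted((math.dist(row[1:], kr[1:]), i + 1, kr[0])
                       for i, kr in enumerate(known))
        nearest = dists[:k]
        # Majority vote among the K nearest labels
        row[0] = Counter(label for _, _, label in nearest).most_common(1)[0][0]
        if with_neighbors:
            row.append(",".join(str(rownum) for _, rownum, _ in nearest))

classify_table(known_data, new_data, k=3)
print(new_data)   # labels filled in, plus comma-separated neighbor row numbers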
Demo 1
Here is the prediction result for the school-assignment problem. As you can see, the labels have been filled in, and the “neighbors” column shows the row numbers of the 3 neighbors used.
For example, for the first home, its X position is -150 and Y position is 130. Based on that information, the classifier has found 3 known homes, whose row numbers are 1, 16 and 2.
Here is a map view of the prediction. The 2 new homes are in green. The new home on the left has a blue square around it, which means our classifier thinks it belongs to the West school. The new home on the right is in a red square, which represents the East school.
Note that we are also plotting the 3 nearest neighbor homes for each of the new homes.
Demo 2 - Choosing the K Value
In general, a larger K value takes more time, since more neighbors have to be found, but it also makes the prediction less sensitive to outliers in the known data.
For example, among the known homes, the one at the top left has a label of “East”. This is called an “outlier”, since it is not consistent with the surrounding data points. It may be due to a mistake made when inputting the data, or due to a special rule that we do not know about.
In this case, if we use a K value of 1, the new home on the left will be assigned a label of “East”, because its single nearest neighbor is that “East” home.
If we use a K value of 5, we would have a 4 vs 1 vote on the new home on the left:
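To make the effect of K concrete, here is a small worked example in the same spirit (the points and the outlier below are made up, not the actual homes from the map): with K = 1 the single mislabeled neighbor decides the result, while with K = 5 the majority outvotes it.

import math
from collections import Counter

def knn_predict(known, point, k):
    dists = sorted((math.dist(point, (x, y)), label) for x, y, label in known)
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

known = [(-150, 150, "East"),   # the outlier in the top-left corner
         (-140, 120, "West"), (-120, 140, "West"),
         (-110, 100, "West"), (-100, 160, "West")]
new_home = (-145, 148)

print(knn_predict(known, new_home, k=1))   # "East"  (decided by the outlier alone)
print(knn_predict(known, new_home, k=5))   # "West"  (4 vs 1 vote)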
You can play with this demo project here: https://play.creaticode.com/projects/bf1be94f673e792fd6d7b121