AI - KNN Classifier for Diabetes (Premium Only, Difficulty: 4)
CreatiCode last edited by info-creaticode
AI can be used to help diagnose many common illnesses, which allows people to monitor their health continuously. In this tutorial, we will build a computer model that predicts whether a person has diabetes. We will be using the K-Nearest-Neighbors (KNN) classifier, which makes its prediction based on known patients who are similar to the target patient.
We will be using a public-domain dataset from Kaggle, an online community and platform for machine learning and data science.
This dataset contains health data on 768 patients, along with whether each patient was diagnosed with diabetes.
- Open this link: https://www.kaggle.com/datasets/mathchi/diabetes-data-set
- Scroll down the page, and click the download button, which should download a file named “diabetes.csv”:
Now please create a new project named “KNN for Diabetes” on the CreatiCode playground, remove “Sprite1”, and rename the “Empty1” sprite as “Data”. We will use this sprite to hold all code for handling the dataset.
To work with the dataset we have downloaded, we first need to upload it into our project as a table.
Create a new table named “raw data”, then right-click on the table header in the stage, and select “import”. Find and select the “diabetes.csv” file you have downloaded. Usually, it is in your “Downloads” folder.
The “raw data” table should now contain 768 rows in 9 columns. Save the project now, so that this table will be stored as part of the project.
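Outside CreatiCode, you can do the same sanity check in a few lines of Python. The sketch below parses a two-row sample in the same format as the Kaggle file (the 9-column header order is taken from the dataset page); on the real “diabetes.csv” you would expect 768 data rows.

```python
import csv
import io

# A two-row sample in the same format as diabetes.csv
# (header order as published on the Kaggle dataset page).
sample = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]
print(len(header))  # 9 columns
print(len(data))    # 2 sample rows (the real file has 768)
```

Counting the header fields and data rows this way is a quick guard against a truncated or corrupted download.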
Before we build any model, it is always a good idea to browse through the raw data. It will help us understand the data better, and also allow us to check if there is anything wrong with the data, such as missing values or incorrect values.
For example, if you scroll left to the first few columns, and scroll down to row 76, you should see some issues:
For row 76, the second column is 0. This column is “Glucose”, the patient’s blood sugar level, which should never be 0. A value of 0 means the data is missing for this patient: either the Glucose level was never measured, or it was left out of the raw data.
Similarly, for row 79, column 3 is 0. This column is the patient’s blood pressure, which also should never be 0.
In later steps, we will skip such invalid rows to improve our model’s accuracy.
The last column of the raw data is “outcome”. It represents whether this patient is diagnosed with diabetes or not. Can you find any pattern in the data for diabetes patients? Such intuitions will be crucial for designing your classifier model.
Next, create a new table named “known data”. It will be used as the input table for the KNN classifier.
To keep our classifier model simple, let’s start with only 2 columns: the second column “Glucose” and the sixth column “BMI”. We can use more columns later.
The new table “known data” will have 3 columns: “label”, “Glucose” and “BMI”. Note that the “label” column will be the “outcome” column from the raw data.
Although we can add these 3 columns manually, it is preferable to use a few blocks to do it, so that we can reset the table at any time. Please make a new block named “prepare known data”, and add these blocks to its definition:
Make sure you select “known data” in the drop-down, not the “raw data” table. These blocks will remove all the columns first, so the table becomes completely empty with no rows and no columns. Then we add 3 columns at position 1/2/3.
If you click this stack of blocks, they will change the “known data” table to this:
Next, we will copy some raw data into the known data table. We are going to copy only the first 700 rows of the raw data to train our model, and use the remaining 68 rows to test our model.
We will use a for-loop to call the “add to table” block 700 times, and the variable “i” will represent the row number:
For each row in the raw data, we need to extract 3 values in the “outcome”, “glucose” and “bmi” columns. The row number will be the variable “i”. Note that when you drag the “item at row/column” blocks, because they are quite wide, you need to align the left tip of the block with the input box.
After running the “prepare known data” stack again, the “known data” table now has 700 rows, although only the first 100 rows are displayed:
Since both “glucose” and “BMI” have to be greater than 0, we need to skip those rows with 0 values for one or both of these columns.
Now if we run this stack again, we see that the “known data” table has 685 rows instead of 700. In other words, 15 rows with some 0 values are ignored.
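The block stack above is doing the equivalent of the following Python sketch: loop over the first 700 raw rows, skip any row whose “Glucose” or “BMI” is 0, and copy the rest into a 3-column table (column names assumed to match the raw table).

```python
# Build the "known data" table from the first 700 raw rows,
# skipping rows where Glucose or BMI is 0 (missing measurements).
def prepare_known_data(raw_rows):
    known = []
    for row in raw_rows[:700]:
        if row["Glucose"] == 0 or row["BMI"] == 0:
            continue  # skip invalid rows
        known.append({"label": row["Outcome"],
                      "Glucose": row["Glucose"],
                      "BMI": row["BMI"]})
    return known

# Tiny illustration: the second row is dropped because Glucose is 0.
raw = [
    {"Glucose": 148, "BMI": 33.6, "Outcome": 1},
    {"Glucose": 0,   "BMI": 26.6, "Outcome": 0},
    {"Glucose": 183, "BMI": 23.3, "Outcome": 1},
]
print(len(prepare_known_data(raw)))  # 2
```

On the real dataset this filter is what drops the row count from 700 to 685.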
Next, we will prepare another table named “test data”. It will contain the remaining rows from the raw data table. They will be used for testing the accuracy of our classifier.
This table has to start with the same 3 columns as the known data table: “label”, “Glucose” and “BMI”. The “label” column will be left empty, so that our classifier can write its prediction into it. To make it easier to compare the prediction with the actual diagnosis, we will add a fourth column to the “test data” table called “truelabel”. The classifier will ignore this column, but we can use it to check the model output later.
We can define another block “prepare test data” like this:
Click on this new stack to add the 4 columns to the empty table:
The code to copy into the “test data” table is very similar to that of the “known data” table. The key difference is now we are reading the rows from 701 to 768, and also we are copying the outcome into the “truelabel” column, and leaving the “label” column empty. Don’t forget to select the “test data” table.
After running this stack again, the “test data” table should contain 67 rows (one row of raw data is skipped):
Note that the “label” column is left empty. After this step, our data tables are ready.
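The test-data stack mirrors the known-data one; a Python sketch of the same logic is below. The `start` parameter is an assumption for illustration (0-indexed first test row, which would be 700 for the real dataset), and the same zero-value filter explains why 68 candidate rows become 67.

```python
# Build the "test data" table from the remaining raw rows,
# leaving "label" empty and keeping the true diagnosis in "truelabel".
# "start" is the 0-indexed first test row (700 for the real dataset).
def prepare_test_data(raw_rows, start=700):
    test = []
    for row in raw_rows[start:]:
        if row["Glucose"] == 0 or row["BMI"] == 0:
            continue  # same validity filter as the known data
        test.append({"label": "", "Glucose": row["Glucose"],
                     "BMI": row["BMI"], "truelabel": row["Outcome"]})
    return test

# Tiny illustration with start=1: the middle row is skipped (BMI is 0).
raw = [
    {"Glucose": 148, "BMI": 33.6, "Outcome": 1},
    {"Glucose": 120, "BMI": 0,    "Outcome": 0},
    {"Glucose": 183, "BMI": 23.3, "Outcome": 1},
]
print(len(prepare_test_data(raw, start=1)))  # 1
```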
As an optional step, you can plot the data in the “known data” table to gain some additional insight. We can use a small dot to represent each patient: “red” dots for diabetes patients and “green” dots for other patients. Each dot can be drawn as a small rectangle of width 2, height 2, and border width 0. Since each row only has 2 measurements, we can use “glucose” as the x position and “bmi” as the y position.
When we run this stack, we get a “cloud” of red and green dots.
We can observe that green dots are denser on the left, and red dots are denser on the right. That means if the “glucose” is larger, then it’s more likely this patient is diagnosed with diabetes.
In addition, the green and red dots are mixed up in many areas, which means it would be somewhat difficult to use the nearest neighbors for classification.
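The plotting loop boils down to a simple mapping from each row to a dot. Purely as a sketch of that mapping (the drawing itself is done by CreatiCode blocks), in Python it could look like:

```python
# Map each known-data row to a dot: x = Glucose, y = BMI,
# red for diabetes patients (label 1), green otherwise.
def row_to_dot(row):
    color = "red" if row["label"] == 1 else "green"
    return (row["Glucose"], row["BMI"], color)

dots = [row_to_dot(r) for r in [
    {"label": 1, "Glucose": 148, "BMI": 33.6},
    {"label": 0, "Glucose": 85,  "BMI": 26.6},
]]
print(dots)  # [(148, 33.6, 'red'), (85, 26.6, 'green')]
```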
Now let’s create a new sprite named “model”, which will hold all the blocks related to our classifier model.
In this sprite, we will create a KNN classifier named “c”. To start simple, let’s just use the 3 nearest neighbors. That is, for each unknown patient, we will try to find 3 known patients with similar “glucose” and “bmi” measurements. If 2 or more of these 3 patients have diabetes, then predict the unknown patient also does.
Note that you need to select “known data” from the drop-down.
Next, let’s make some predictions using the classifier. Note that the “test data” table is selected, and we have chosen to show the 3 neighbors used for prediction:
Now if you click the green flag, the model will run the prediction and write the results into the “test data” table. Given the small amount of data, it should finish the task instantly. Take a look at the “test data” table. It should now contain data in the “label” column, and a new column named “neighbors” should be added automatically. For each row in the “test data”, this column contains 3 numbers, which are the row numbers of the 3 most similar patients in the “known data” table.
The accuracy of the classifier can be defined as the percentage of test rows that we have predicted correctly. Given that there are 67 rows in the “test data” table, we just need to count for how many rows the “label” given by the model matches the “truelabel” from the raw data.
To do that, we can create a new variable “correct count”, and set it as 0 first. Then we go through each of the 67 rows in the “test data” table, and if we find the “label” value is the same as the “truelabel” value for that row, we add 1 to “correct count”. Lastly, we can divide the correct count by 67 to get the accuracy percentage.
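The counting loop just described can be sketched in Python as follows (the `accuracy` helper name is an illustration, not a CreatiCode block):

```python
# Accuracy: percentage of test rows where the predicted "label"
# matches the actual diagnosis in "truelabel".
def accuracy(test_rows):
    correct = sum(1 for r in test_rows if r["label"] == r["truelabel"])
    return correct / len(test_rows) * 100

rows = [
    {"label": 1, "truelabel": 1},
    {"label": 0, "truelabel": 1},  # one wrong prediction
    {"label": 0, "truelabel": 0},
]
print(accuracy(rows))  # 2 of 3 correct, about 66.7%
```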
You should get an accuracy of about 67%. In other words, for every 3 new patients, our model will classify 2 of them correctly on average.
To improve our classifier’s accuracy, we can try to use more neighbors, so that our classifier is not misled by a few “bad neighbors”.
For example, if we change the neighbor count from 3 to 13, the accuracy would jump from 67.1% to about 80.5%:
The classifier we have built is relatively simple. You can try to improve its accuracy in a few ways:
- Different neighbor count: which neighbor count parameter would give us the highest accuracy?
- Different columns: instead of “Glucose” and “BMI”, would some other columns be more helpful? You can try to change or add columns to your known data and test data.
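One systematic way to explore the neighbor count is to re-run the evaluation for several odd values of k and keep the best. The self-contained toy sketch below (hypothetical helper names, tiny made-up data, no feature scaling) shows the pattern; on the real tables you would compare accuracies the same way.

```python
import math
from collections import Counter

# Predict by majority vote among the k nearest known patients.
def knn_predict(known, glucose, bmi, k):
    nearest = sorted(known, key=lambda r: math.hypot(
        r["Glucose"] - glucose, r["BMI"] - bmi))[:k]
    return Counter(r["label"] for r in nearest).most_common(1)[0][0]

# Percentage of test rows predicted correctly for a given k.
def accuracy(known, test, k):
    correct = sum(1 for r in test
                  if knn_predict(known, r["Glucose"], r["BMI"], k) == r["truelabel"])
    return correct / len(test) * 100

# Tiny made-up dataset: a high-glucose cluster (diabetes) and a low one.
known = [{"label": 1, "Glucose": 170 + i, "BMI": 33 + i} for i in range(5)] + \
        [{"label": 0, "Glucose": 90 + i,  "BMI": 24 + i} for i in range(5)]
test  = [{"truelabel": 1, "Glucose": 172, "BMI": 34},
         {"truelabel": 0, "Glucose": 92,  "BMI": 25}]

# Sweep odd neighbor counts and keep the most accurate one.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(known, test, k))
print(best_k, accuracy(known, test, best_k))
```

Odd values of k avoid tied votes in a two-class problem, which is one reason counts like 3 or 13 are common choices.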