Introduction to K-Means Clustering
Zenva
ACCESS the FULL COURSE here: https://academy.zenva.com/product/data-science-mini-degree/?zva_src=youtube-datascience-md
TRANSCRIPT
In this video we're going to look at probably the most popular clustering algorithm: K-means clustering. The point of K-means is to take our data and separate it into K disjoint clusters. These clusters are defined so that they minimize what's called the within-cluster sum-of-squares, but we'll get to that a bit later, when we discuss K-means convergence. By disjoint, I mean that one point cannot belong to more than one cluster.

The only parameter we have to set for this algorithm is the number of clusters, K. So this is an example of an algorithm where you need some notion of how many clusters you want your data to have before you run it. Sometimes it's quite obvious how many clusters you have, but many times it's not so clear, so a bit later we'll discuss a couple of techniques for choosing K.

K-means clustering is very popular and very well known, and the algorithm itself is quite simple. It's just a couple of lines, and if you had to write the code from scratch, it's not something that would take you a long time. It's very commonly taught in computer science curricula. So knowing this algorithm is kind of the first step to getting more acquainted with clustering. It's the baseline algorithm, and we'll move on to more complicated algorithms a bit later.

One point of terminology here. You'll often hear about a cluster center, or centroid, and really that's just a point that represents a cluster. In the figure that I have here on the bottom right, we have two clusters.
We have a red cluster and a blue cluster, and the X marks the centroid. In other words, the centroid is the average of the X coordinates and the average of the Y coordinates, and together those give you the point. So in the blue cluster, if I average the X coordinates, that's going to be somewhere in the middle, and if I average the Y coordinates, that's also going to be somewhere in the middle, so we end up with a centroid in the middle of that square. That's what I mean by centroid. That's all we really need to know about K-means, so now we can get to the algorithm itself. I'm going to go through an example with you so you can kinda see it progress s ... https://www.youtube.com/watch?v=MzV6Q8Y-tus
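
To make the centroid idea concrete, here is a minimal Python sketch, not code from the video. It averages the X and Y coordinates of each cluster to get its centroid, and then computes the within-cluster sum-of-squares mentioned at the start, taken here to mean the sum of squared distances from each point to its own cluster's centroid. The function names and the data points are hypothetical. The points are the corners of two squares, chosen only to resemble the blue and red clusters in the figure.

```python
from statistics import mean


def centroid(points):
    """Return the centroid of 2-D points: (mean of x values, mean of y values)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (mean(xs), mean(ys))


def within_cluster_sum_of_squares(clusters):
    """Sum the squared distances from every point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        cx, cy = centroid(points)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return total


# Hypothetical data: the corners of two squares, standing in for the
# blue and red clusters in the figure.
blue = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
red = [(5.0, 5.0), (5.0, 7.0), (7.0, 5.0), (7.0, 7.0)]

blue_centroid = centroid(blue)  # lands in the middle of the blue square
red_centroid = centroid(red)    # lands in the middle of the red square
wcss = within_cluster_sum_of_squares([blue, red])

print("blue centroid:", blue_centroid)
print("red centroid:", red_centroid)
print("within-cluster sum-of-squares:", wcss)
```

Each cluster here is kept as its own list, so every point belongs to exactly one cluster, which matches the disjoint-clusters requirement. Tighter clusters give a smaller within-cluster sum-of-squares.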