K-means is an unsupervised partitioning algorithm used to organize unlabeled data into groups, or clusters. What does this mean? Suppose we have data with two continuous columns: the length and width of the bills of two different bird species. We don’t know which species each bird belongs to, but we want to create two groups because we suspect that one species is, in general, slightly larger than the other. Each “group” is a cluster.
With K-means, each cluster is defined by a central point called a centroid. The centroid sits at the center of its cluster, that is, at the mathematical mean of the points assigned to it. Hence the name K-means.
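To make the idea of a centroid concrete, here is a minimal sketch that computes one as the per-coordinate mean of the points in a cluster. The measurements are made-up illustrative values, not real bird data, and `centroid` is a hypothetical helper name.

```python
# Hypothetical (bill length, bill width) measurements for one cluster.
cluster_points = [(10.0, 4.0), (12.0, 5.0), (14.0, 6.0)]


def centroid(points):
    """Return the per-coordinate mean of a list of equal-length tuples."""
    n = len(points)
    # zip(*points) groups all first coordinates together, then all second ones.
    return tuple(sum(coords) / n for coords in zip(*points))


center = centroid(cluster_points)
print(center)  # (12.0, 5.0)
```

The centroid does not have to coincide with an actual data point; it is simply the average position of the cluster's members.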
Steps to Building a K-Means Model
Building a K-means model takes four steps. In brief:
- choose the number of centroids (K) and place them in the data space
- assign each data point to its nearest centroid
- recalculate the centroid of each cluster
- repeat the second and third steps until the algorithm converges, that is, until the cluster assignments no longer change
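The four steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under several assumptions, not a production implementation: the initial centroids are K data points picked at random, distance is Euclidean, convergence means the assignments stop changing, and the function name `kmeans`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: choose K and place the centroids (assumption: K random data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, labels


# Made-up data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label values.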
Strictly speaking, K-means is a partitioning algorithm, although data professionals typically talk about it as a clustering algorithm. The difference is that in clustering algorithms, outlying points can exist outside of the clusters, whereas in partitioning algorithms, every point must be assigned to a cluster. In other words, K-means does not allow unassigned outliers: every point in the unlabeled data ends up in one of the K clusters, grouped by similarity.
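The partitioning behavior can be seen in a small sketch of the assignment rule. The centroid positions and the outlier are made-up values, and `nearest_centroid` is a hypothetical helper. However far a point lies from every centroid, the rule still returns a cluster index for it.

```python
import math

# Hypothetical centroids of two already-fitted clusters.
centroids = [(1.0, 1.0), (8.0, 8.0)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


outlier = (100.0, -50.0)  # far away from both centroids
label = nearest_centroid(outlier, centroids)
print(label)  # 1: the outlier is still assigned to a cluster
```

There is no "unassigned" outcome in this rule, which is why an extreme point can pull a centroid toward itself when the centroids are recalculated.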
Clustering items together in a scatterplot is an example of the Gestalt principle of proximity.