The example
Imagine your company has a network of 17 warehouses supplying processors to your vendors.

You are fully restructuring your supply chain:
- 3 or 4 new technology plants will be built for the new generation of processors
- Deliveries will be made by drones instead of trucks
- Each plant will have its own drone airbase
- A drone station will be built at each warehouse

Image credit: AMD Ryzen 5 2600, Fritzchens Fritz, CC0, via Wikimedia Commons (https://commons.wikimedia.org/wiki/File:AMD_Ryzen_5_2600_(39851733273).jpg)

Image credit: https://pxhere.com/en/photo/1323677

Image credit: Heliport Niagara Falls Ontario, from Zwergelstern, via Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Heliport_Niagara_Falls_Ontario.jpg)
Each day of the year, each warehouse will be supplied by a drone.
The transport cost calculation has been simplified: it depends only on the distance between the supplying plant and the receiving warehouse, since the drone flies in a straight line and the modest weight of the processors can be dropped as a parameter of the cost estimation.
The main question
You know that you have a budget for several plants. You can build 3 plants without a loan, and you need at least 3 plants to avoid a shortage. With a loan, your company could build 4 plants.
Which is more attractive for minimizing transportation costs: building 3 plants or 4?
Naive solving with 4 plants
An educated guess is to determine 4 clusters of warehouses: given the constraint of drawing 4 clusters, my own subjective answer is to draw them as follows. Although it is subjective, I believe it is the best answer.

Then I place the plants, visually and very approximately, at the center of gravity of each cluster.

According to the simplification above, the cost of transportation will be proportional to the sum of all distances between the supplying plants and the receiving warehouses:

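As a minimal sketch of this computation (with hypothetical coordinates for 3 warehouses served by one plant), the cost is just a sum of straight-line distances:

```{r}
# hypothetical coordinates: 3 warehouses served by one plant at (1, 5)
wh_x <- c(0, 1, 2)
wh_y <- c(5, 6, 5)
plant_x <- 1
plant_y <- 5
# transport cost, proportional to the sum of straight-line distances
sum(sqrt((wh_x - plant_x)^2 + (wh_y - plant_y)^2))
```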
Naive solving with 3 plants
Let’s imagine there is no possibility of a loan: where should the 3 plants be placed?
Where are the 3 clusters on the map below?

Now we have an issue: we are not sure we can find the best 3 clusters, even subjectively.
First proposition:

Second proposition:

Third proposition:

Other propositions could be guessed. The point of this example is that, even subjectively, there is no obvious answer.
There is at least some good news: for each proposition, we can measure the distances between the plants and the warehouses and sum them. Then we compare the sums, and the winner is the proposition with the smallest sum (meaning the smallest global transport cost).
And there remains some bad news: it is very tricky to find the best 3 clusters, and an efficient way to define them, instead of multiplying propositions, would be a relief.
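As a sketch of this comparison (hypothetical positions, reusing the coordinates of the sketch above), we can wrap the sum in a helper and score each proposition; the smallest sum wins:

```{r}
# hypothetical helper: total distance for one proposition, where
# pl_x / pl_y are the coordinates of the plant serving each warehouse
total_distance <- function(wh_x, wh_y, pl_x, pl_y) {
  sum(sqrt((wh_x - pl_x)^2 + (wh_y - pl_y)^2))
}
# proposition 1: every warehouse served by a plant at (1, 5)
d1 <- total_distance(wh_x, wh_y, 1, 5)
# proposition 2: every warehouse served by a plant at (2, 6)
d2 <- total_distance(wh_x, wh_y, 2, 6)
c(d1, d2)  # the proposition with the smaller sum wins
```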
The theory of K-means clustering
In our example, the k-means clustering problem was to partition the 17 warehouses into 3 clusters, in which each warehouse belongs to the cluster whose center is nearest.
In clustering vocabulary, a warehouse is called an observation.
Let’s rephrase in generic terms:
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
https://en.wikipedia.org/wiki/K-means_clustering
Please note that mean is a more generic term for the center of gravity. It implies that every observation has the same weight when computing the center of its cluster.
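As a tiny illustration (hypothetical points), the mean of a cluster is simply the coordinate-wise average of its observations:

```{r}
# a hypothetical cluster of 3 observations: its centroid (center of
# gravity with equal weights) is the column-wise mean
cluster_points <- matrix(c(0, 1, 2,   # x coordinates
                           5, 6, 5),  # y coordinates
                         ncol = 2)
colMeans(cluster_points)  # centroid: (1, 5.33)
```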
In an abstract way, the data science problem is the following: how to find 3 clusters for these 17 observations?

Complexity of solving K-means clustering
There is no analytic calculation for finding the clusters: no closed-form formula provides the perfect solution.
The solution is usually found through an iterative process: at each iteration you may get closer to the perfect solution, and at some iteration you may consider that you won’t progress any further and that it is time to stop.
Finding the exact optimum is an NP-hard problem, which means the solving time does not increase proportionally to the number of observations: it increases much faster, at least like a high-order polynomial function.
As an illustration, let’s compare 3 functions, with polynomial order 2, 3, and 4:

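A quick sketch to reproduce this comparison in base R:

```{r}
# plot x^2, x^3 and x^4 on the same axes to compare their growth
curve(x^2, from = 0, to = 10, xlab = "number of observations", ylab = "complexity")
curve(x^3, add = TRUE, lty = 2)
curve(x^4, add = TRUE, lty = 3)
legend("topleft", legend = c("x^2", "x^3", "x^4"), lty = 1:3)
```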
Applied to K-means cluster solving, the information is:

Indeed, we don’t know whether the dominating polynomial is x^2, x^3, x^4, ..., x^15, ..., x^258, ...; just note that the complexity increases much more quickly than the number of observations.
In conclusion, when there are too many observations, you certainly cannot enumerate all the possible clusterings and sum all the distances!
The classical K-means clustering algorithm
There is no single algorithm for solving K-means clustering.
We are studying the classical one, which is easy to understand and hence to implement, and which is considered fairly efficient.
Let’s apply it to our 17 warehouses and 3 plants.
Initialization
We will apply the algorithm graphically, and unhappily the centers of gravity will be determined only visually and approximately, in order to keep this walkthrough of the algorithm dynamic.
We start with 3 arbitrary places for the plants.
The 3 initial plants have been set up:

Assignment
For each warehouse, we find the closest plant:

We have found 3 clusters:

The orange ellipses are very imprecise.
Hence, it is a little better to estimate the center of gravity of each cluster. Yes, with visual estimation this is very imprecise and may lead to some mistakes; the solution with R programming will surely be more precise.
The new centroids are drawn in orange, while the old plants fade to a lighter orange.

And now let’s check whether the clusters have been modified: please note that one warehouse has changed cluster (highlighted with a lighter blue, in the south).

Let’s draw the new clusters: note that the north-east one has kept the same warehouses:

Graphically again, we determine the center of gravity of the clusters, based on the warehouses:

Again, one warehouse (in light blue) has switched clusters:

The new clusters are:

Let’s draw the new centers of gravity:

The clusters are unchanged:

No change: it was the final step of this graphically applied algorithm.
Our imprecise graphical final solution is:

Remark: there are initialization algorithms (such as k-means++) that provide a good starting point. The better the starting point, the quicker and/or the better the convergence, usually.
Classical algorithm description
As a summary, the algorithm is:
- Step 1: Initialize the plants
- Step 2: For each warehouse, find the closest plant
- Step 3: Group the warehouses into clusters according to their assigned plant
- Step 4: Calculate the new center of gravity of each cluster
- Step 5: Move each plant to the new center of gravity of its cluster
- Repeat steps 2 to 5 until the clusters no longer change, or until the improvement is no longer significant and you don’t want to spend more time calculating (a minimal R sketch of this loop follows below).
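Here is a minimal R sketch of this loop, as my own illustration (later in the article we use R’s built-in kmeans function instead; this sketch uses a plain random initialization and does not handle empty clusters):

```{r}
# sketch of the classical (Lloyd's) algorithm
my_kmeans <- function(coords, k, max_iter = 100) {
  coords <- as.matrix(coords)
  # Step 1: initialize the plants at k randomly chosen warehouses
  centers <- coords[sample(nrow(coords), k), , drop = FALSE]
  assignment <- integer(nrow(coords))
  for (i in seq_len(max_iter)) {
    # Step 2: distance from each warehouse to each plant
    d <- sapply(seq_len(k), function(j) sqrt(rowSums(sweep(coords, 2, centers[j, ])^2)))
    # Step 3: each warehouse joins the cluster of its closest plant
    new_assignment <- max.col(-d)
    if (identical(new_assignment, assignment)) break  # no change: stop
    assignment <- new_assignment
    # Steps 4 and 5: move each plant to the center of gravity of its cluster
    for (j in seq_len(k))
      centers[j, ] <- colMeans(coords[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers)
}
# e.g. my_kmeans(cbind(x, y), 3) once the coordinates are defined
```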
Conclusion
The graphical and visual solution at least seems reasonable. But what was the impact of not determining the centers of gravity accurately?
Application with R programming
First, we create a table with the coordinates of the warehouses:

The coordinates of the warehouses (WH) are:
| WH ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 7 |
| Y | 5 | 6 | 5 | 4 | 2 | 1 | 2 | 0 | 7 | 7 | 5 | 2 | 6 | 5 | 3 | 1 | 6 |
```{r}
# We call the required packages first, as data.table() is used right away
library(data.table)
library(ggplot2)

# Let's write these warehouse coordinates
wh <- 1:17
x <- c(0, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7)
y <- c(5, 6, 5, 4, 2, 1, 2, 0, 7, 7, 5, 2, 6, 5, 3, 1, 6)
warehouse <- data.table(wh, x, y)
warehouse
```
```{r}
# Now we plot the warehouses
# ggplot(my_datatable, aes(my_abscissa, my_ordinate))
# geom_point: draw each row as a point
# coord_fixed: keep the same scale on both axes, to compare with the original map
ggplot(warehouse, aes(x, y)) + geom_point() + coord_fixed()
```


We have correctly displayed the 17 warehouses in R.
Let’s run the kmeans function from base R. No need for a dedicated package such as ClusterR.
```{r}
# in order to get reproducible calculations
set.seed(20)
# run the k-means function from base R with 3 clusters on the 2nd and
# 3rd columns: we remove the first column, as it is the warehouse ID,
# not a coordinate
clusters <- kmeans(warehouse[, 2:3], 3)
# print the result
clusters
```

What is interesting in the result?

- The first cluster center has abscissa 1.25 and ordinate 5.00
- The second cluster center has abscissa 3.86 and ordinate 1.57
- The third cluster center has abscissa 5.50 and ordinate 6.00

- Warehouses 1 to 4 belong to the first cluster
- Warehouses 5 to 8, 12, 15, and 16 belong to the second cluster
- Warehouses 9, 10, 11, 13, 14, and 17 belong to the third cluster
How do we fetch these strategic data in R? By accessing the cluster and centers components of the object clusters returned by kmeans:
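For example (cluster and centers are standard components of the object returned by kmeans):

```{r}
# the vector of cluster assignments, one entry per warehouse
clusters$cluster
# the matrix of cluster center coordinates
clusters$centers
```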

Let’s add the corresponding cluster to the warehouse data table:
```{r}
# add the cluster assignment returned by kmeans to each warehouse
warehouse <- warehouse[, cluster := clusters$cluster]
View(warehouse)
```

Let’s draw the clusters:
```{r}
# we convert the cluster to character, in order to get a discrete
# color scale with clearly distinct colors
warehouse <- warehouse[, cluster := as.character(cluster)]
# we color the points in the graph according to the cluster
ggplot(warehouse, aes(x, y)) + geom_point(aes(color = cluster)) + coord_fixed()
```

```{r}
# create the cluster centers as a data table
my_clusters <- as.data.table(clusters$centers)
View(my_clusters)
```

```{r}
# add the 3 cluster centers as big black points
ggplot(warehouse, aes(x, y)) + geom_point(aes(color = cluster)) +
  geom_point(data = my_clusters, aes(x = x, y = y), colour = "black", size = 5) +
  coord_fixed()
```

Let’s compare the graphical and R solutions:
Was the graphical application precise enough?


The R solution with the map:

Indeed, the graphical solution was wrong for one warehouse, and the position of one centroid is perfectible:

Actually, I’m surprised that the two top clusters are quite correct!
How effective is a four-plant solution compared to a three-plant solution?
We choose as our metric the average distance between each warehouse and its supplying plant.
```{r}
# we create a column filled with the row number: as row 1 contains
# cluster 1, it is logical to give it the ID 1
my_clusters <- my_clusters[, cluster := .I]
View(my_clusters)
```

Let’s create a big data table which will contain both the warehouse coordinates and the cluster coordinates:


```{r}
# to be consistent, we define the cluster ID as character
my_clusters <- my_clusters[, cluster := as.character(cluster)]
# we rename the coordinates of my_clusters to avoid duplicate
# column names in the upcoming merge (setnames is the data.table way)
setnames(my_clusters, c("xx", "yy", "cluster"))
View(my_clusters)
```

We join the 2 data tables, with the column "cluster" as the join condition:
```{r}
extended_warehouse <- merge(x=warehouse, y=my_clusters, by = "cluster")
View(extended_warehouse)
```

```{r}
# compute the Euclidean distance between the warehouse (x, y)
# and its plant (xx, yy)
extended_warehouse <- extended_warehouse[, distance := sqrt((x - xx)^2 + (y - yy)^2)]
View(extended_warehouse)
```

```{r}
# the average distance between a warehouse and its supplying plant
my_mean <- mean(extended_warehouse$distance)
my_mean
```

Now let’s do the calculation for 4 plants. You can reuse the whole code above; just replace 3 by 4 inside the kmeans call.
```{r}
# The calculation with 4 plants
# in order to get reproducible calculations
set.seed(20)
# run the k-means function with 4 clusters on the coordinate columns
# (we again exclude the first column, the warehouse ID)
clusters <- kmeans(warehouse[, 2:3], 4)
# print the result
clusters
```
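To avoid copy-pasting, the whole distance pipeline can also be wrapped in a small helper function; this is just a sketch of my own, recomputing everything from the coordinates:

```{r}
# sketch: run kmeans with k clusters and return the average
# warehouse-to-plant distance
avg_distance <- function(dt, k, seed = 20) {
  coords <- as.matrix(dt)
  set.seed(seed)  # reproducible calculations
  km <- kmeans(coords, k)
  centers <- km$centers[km$cluster, ]  # the plant serving each warehouse
  mean(sqrt(rowSums((coords - centers)^2)))
}
avg_distance(warehouse[, .(x, y)], 3)
avg_distance(warehouse[, .(x, y)], 4)
```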

The average distance is:

Indeed:
| | 3 clusters | 4 clusters | comparison from 3 to 4 clusters |
|---|---|---|---|
| average distance (in an arbitrary unit) | 1.415 | 1.054 | -26% |
| cost of plant building (an arbitrary amount of 1 per plant) | 3 | 4 | +33% |
Please keep in mind that, for our business example, the -26% cannot be compared directly with the +33%: the cost of building a plant is a one-shot expense over the lifetime of the plants, while the cost of transport (proportional to the average distance) accrues over time. We should also take into account the overheads of the plants and of the drone logistics.
Conclusion
We have seen a very analytic application. Keep in mind that K-means also has many other applications.
For example, suppose you have 17 VIP customers and have collected data on their gross sales and numbers of employees. The marketing team could ask you to identify the different types among these customers.
Hence, run k-means several times with different numbers of clusters, and estimate the most relevant number of types visually (or with a metric, such as the one sketched at the end of this article):





How many types seem to fit best?
From my point of view, visually, 4. Of course, this is quite subjective, and it is important to adapt the answer to the business context. For example, the customer in the south-east could be considered an outlier and even dropped by the marketing team, so as not to spend time on this atypical customer.
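As a side note on the "which metric?" question: a common choice is the total within-cluster sum of squares returned by kmeans (tot.withinss), plotted against the number of clusters; the "elbow" of the curve suggests a reasonable number. A sketch, using our warehouse coordinates as stand-in data:

```{r}
# "elbow" method sketch: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  set.seed(20)
  kmeans(warehouse[, .(x, y)], k)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
```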