Hierarchical Clustering Through an Example
In my earlier article, K-Means Clustering Through an Example, I took up the problem of an NGO that wants to find the 5 to 10 countries, from a list of 169, that are in the most dire need of aid. Today I would like to solve the same problem using hierarchical clustering.
Just to reiterate the problem statement: an NGO committed to fighting poverty in backward countries, by providing basic amenities and relief during disasters and natural calamities, has received a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in dire need of aid based on socio-economic and health factors.
I would highly recommend that you go through my article on K-Means to understand the solution thinking, data cleansing, exploratory data analysis and data preparation steps.
Here I would like to touch only upon the hierarchical modelling aspects, which replace the K-Means algorithm used in the previous article.
NOTE: The data and code for this article are fully available in my GitHub repo as a Jupyter notebook. You can look at it in detail and execute it to understand the complete flow.
In the notebook, steps 1 to 4 are all about data understanding, cleaning and preparation, which remain the same irrespective of the type of clustering we are aiming to work with. These steps have all been detailed in the K-Means Clustering article already mentioned.
Here I go directly to Step 5.
Hierarchical Clustering
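Here I use the SciPy library instead of scikit-learn, which I used in the earlier examples.
The three imports I need are shown below. The names plt and sns used later refer to matplotlib.pyplot and seaborn, which I assume are already imported from the earlier steps of the notebook: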
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
linkage is the routine that builds the hierarchy, and its method argument lets you choose the type of linkage, which I discussed recently in my article on types of linkages for hierarchical clustering. While deciding the linkage type, one has to keep in mind the size of the data on hand and the order of complexity of the computation required to arrive at the clusters. Of course, you also want as distinct a set of clusters as possible. This is the balance that has to be struck at this step.
The dendrogram routine in the scipy package helps you visualise the dendrogram created by the hierarchical model. The cut_tree routine helps in creating the clusters by cutting the dendrogram into the number of clusters you want to get.
As seen in the article on linkages, the single linkage model is created by just one line of code:
h_model_1 = linkage(country_scaled, method="single", metric='euclidean')
dendrogram(h_model_1)
plt.show()
Then the dendrogram obtained is:
Clearly, this dendrogram is neither clean nor easy to interpret. Single linkage takes the smallest distance between the points of two clusters as the measure of dissimilarity between them.
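To make the linkage criterion concrete, here is a minimal sketch in plain Python on two tiny, made-up clusters (the points are illustrative assumptions, not the country data). It computes the single linkage distance (smallest pairwise distance) and, for comparison, the complete linkage distance (largest pairwise distance) that is tried next.

```python
from itertools import product
from math import dist  # Euclidean distance, available from Python 3.8

# Two hypothetical clusters of 2-D points (illustrative values only)
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

# All pairwise distances between a point in A and a point in B
pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]

single_distance = min(pairwise)    # single linkage: closest pair
complete_distance = max(pairwise)  # complete linkage: farthest pair

print(single_distance, complete_distance)  # 2.0 6.0
```

Because single linkage looks only at the closest pair, it tends to chain clusters together one point at a time, which is one plausible reason the dendrogram above is so hard to read.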
We then try complete linkage to see if we get a better dendrogram:
h_model_2 = linkage(country_scaled, method="complete", metric='euclidean')
dendrogram(h_model_2)
plt.show()
This creates a much cleaner dendrogram, and you can see at what level clear, distinct clusters are formed. You could choose to have 2, 3, 5 or even 6 clusters depending on your business case.
In the jupyter notebook, I have first decided to go with 3 clusters and use the cut_tree routine to achieve the same:
h_cluster_id = cut_tree(h_model_2, n_clusters=3).reshape(-1, )
h_cluster_id
Then, I assign the cluster id so obtained, to the country dataframe as seen here:
country_hier3 = country_df.copy()
country_hier3['cluster_id'] = h_cluster_id
country_hier3.head()
And if I were to count how many countries are in each of the clusters, I see this:
When I profile these clusters, I realise that cluster 0 is the one containing the poor nations that need aid, and there are 50 countries in it. That is not helpful: I cannot go back to the CEO and say that the money on hand has to be spread across 50 countries. No one would benefit from this.
Hence I now move to cut the tree for 5 clusters. This improves the numbers for me. How do I understand this?
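The five-cluster cut follows the same pattern as the three-cluster one. Below is a minimal sketch, assuming a small random stand-in for country_scaled and country_df (the real data lives in the notebook); the names h_cluster_id5 and cluster_sizes are mine and may differ from what the notebook uses.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)

# Stand-ins for the notebook's objects (assumption: 30 rows, 3 scaled features)
country_scaled = rng.normal(size=(30, 3))
country_df = pd.DataFrame({"country": [f"country_{i}" for i in range(30)]})

# Same complete linkage model as in the article
h_model_2 = linkage(country_scaled, method="complete", metric="euclidean")

# Cut the same tree into 5 clusters instead of 3
h_cluster_id5 = cut_tree(h_model_2, n_clusters=5).reshape(-1)

# Attach the labels to a copy of the dataframe
country_hier5 = country_df.copy()
country_hier5["cluster_id"] = h_cluster_id5

# Number of rows in each cluster
cluster_sizes = country_hier5["cluster_id"].value_counts().sort_index()
print(cluster_sizes)
```

Note that the model is not refitted: only the cut level changes, so moving between 3 and 5 clusters is cheap.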
Profiling
Let's have a look at some of the profiling steps.
I plot a scatter plot of the 5 clusters as shown here:
plt.figure(figsize = [15,10])
plt.subplot(2,2,1)
sns.scatterplot(x = "gdpp" , y = "child_mort", hue = 'cluster_id', data = country_hier5, palette = "Set1", legend = "full")
plt.subplot(2,2,2)
sns.scatterplot(x = "income" , y = "child_mort", hue = 'cluster_id', data = country_hier5, palette = "Set1",legend = "full")
plt.subplot(2,2,3)
sns.scatterplot(x = "income" , y = "gdpp", hue = 'cluster_id', data = country_hier5, palette = "Set1", legend = "full")
plt.show()
A whole host of countries, represented by the red dots, seem to have very low GDPP and high child mortality, similarly low income and high child mortality, and finally low income and low GDPP.
We can get another view by plotting a bar graph:
country_hier5.groupby('cluster_id')[['gdpp','child_mort','income']].mean().plot(kind = 'bar')
The scale of GDPP and income of the better-off countries is so large that the child mortality numbers are hardly visible. In spite of that, the child mortality bar is clearly visible for cluster 0. Hence that seems to be the cluster with the poorest nations.
However, let us check how many countries are part of cluster 0.
There are 38 of them. The 50 earlier have been split into cluster 0 with 38 and cluster 3 with 12.
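A clean split of this kind is what one would expect, because cutting the same tree at a finer level only subdivides existing clusters. Here is a small check of that property on synthetic data (random points, not the country data; the variable names are illustrative).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(1)
points = rng.normal(size=(40, 2))  # synthetic stand-in data

tree = linkage(points, method="complete", metric="euclidean")

# Passing a list returns one column of labels per requested cut
labels = cut_tree(tree, n_clusters=[3, 5])
labels3, labels5 = labels[:, 0], labels[:, 1]

# For every cluster of the 5-cut, collect the 3-cut labels of its members
parents = {
    int(c): {int(p) for p in labels3[labels5 == c]}
    for c in np.unique(labels5)
}
print(parents)

# Nested cuts: every fine cluster sits inside exactly one coarse cluster
is_nested = all(len(p) == 1 for p in parents.values())
print(is_nested)
```

On the country data, the same mapping would show which of the five clusters came out of the original cluster 0.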
Let us also get an idea of the spread and the median of the 5 clusters around GDPP, income and child mortality by plotting box plots:
plt.figure(figsize = [15,10])
plt.subplot(2,2,1)
sns.boxplot(x='cluster_id', y = 'gdpp', data = country_hier5 )
plt.subplot(2,2,2)
sns.boxplot(x='cluster_id', y = 'child_mort', data = country_hier5 )
plt.subplot(2,2,3)
sns.boxplot(x='cluster_id', y = 'income', data = country_hier5 )
plt.show()
It is clear that both the spread and the median of child mortality are high for the countries with the lowest income and GDPP.
Then, we can prioritise amongst these countries by sorting on child mortality, GDPP and income as they seem to be the indicators that we can choose for prioritisation:
country_hier5[country_hier5['cluster_id'] == 0]\
.sort_values(by = ['child_mort','gdpp', 'income'], ascending = [False,True,True])['country']
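To see how the mixed sort order behaves, here is a minimal sketch on a made-up table (countries A to D and their values are purely illustrative, not taken from the data set). Rows are ordered by child mortality descending, with ties broken by GDPP ascending and then income ascending.

```python
import pandas as pd

# Hypothetical toy rows; A and C tie on child mortality
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [90.0, 120.0, 90.0, 60.0],
    "gdpp": [500, 700, 400, 300],
    "income": [1200, 1500, 1100, 900],
})

# One ascending flag per sort key, in the same order as `by`
ranked = toy.sort_values(
    by=["child_mort", "gdpp", "income"],
    ascending=[False, True, True],
)

print(ranked["country"].tolist())  # ['B', 'C', 'A', 'D']
```

B comes first because it has the highest child mortality; C is placed ahead of A because, at equal child mortality, it has the lower GDPP. Appending .head(10) to the sorted result is one way to keep only the top ten.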
The top 10 list looks like this:
With this, we have some conclusions from the data and suggestions to present to the CEO on the utilisation of the funds.
Each of these pieces of code is simple and very easy to understand and execute in a Jupyter notebook. Do try it out yourself.
Wishing you a great time exploring the code and adding your own nuances to it.
Once again, the data and the code are available in my git repo at