This Data Science for Water Utilities chapter implements cluster analysis to segment customers using hierarchical clustering and k-means.

Clustering Customers to Define Segments

Peter Prevos | 10 May 2023
Last Updated | 9 June 2025

The ideal form of customer service is personal attention that meets the needs of each individual. Unfortunately, this level of service is usually impossible or too costly to achieve. Service providers therefore segment their customers into groups with similar characteristics. Cluster analysis is a commonly used segmentation technique that divides an unlabelled dataset into groups of observations with similar properties. This chapter of Data Science for Water Utilities shows how to detect patterns and define segments in customer data. The learning objectives for this chapter are:

Understand the principles of customer segmentation
Apply and interpret hierarchical cluster analysis
Apply and interpret k-means clustering

Data Science for Water Utilities

Data Science for Water Utilities published by CRC Press is an applied, practical guide that shows water professionals how to use data science to solve urban water management problems using the R language for statistical computing.

The data and code used in this chapter are available on GitHub:

Principles of Customer Segmentation

Ideally, a customer-centric service gives each customer individual attention. For large organisations this is very costly, yet treating everybody the same serves customers poorly. Customer segmentation offers a middle ground by grouping customers into segments with similar needs. Segments are commonly based on four types of characteristics:

Demographic: Age, gender, income, education, ethnicity.
Behavioural: Purchasing habits, spending habits, water consumption.
Psychographic: Interests, lifestyle, motivations, and water-related priorities.
Geographic: Town, postal code, water system.

Hierarchical Cluster Analysis

This example contains data from ten hypothetical customers (A–J). The first dimension in the test data is the average annual water consumption, and the second is the size of the land on which the house stands. The clusters should be easy to see in a scatter plot of these two variables.

Hierarchical cluster analysis is a deterministic method to find the relevant clusters. It compares every pair of data points, so the required computation and memory grow rapidly with the number of observations, which can be problematic for large data sets. A tree diagram (dendrogram) shows how all the points in the chart relate to each other.
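The steps described above can be sketched in base R. This is a minimal illustration, not the code from the book: the ten data values are invented, and the choices to standardise the variables, use Euclidean distance and apply complete linkage are assumptions made for this example.

```r
# Hypothetical data for ten customers (A-J). These values are invented for
# illustration and are NOT the data used in the book.
customers <- data.frame(
  consumption = c(110, 125, 95, 130, 105, 310, 290, 335, 280, 320),   # kL per year
  land_size   = c(350, 400, 300, 420, 380, 900, 850, 1000, 800, 950), # square metres
  row.names   = LETTERS[1:10]
)

# Standardise both variables so that neither dominates the distance measure
customers_scaled <- scale(customers)

# Euclidean distances between every pair of customers
d <- dist(customers_scaled)

# Agglomerative clustering with complete linkage (the hclust default)
hc <- hclust(d, method = "complete")

# Draw the dendrogram
plot(hc, main = "Hypothetical customers", xlab = "", sub = "")

# Cut the tree into two groups and show the segment of each customer
segments <- cutree(hc, k = 2)
print(segments)
```

Because `dist()` stores one value for each pair of observations, ten customers yield 45 distances. The same calculation on a large customer base quickly becomes expensive, which is the limitation noted above.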

k-means Clustering

The k-means method uses a stochastic approach: it starts from randomly chosen cluster centres, so the outcome can differ between runs when the membership of some points is ambiguous. However, this method can handle much larger data sets than the hierarchical method. Another difference from the hierarchical method is that in k-means, you need to specify the number of clusters before the analysis starts.
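A minimal k-means sketch in base R might look as follows. The data are the same invented values for ten hypothetical customers, not the book's data, and the seed, the number of clusters and the `nstart` value are assumptions chosen for illustration.

```r
# Hypothetical data for ten customers (A-J); invented values, not the book's data
customers <- data.frame(
  consumption = c(110, 125, 95, 130, 105, 310, 290, 335, 280, 320),   # kL per year
  land_size   = c(350, 400, 300, 420, 380, 900, 850, 1000, 800, 950), # square metres
  row.names   = LETTERS[1:10]
)
customers_scaled <- scale(customers)

# Fix the random number generator so that repeated runs give the same result
set.seed(42)

# The number of clusters (centers) must be chosen before the analysis.
# nstart = 25 repeats the algorithm from 25 random starts and keeps the best fit.
km <- kmeans(customers_scaled, centers = 2, nstart = 25)

print(km$cluster)  # cluster membership of each customer
print(km$centers)  # cluster centres in standardised units
```

Setting a seed and using several random starts are common ways to make the stochastic outcome more stable. Note that the cluster numbers themselves are arbitrary labels.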

The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters. The WCSS tends to fall as clusters are added, so the point where the curve bends most sharply (the elbow) is most likely the ideal number of clusters.
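The elbow plot can be sketched by running k-means for a range of cluster counts and collecting the total within-cluster sum of squares. The data are again invented for illustration, and the range of one to six clusters is an arbitrary assumption.

```r
# Hypothetical data for ten customers (A-J); invented values, not the book's data
customers <- data.frame(
  consumption = c(110, 125, 95, 130, 105, 310, 290, 335, 280, 320),   # kL per year
  land_size   = c(350, 400, 300, 420, 380, 900, 850, 1000, 800, 950), # square metres
  row.names   = LETTERS[1:10]
)
customers_scaled <- scale(customers)

set.seed(42)
k_values <- 1:6

# Total within-cluster sum of squares for each candidate number of clusters
wcss <- sapply(k_values, function(k) {
  kmeans(customers_scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot WCSS against the number of clusters and look for the sharpest bend
plot(k_values, wcss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")
```

With these invented values the WCSS drops steeply from one to two clusters and flattens afterwards, so the elbow sits at two clusters. Real customer data rarely gives such a clear bend, so the choice often involves judgement.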

Interpreting Cluster Analysis

Cluster analysis methods are a form of unsupervised machine learning. The computer detects patterns in the data but cannot attach meaning to them. The results of a cluster analysis require human interpretation to link them to the context of the data.

In this simple example, we could name the two clusters households with a garden and households without one.
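One way to support this interpretation is to profile each cluster in the original units. The sketch below uses the same invented data for ten hypothetical customers and assumes a two-cluster hierarchical solution; the algorithm supplies the numbers, while the segment names remain a human judgement.

```r
# Hypothetical data for ten customers (A-J); invented values, not the book's data
customers <- data.frame(
  consumption = c(110, 125, 95, 130, 105, 310, 290, 335, 280, 320),   # kL per year
  land_size   = c(350, 400, 300, 420, 380, 900, 850, 1000, 800, 950), # square metres
  row.names   = LETTERS[1:10]
)

# Two-cluster solution from hierarchical clustering on standardised data
segment <- cutree(hclust(dist(scale(customers))), k = 2)

# Average consumption and land size per segment, in the original units
profile <- aggregate(customers, by = list(segment = segment), FUN = mean)
print(profile)
```

In this made-up data set one segment combines low consumption with small lots and the other combines high consumption with large lots, which is the kind of pattern an analyst might label as households without and with a garden.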

In reality, cluster analysis involves many more variables than these simplified examples. The book provides a small case study that also uses categorical data.

Cluster Analysis to Segment Customers Screencast

Chapter ten of Data Science for Water Utilities explains the principles of cluster analysis for customer segmentation in more detail. This screencast demonstrates how to undertake cluster analysis to segment customers using the code explained in the book.