Customer Segmentation for Private Equity Due Diligence
Table of Contents:
Overview
Problem Statement
Methodology
Data Preparation
LCA Modelling Approach
Output Preparation
Key Learnings
Keywords
Overview
Customer segmentation is frequently used in due diligence projects to uncover and interpret distinct consumer groups that can guide marketing, pricing, and growth strategies. This summary outlines an example project conducted for a private equity client evaluating a dental clinic chain in Italy, based on data collected through a structured survey.
Latent Class Analysis (LCA) was selected as the modelling method because it is well-suited to survey data, particularly when the dataset includes a mix of categorical, ordinal, and continuous features. Unlike K-Means, which is a distance-based algorithm that works best with purely numerical data, LCA is model-based and assigns respondents to clusters based on probabilistic estimates. K-Means is typically reserved for other clustering use cases, such as lookalike analysis, where the goal is to identify non-buyers who closely resemble buyers in order to target the highest-potential prospects and expand the customer base.
Problem Statement
The client, a private equity firm evaluating investment in a dental clinic chain, required a segmentation of customers to better understand their behaviours, needs, and demographic profiles. The goal was to identify meaningful and actionable customer groups that could support the commercial due diligence process and provide insights to guide future marketing, pricing, and service design decisions.
The primary input was a large survey dataset. It covered a wide array of dimensions including demographics (e.g., age, income, employment), behaviours (e.g., frequency of visits, types of procedures), needs (e.g., preferences and key purchase criteria), and KPIs (e.g., visits and spend in the past year). The desired output was a respondent-level segment assignment and a high-level profile of each segment.
Methodology
1. Data Preparation
The first step involved organizing the survey variables into two categories: drivers and profilers. Drivers, such as behavioural habits and purchasing preferences, were used as inputs for clustering. Profilers, such as age, gender, and income, were excluded from the clustering process and instead used to describe and label segments post hoc. The dataset was cleaned and processed in Python, including necessary encoding for categorical features and appropriate handling of missing values.
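As a minimal sketch of this preparation step (column names, fill rules, and values are illustrative, not from the actual project), splitting drivers from profilers, handling missing values, and encoding categorical drivers could look like:

```python
import pandas as pd

# Hypothetical raw survey extract; column names are illustrative only.
raw = pd.DataFrame({
    "visit_freq": ["monthly", "yearly", None, "monthly"],   # driver (categorical)
    "satisfaction": [4, 5, 3, None],                        # driver (ordinal)
    "age": [34, 51, 28, 45],                                # profiler - kept aside
})

# Drivers feed the clustering model; profilers are set aside for post-hoc labelling.
drivers = raw[["visit_freq", "satisfaction"]].copy()
profilers = raw[["age"]].copy()

# Handle missing values: mode for categoricals, median for ordinal scales.
drivers["visit_freq"] = drivers["visit_freq"].fillna(drivers["visit_freq"].mode()[0])
drivers["satisfaction"] = drivers["satisfaction"].fillna(drivers["satisfaction"].median())

# One-hot encode categorical drivers for modelling.
drivers = pd.get_dummies(drivers, columns=["visit_freq"])
```

The key design choice is that profilers never enter the model input, so segment differences in age or income later serve as an independent sanity check on the clusters.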
One of the key inputs used in segmentation was a MaxDiff (Maximum Difference Scaling) question. In this method, respondents are repeatedly shown sets of attributes and asked to choose the most and least important attribute in each set. This helps to reveal the relative importance of attributes for each respondent in a more reliable way than standard rating scales. For the model, these responses were converted into individual-level importance scores and included as drivers.
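A simple way to derive such individual-level scores is the count-based approach: (times chosen best − times chosen worst) divided by times shown. A sketch with made-up responses (the attribute names and long-format layout are assumptions, not the project's actual schema):

```python
import pandas as pd

# Hypothetical long-format MaxDiff responses: one row per (respondent, task, attribute),
# with flags for whether the attribute was picked as best or worst in that task.
tasks = pd.DataFrame({
    "respondent": [1, 1, 1, 1, 2, 2, 2, 2],
    "attribute":  ["price", "quality", "speed", "trust"] * 2,
    "best":       [1, 0, 0, 0, 0, 1, 0, 0],
    "worst":      [0, 0, 1, 0, 0, 0, 0, 1],
})

# Count-based score per respondent and attribute: (#best - #worst) / #times shown.
g = tasks.groupby(["respondent", "attribute"])
scores = ((g["best"].sum() - g["worst"].sum()) / g.size()).unstack()
```

The resulting wide table (one row per respondent, one column per attribute, values in [-1, 1]) can then be appended to the driver set alongside the other encoded features.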
2. LCA Modelling Approach
The segmentation was conducted using an internal LCA model, supported by a Streamlit interface that helps speed up repetitive steps - especially useful in fast-paced projects that involve frequent iteration and client interaction. The number of clusters was determined through iteration and qualitative feedback, based on business relevance and how well the segments aligned with real-world patterns. While statistical tools like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are traditionally used to select the number of clusters, in this case business logic and interpretability were prioritized - particularly as segments needed to be actionable and intuitively understood by the client.
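For illustration of the statistical selection route that was de-prioritized here, the candidate-model loop can be sketched with scikit-learn's GaussianMixture as a stand-in for the internal LCA model (synthetic data, not project data; lower BIC is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic continuous driver data with three underlying groups.
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (0, 3, 6)])

# Fit candidate models with 1-5 components and compare BIC values.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

In practice this gives a statistical starting point, which is then overridden or confirmed by the interpretability criteria described above.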
LCA assumes the presence of a latent variable that explains the patterns in observed responses. It operates using the Expectation-Maximization (EM) algorithm, which alternates between estimating the likelihood of each respondent belonging to each segment (E-step) and updating the segment definitions based on these assignments (M-step). The process continues until the model converges.
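The EM loop described above can be sketched for the simplest LCA case - binary survey items - using only NumPy (synthetic data; the internal model is more general than this):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate binary survey responses from 2 latent classes (synthetic data).
n, k, d = 300, 2, 6
true_theta = np.array([[0.9] * 3 + [0.1] * 3,
                       [0.1] * 3 + [0.9] * 3])  # per-class endorsement probs
z = rng.integers(0, k, n)                        # true (hidden) class labels
X = (rng.random((n, d)) < true_theta[z]).astype(float)

# Initialise class priors and per-class item endorsement probabilities.
pi = np.full(k, 1.0 / k)
theta = rng.uniform(0.3, 0.7, (k, d))

for _ in range(100):
    # E-step: posterior probability of each class for each respondent.
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # (n, k)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)                 # stabilise
    resp = np.exp(log_post)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: update priors and item probabilities from soft assignments.
    pi = resp.mean(axis=0)
    theta = (resp.T @ X) / resp.sum(axis=0)[:, None]
    theta = np.clip(theta, 1e-6, 1 - 1e-6)

# Hard segment assignment: most probable class per respondent.
segments = resp.argmax(axis=1)
```

The soft posterior probabilities (`resp`) are what distinguish LCA from K-Means: each respondent carries a probability of membership in every segment, and the hard assignment is only taken at the end.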
3. Output Preparation
The final stage of the segmentation process involved generating two key outputs. The first was a respondent-level dataset with segment assignments. The second was a segment profile matrix showing how each segment differed in terms of key profilers and drivers. Differences were shown in terms of over- or under-representation. For continuous variables, this meant identifying values that deviated by 5-10% or more from the overall mean. For categorical variables, it involved comparing the share of a specific category within the segment to its share in the total population - for example, if children made up only 5% of the total sample but accounted for 15% of a particular segment, that segment was flagged as over-indexing on children. Colour coding was used to visually highlight these variations, which helped guide interpretation.
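The over-/under-representation logic can be sketched as an index of each segment's share against the total population (hypothetical data; an index of 100 means the segment matches the overall share):

```python
import pandas as pd

# Hypothetical respondent-level data: segment assignment plus one binary profiler.
df = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "has_children": [1, 1, 0, 0, 0, 1, 0, 0],
})

overall = df["has_children"].mean()                       # share in total sample
by_segment = df.groupby("segment")["has_children"].mean() # share per segment

# Representation index: >100 = over-represented, <100 = under-represented.
index = (by_segment / overall * 100).round(0)
```

In the actual profile matrix this index was computed for every profiler and driver, and colour coding was applied to cells far from 100 to guide interpretation.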
These materials were then used in working sessions with the client to validate and label each segment. In some cases, generative AI tools were used to draft preliminary segment descriptions, helping accelerate the interpretation process.
Key Learnings
The importance of validating data assumptions manually when something seems off in the output. One anticipated segment (focused on dental emergencies) did not appear in the model output. A manual investigation of two key questions revealed low correlation and inconsistent respondent behaviour. Some respondents claimed they experienced emergencies but did not report visiting when in pain, and vice versa. This highlighted a data quality issue rather than a modelling flaw.
Understanding the distinction between drivers and profilers. Drivers should be limited to variables that directly inform customer needs and behaviours to form the clusters, while profilers are used only after segmentation to interpret and label the resulting segments. Making this distinction early ensures more coherent segments and smoother interpretation downstream. For example, demographic variables like age and gender are typically used as profilers rather than drivers, because they are broad descriptors that do not reflect underlying behavioural or needs-based differences that drive clustering.
Iterative collaboration with the client was essential for refining the list of drivers and profilers and validating segment usefulness.
Exposure to technical hard skills including:
Applying Latent Class Analysis models on real-world survey data
Processing and encoding mixed-type survey data (categorical, ordinal, continuous)
Structuring and visualizing results through segment profile matrices
Keywords
Tools & Technologies: Python, Pandas, NumPy, Docker, Streamlit, Git, GitHub, VS Code, Scikit-learn, MS Excel
Tags: Latent Class Analysis (LCA), Expectation-Maximization (EM) Algorithm, K-means, Drivers and Profilers, MaxDiff, Cluster Labelling, Lookalike Analysis, Unsupervised Learning, Machine Learning, Gen-AI