Clustering consumer behavior with mixture models

Market Basket Analysis Framework

As part of our continued efforts to improve our market basket analysis framework, we’ve been experimenting with mixture models for clustering transactional datasets as a pre-processing stage for rule mining and association analysis. The goal is to group customers according to their purchase behavior, and then mine the association rules of each cluster separately.
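The overall pipeline is straightforward to sketch. The snippet below is a minimal illustration (not our production code), assuming a one-hot transaction table and per-customer cluster labels produced by whichever mixture model is in use; the names, thresholds, and the use of the mlxtend library are all placeholder assumptions:

```python
# Sketch: cluster customers first, then mine association rules within each
# cluster. `basket_df` (one-hot transactions, indexed by customer) and
# `labels` (customer -> cluster) are hypothetical inputs from the
# clustering stage; support/confidence thresholds are arbitrary.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def rules_per_cluster(basket_df: pd.DataFrame, labels: pd.Series) -> dict:
    """Mine association rules separately inside each customer cluster."""
    rules = {}
    for cluster_id, group in basket_df.groupby(labels):
        frequent = apriori(group, min_support=0.05, use_colnames=True)
        rules[cluster_id] = association_rules(
            frequent, metric="confidence", min_threshold=0.6
        )
    return rules
```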


Mixture models have become commonplace in machine learning as a sound statistical framework for unsupervised classification (a.k.a. clustering). The idea is to combine several probability distributions into a weighted combination that is, in turn, a probability distribution whose parameters can be estimated. The following figure shows a trained mixture model composed of three Gaussian distributions, each with its own mean and variance parameters, which can be estimated using the Expectation-Maximization (EM) algorithm.
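To make the estimation loop concrete, here is a minimal, self-contained EM sketch for a one-dimensional Gaussian mixture. The three-component setup, random initialization, and toy data are illustrative assumptions:

```python
# Minimal EM for a 1-D Gaussian mixture: alternate between computing
# per-point component responsibilities (E-step) and re-estimating the
# weights, means, and variances from them (M-step).
import numpy as np
from scipy.stats import norm

def em_gmm(x, k=3, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                   # mixing weights
    mu = rng.choice(x, size=k, replace=False)  # component means
    sigma = np.full(k, x.std())                # component std devs
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        dens = np.stack([pi[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(k)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=1)
        pi = nk / len(x)
        mu = (resp @ x) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return pi, mu, sigma

# Toy usage: three well-separated clusters of synthetic data.
x = np.concatenate([np.random.default_rng(1).normal(m, 1.0, 200) for m in (0, 5, 10)])
pi, mu, sigma = em_gmm(x)
```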


Working with categorical and count data 

The market basket structure present in many transactional datasets is categorical by nature, which means that employing continuous distributions (such as the Gaussian distribution) is inappropriate. Therefore, the components of our mixture model must be discrete, categorical distributions. Mixtures of categorical or multinomial distributions have been successfully applied in many areas, particularly in document analysis, a field known in the research literature as topic modeling. Many topic models assume that the frequencies of words in a document follow a multinomial distribution whose parameters are determined by a particular topic/cluster: by examining a corpus of documents in the bag-of-words format (a map that counts the number of times a given word appears in the document), topic models are capable of inferring the word distributions pertaining to each topic, and of classifying documents accordingly.
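The document/basket analogy is easy to see in code. The item names below are made up, but the transformation (aggregating one client's transactions into a single bag-of-items count vector, exactly the input a multinomial topic model expects) is the one described in the next section:

```python
# A client's purchase history becomes a "document": a map from item to
# purchase count, analogous to a bag-of-words vector.
from collections import Counter

transactions = [
    ["sandwich", "soda"],          # visit 1
    ["sandwich", "cookie"],        # visit 2
    ["soda", "cookie", "cookie"],  # visit 3
]
bag_of_items = Counter(item for basket in transactions for item in basket)
print(bag_of_items)  # Counter({'cookie': 3, 'sandwich': 2, 'soda': 2})
```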
It comes as no surprise that some researchers and practitioners have examined the application of topic models to market-basket data. After all, a transaction (or a client's set of transactions) can be seen as a document in which the items (words) appear with a certain frequency (the number of times an item is bought), and topics are clusters that might correspond to market segments or prototypes of consumer behavior. Hruschka [1] examines the applicability of Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM) to market basket analysis, concluding that LDA in particular provides a high degree of interpretability. However, mixture-of-multinomials models had been proposed earlier [2].

LDA: Clients vs. documents

At Singularities, we have been experimenting with different mixture models and datasets. Our first approach was to use LDA on a restaurant membership dataset by aggregating each client's transactions separately, so that each “document” contains the number of times a given client has bought a particular item. As a result, we obtained a highly interpretable model. However, we identified two drawbacks with this approach:
1. LDA models each document as a mixture of topics. In our example, a given client could be categorized as {Topic1: 40%, Topic3: 40%, Topic9: 20%}, where Topic1 corresponds to veggie patty sandwiches, Topic3 to soft drinks, and Topic9 to desserts. As a result, there is no single “cluster” to which a given client can be assigned, and most criteria for picking one (e.g. selecting the topic with the highest probability) are to some extent arbitrary (see the sketch after this list). Certainly, this can be overcome by adjusting the Dirichlet priors' concentration parameters α and β before training so as to obtain a sparser model; however, this requires case-specific tuning (hard!), and many implementations (for instance, Spark's MLlib) restrict the range of possible values for these parameters.
2. Empirical results in the literature have shown that LDA tends to underperform on corpora composed of short documents [3]. This affects the stability of the topic assignments in our case, particularly for clients with a relatively small number of transactions.
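The following sketch illustrates drawback 1 using scikit-learn's LDA implementation (not the Spark MLlib one we actually used); the toy data, topic count, and prior values are assumptions made purely for illustration:

```python
# Each row of `client_counts` is one client's aggregated item counts.
# LDA returns a topic *mixture* per client, so collapsing it to a single
# cluster label (e.g. via argmax) is a somewhat arbitrary reduction.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

client_counts = np.random.default_rng(1).poisson(2.0, size=(100, 50))  # toy data
lda = LatentDirichletAllocation(
    n_components=10,
    doc_topic_prior=0.1,   # smaller alpha -> sparser per-client mixtures
    topic_word_prior=0.1,  # smaller beta  -> sparser per-topic item profiles
    random_state=0,
)
theta = lda.fit_transform(client_counts)  # (clients x topics) mixture weights
hard_labels = theta.argmax(axis=1)        # one (arbitrary) way to force a single cluster
```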
To deal with the above drawbacks, we have been experimenting with simpler and stricter models, which we describe in the rest of this blog post.

Moving forward

The first model we experimented with is the Dirichlet Multinomial Mixture (DMM), a fully Bayesian version of the mixture-of-unigrams model. This model employs a “harder” clustering assignment in which each document is assigned to a single cluster (much as each point in a GMM is generated by a single Gaussian component), and has been shown to provide good allocations for shorter documents, with a tendency to use a minimal number of clusters, in a way similar to the “rich get richer” property seen in nonparametric models [4]. Our preliminary results with the DMM show highly stable assignments and convergence in fewer iterations (an unexpected performance plus), at the cost of a less interpretable model that leads to a higher number of topics.
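For reference, a minimal collapsed Gibbs sampler for the DMM in the spirit of Yin and Wang's GSDMM [4] might look as follows. The hyperparameters, cluster count, and iteration budget are placeholder assumptions; this is a sketch, not our production inference code:

```python
# Collapsed Gibbs sampling for the Dirichlet Multinomial Mixture: each
# document gets exactly one cluster, resampled from its full conditional.
import numpy as np

def gsdmm(docs, vocab_size, k=10, alpha=0.1, beta=0.1, iters=30, seed=0):
    """docs: list of lists of item ids. Returns one hard cluster label per doc."""
    rng = np.random.default_rng(seed)
    z = rng.integers(k, size=len(docs))            # initial hard assignments
    m = np.bincount(z, minlength=k).astype(float)  # documents per cluster
    n = np.zeros(k)                                # total items per cluster
    nw = np.zeros((k, vocab_size))                 # per-item counts per cluster
    for d, doc in enumerate(docs):
        n[z[d]] += len(doc)
        for w in doc:
            nw[z[d], w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Remove document d from its current cluster.
            kd = z[d]
            m[kd] -= 1
            n[kd] -= len(doc)
            for w in doc:
                nw[kd, w] -= 1
            # Full conditional p(z_d = c | rest), computed in log space.
            logp = np.log(m + alpha)
            seen = {}
            for i, w in enumerate(doc):
                j = seen.get(w, 0)
                logp += np.log(nw[:, w] + beta + j)
                logp -= np.log(n + vocab_size * beta + i)
                seen[w] = j + 1
            p = np.exp(logp - logp.max())
            kd = rng.choice(k, p=p / p.sum())
            # Reinsert the document under the newly sampled cluster.
            z[d] = kd
            m[kd] += 1
            n[kd] += len(doc)
            for w in doc:
                nw[kd, w] += 1
    return z
```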
The second model we are currently experimenting with is our own fully Bayesian Bernoulli Mixture Model (BBMM). Here, we take a step back and observe that a majority of transactions in our datasets have a simple yes/no structure with respect to items, in traditional market-basket fashion: items in a transaction are rarely present in multiple quantities. Moreover, we cluster transactions separately, regardless of the client they belong to. Our preliminary results show stable and more separable assignments than with LDA and DMM, convergence in a rather small number of iterations (similar to that of DMM), and a more interpretable model with a smaller number of topics than DMM. However, training cost (measured as running time per iteration) is significantly higher in the BBMM, primarily because the assignment has to be done on each transaction, and not on a per-client aggregate of transactions.
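To illustrate the underlying model (though not the fully Bayesian treatment we actually use), here is a maximum-likelihood EM sketch of a Bernoulli mixture over binary basket vectors; the cluster count and initialization are assumptions:

```python
# EM for a Bernoulli mixture: each cluster is a vector of per-item
# purchase probabilities, and every individual transaction gets its own
# responsibility vector (this per-transaction E-step is the costly part).
import numpy as np

def em_bernoulli_mixture(X, k=10, iters=50, seed=0):
    """X: (n_transactions x n_items) binary matrix. Returns weights, item probs."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                  # mixing weights
    theta = rng.uniform(0.25, 0.75, (k, d))   # per-cluster item probabilities
    for _ in range(iters):
        # E-step: log-likelihood of every transaction under every component.
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and item probabilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta
```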
For example, one of our datasets comprises about 25 million transactions belonging to approximately 1 million customers. Both LDA and DMM pre-process the dataset so that, in the end, only 1 million “documents” have to be assigned to clusters; the BBMM, however, has to work with all 25 million “documents” separately. This implies a significant performance impact. As a result, we have directed our efforts at optimizing the inference procedure of this model.

References

[1] H. Hruschka. “Linking Multi-Category Purchases to Latent Activities of Shoppers: Analysing Market Baskets by Topic Models”. University of Regensburg Working Papers in Business, Economics and Management Information Systems (2014).
[2] I. Cadez et al. “Predictive Profiles for Transaction Data Using Finite Mixture Models”. Technical Report No. 01-67, University of California, Irvine (2001).
[3] L. Hong and B. Davison. “Empirical Study of Topic Modeling in Twitter”. SOMA 2010.
[4] J. Yin and J. Wang. “A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering”. KDD 2014.