CRAN Task View: Cluster Analysis & Finite Mixture Models

This CRAN Task View contains a list of packages that can be used for finding groups in data and modeling unobserved cross-sectional heterogeneity. Many packages provide functionality for more than one of the topics listed below; the section headings are mainly meant as quick starting points rather than an ultimate categorization. Except for packages stats and cluster (which ship with base R and hence are part of every R installation), each package is listed only once.

Most of the packages listed in this CRAN Task View, but not all, are distributed under the GPL. Please have a look at the DESCRIPTION file of each package to check under which license it is distributed.

Hierarchical Clustering:

Functions hclust() from package stats and agnes() from cluster are the primary functions for agglomerative hierarchical clustering; function diana() from cluster can be used for divisive hierarchical clustering. Faster alternatives to hclust() are provided by the packages fastcluster and flashClust.
Function as.dendrogram() from stats and associated methods can be used for improved visualization of cluster dendrograms.
Package dynamicTreeCut contains methods for detection of clusters in hierarchical clustering dendrograms.
hybridHclust implements hybrid hierarchical clustering via mutual clusters.
Package isopam uses an algorithm which is based on the classification of ordination scores from isometric feature mapping. The classification is performed either as a hierarchical, divisive method or as non-hierarchical partitioning.
The package protoclust implements a form of hierarchical clustering that associates a prototypical element with each interior node of the dendrogram. Using the package's plot() function, one can produce dendrograms that are prototype-labeled and are therefore easier to interpret.
pvclust is a package for assessing the uncertainty in hierarchical cluster analysis. It provides approximately unbiased p-values as well as bootstrap p-values.
Package sparcl provides clustering for a set of n observations when p variables are available, where p >> n. It adaptively chooses a set of variables to use in clustering the observations. Sparse k-means clustering and sparse hierarchical clustering are implemented.
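
As a minimal sketch of the base functions named at the top of this section, the following example clusters the built-in USArrests data with hclust(), cuts the tree with cutree(), and fits agnes() and diana() from cluster. The choice of data set, average linkage, and four clusters is an assumption made purely for illustration.

```r
library(cluster)

# Standardize the variables and compute Euclidean distances between states.
x <- scale(USArrests)
d <- dist(x)

# Agglomerative clustering with average linkage (stats).
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 4)          # cut the tree into four clusters
table(groups)

# A dendrogram object allows more flexible plotting and manipulation.
dend <- as.dendrogram(hc)
if (interactive()) plot(dend, main = "USArrests, average linkage")

# Agglomerative and divisive alternatives from package cluster.
ag <- agnes(x, method = "average")
dv <- diana(x)
ag$ac   # agglomerative coefficient
dv$dc   # divisive coefficient
```

Both coefficients lie between 0 and 1, with larger values suggesting a more pronounced clustering structure.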

Partitioning Clustering:

Function kmeans() from package stats provides several algorithms for computing partitions with respect to Euclidean distance.
Function pam() from package cluster implements partitioning around medoids and can work with arbitrary distances. Function clara() is a wrapper to pam() for larger data sets. Silhouette plots and spanning ellipses can be used for visualization.
Package apcluster implements Frey's and Dueck's Affinity Propagation clustering. The algorithms in the package are analogous to the Matlab code published by Frey and Dueck.
Package bayesclust allows testing and searching for clusters in a hierarchical Bayes model.
Package clusterSim allows searching for the optimal clustering procedure for a given dataset.
Package flexclust provides k-centroid cluster algorithms for arbitrary distance measures, hard competitive learning, neural gas and QT clustering. Neighborhood graphs and image plots of partitions are available for visualization. Some of this functionality is also provided by package cclust.
Package kernlab provides a weighted kernel version of the k-means algorithm via kkmeans() and spectral clustering via specc().
Packages kml and kml3d provide k-means clustering specifically for longitudinal (joint) data.
Package skmeans provides spherical k-means clustering, i.e., k-means clustering with cosine similarity. It features several methods, including a genetic and a simple fixed-point algorithm and an interface to the CLUTO vcluster program for clustering high-dimensional datasets.
Package trimcluster provides trimmed k-means clustering.
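
The base partitioning functions mentioned at the start of this section can be tried as follows. This is only a sketch: the iris measurements, three clusters, and the Manhattan distance are illustrative assumptions, not recommendations.

```r
library(cluster)

set.seed(1)                                  # kmeans() uses random starts
x <- scale(iris[, 1:4])                      # numeric columns only

# k-means with several random restarts (stats).
km <- kmeans(x, centers = 3, nstart = 20)
km$size

# pam() accepts a dissimilarity object, so any distance can be used.
d  <- dist(x, method = "manhattan")
pm <- pam(d, k = 3)
pm$medoids                                   # medoid observations

# clara() works on the data matrix and subsamples for larger data sets.
cl <- clara(x, k = 3, samples = 50)

# Average silhouette width as a rough summary of the pam() solution.
sil <- silhouette(pm)
avg_width <- mean(sil[, "sil_width"])
avg_width
```

In an interactive session, plot(sil) draws the silhouette plot for the same solution.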

Model-Based Clustering:

ML estimation:
- For semi- or partially supervised problems, where for a part of the observations labels are given with certainty or with some probability, package bgmm provides belief-based and soft-label mixture modeling for mixtures of Gaussians with the EM algorithm.
- EMCluster provides EM algorithms and several efficient initialization methods for model-based clustering of finite mixtures of Gaussian distributions with unstructured dispersion in both unsupervised and semi-supervised learning situations.
- Package FisherEM implements a subspace clustering method which allows for efficient unsupervised classification of high-dimensional data. It is based on the Gaussian mixture model and on the idea that the data lives in a common and low dimensional subspace. An EM-like algorithm estimates both the discriminative subspace and the parameters of the mixture model.
- Package HDclassif provides function hddc() to fit a Gaussian mixture model to high-dimensional data where it is assumed that the data lives in a lower dimension than the original space.
- Package HMMmix allows fitting mixtures of Gaussians with a hidden Markov model for the latent component memberships of the clusters, which might be derived by combining components of the mixture.
- Package teigen allows fitting multivariate t-distribution mixture models (with eigen-decomposed covariance structure) from a clustering or classification point of view. Package longclust allows fitting these models as well as Gaussian mixture models to longitudinal data.
- Package mclust fits mixtures of Gaussians using the EM algorithm. It allows fine control of volume and shape of covariance matrices and agglomerative hierarchical clustering based on maximum likelihood. It provides comprehensive strategies using hierarchical clustering, EM and the Bayesian Information Criterion (BIC) for clustering, density estimation, and discriminant analysis. Package Rmixmod provides tools for fitting mixture models of multivariate Gaussian or multinomial components to a given data set with either a clustering, a density estimation or a discriminant analysis point of view. mclust provides only 10 of the 14 possible variance-covariance structures based on the eigenvalue decomposition. All 14 variants are provided by packages mixture and Rmixmod.
- Package MetabolAnalyze fits mixtures of probabilistic principal component analysis with the EM algorithm.
- Fitting finite mixtures of uni- and multivariate scale mixtures of skew-normal distributions with the EM algorithm is provided by package mixsmsn.
- Package movMF fits finite mixtures of von Mises-Fisher distributions with the EM algorithm.
- Package MFDA implements model-based functional data analysis.
- Package GLDEX fits mixtures of generalized lambda distributions; for grouped conditional data, package mixdist can be used.
- mritc provides tools for classification using normal mixture models and (higher resolution) hidden Markov normal mixture models fitted by various methods.
- Parsimonious Gaussian mixture models allow fitting mixtures of factor analyzers with constraints on the components of the factor models. Functionality to fit these models is provided in package pgmm.
- prabclus clusters a presence-absence matrix object by calculating an MDS from the distances, and applying maximum likelihood Gaussian mixtures clustering to the MDS points.
- Package psychomix estimates mixtures of the dichotomous Rasch model (via conditional ML) and the Bradley-Terry model. Package mixRasch estimates mixture Rasch models, including the dichotomous Rasch model, the rating scale model, and the partial credit model with joint maximum likelihood estimation.
- Package pmclust provides unsupervised model-based clustering for high-dimensional (ultra) large data. The package uses pbdMPI to perform a parallel version of the EM algorithm for mixtures of Gaussians.
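
Many of the packages above fit Gaussian mixtures with the EM algorithm. The following base R sketch shows the idea for the simplest case, a two-component univariate mixture. It is not the implementation used by any of the listed packages; the function name em_gauss2, the quartile-based starting values, and the use of the faithful data are assumptions made for illustration.

```r
# EM for a two-component univariate Gaussian mixture (illustrative only).
em_gauss2 <- function(y, max_iter = 200, tol = 1e-8) {
  # Crude starting values: quartiles as means, pooled sd, equal weights.
  mu     <- unname(quantile(y, c(0.25, 0.75)))
  sigma  <- rep(sd(y), 2)
  prior  <- c(0.5, 0.5)
  loglik <- numeric(0)

  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability of each component for each observation.
    dens <- cbind(prior[1] * dnorm(y, mu[1], sigma[1]),
                  prior[2] * dnorm(y, mu[2], sigma[2]))
    loglik <- c(loglik, sum(log(rowSums(dens))))
    post <- dens / rowSums(dens)

    # M-step: weighted ML updates of proportions, means and sds.
    nk    <- colSums(post)
    prior <- nk / length(y)
    mu    <- colSums(post * y) / nk
    sigma <- sqrt(colSums(post * outer(y, mu, "-")^2) / nk)

    # Stop once the log-likelihood no longer improves noticeably.
    n <- length(loglik)
    if (n > 1 && loglik[n] - loglik[n - 1] < tol) break
  }
  # 'posterior' is taken from the last E-step.
  list(prior = prior, mu = mu, sigma = sigma,
       posterior = post, loglik = loglik)
}

fit <- em_gauss2(faithful$waiting)
fit$prior
fit$mu
fit$sigma
```

The log-likelihood stored in fit$loglik should not decrease from one iteration to the next, which is a useful sanity check for any EM implementation.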
Bayesian estimation:
- Bayesian estimation of finite mixtures of multivariate Gaussians is possible using package bayesm. The package provides functionality for sampling from such a mixture as well as estimating the model using Gibbs sampling. Additional functionality for analyzing the MCMC chains is available: averaging the moments over MCMC draws, determining the marginal densities, clustering observations, and plotting the uni- and bivariate marginal densities.
- Package bayesMCClust provides various Markov Chain Monte Carlo samplers for model-based clustering of discrete-valued time series obtained by observing a categorical variable with several states using a Bayesian approach.
- Package bayesmix provides Bayesian estimation using JAGS.
- Package bclust allows Bayesian clustering using a spike-and-slab hierarchical model and is suitable for clustering high-dimensional data.
- Package Bmix provides Bayesian sampling for stick-breaking mixtures.
- Package dpmixsim fits Dirichlet process mixture models using conjugate models with normal structure. Package profdpm determines the maximum posterior estimate for product partition models where the Dirichlet process mixture is a specific case in the class.
- Package mixAK contains a mixture of statistical methods including the MCMC methods to analyze normal mixtures with possibly censored data.
- Package GSM fits mixtures of gamma distributions.
- Package mcclust implements methods for processing a sample of (hard) clusterings, e.g. the MCMC output of a Bayesian clustering model. Among them are methods that find a single best clustering to represent the sample, which are based on the posterior similarity matrix or a relabeling algorithm.
- Package rjags provides an interface to the JAGS MCMC library which includes a module for mixture modelling.
Other estimation methods:
- Package AdMit allows fitting an adaptive mixture of Student-t distributions to approximate a target density through its kernel function.
- Package pendensity estimates densities with a penalized mixture approach.
- Robust estimation using Weighted Likelihood can be done with package wle.

Other Cluster Algorithms:

Package amap provides alternative implementations of k-means and agglomerative hierarchical clustering.
Package biclust provides several algorithms to find biclusters in two-dimensional data.
Package cba implements clustering techniques for business analytics like "rock" and "proximus".
Package CHsharp clusters 3-dimensional data into their local modes based on a convergent form of Choi and Hall's (1999) data sharpening method.
Package clue implements ensemble methods for both hierarchical and partitioning cluster methods.
Package CoClust implements a cluster algorithm that is based on copula functions and therefore allows grouping observations according to the multivariate dependence structure of the generating process without any assumptions on the margins.
Fuzzy clustering and bagged clustering are available in package e1071.
Package compHclust provides complementary hierarchical clustering, which was especially designed for microarray data to uncover structures present in the data that arise from 'weak' genes.
Package FactoClass performs a combination of factorial methods and cluster analysis.
The hopach algorithm is a hybrid between hierarchical methods and PAM and builds a tree by recursively partitioning a data set.
For graphs and networks, model-based clustering approaches are implemented in packages latentnet and mixer.
Package nnclust allows fast clustering of large data sets by constructing a minimum spanning tree for each cluster. For each cluster the procedure is stopped when the nearest-neighbor distance rises above a specified threshold. A set of clusters and a set of "outliers" not in any cluster is returned. The algorithm works best for well-separated clusters in up to 8 dimensions, and sample sizes up to hundreds of thousands.
Package optpart contains a set of algorithms for creating partitions and coverings of objects largely based on operations on similarity relations (or matrices).
Package pdfCluster provides tools to perform cluster analysis via kernel density estimation. Clusters are associated to the maximally connected components with estimated density above a threshold. In addition a tree structure associated with the connected components is obtained.
Package randomLCA provides the fitting of latent class models which optionally also include a random effect. Package poLCA allows for polytomous variable latent class analysis and regression. BayesLCA allows fitting Bayesian LCA models employing the EM algorithm, Gibbs sampling or variational Bayes methods.
Package RPMM fits recursively partitioned mixture models for Beta and Gaussian mixtures. This is a model-based clustering algorithm that returns a hierarchy of classes, similar to hierarchical clustering, but also similar to finite mixture models.
Self-organizing maps are available in package som.
Several packages provide cluster algorithms which have been developed for bioinformatics applications. These packages include FunCluster for profiling microarray expression data and ORIClust for order-restricted information-based clustering.

Cluster-wise Regression:

Package flexmix implements a user-extensible framework for EM-estimation of mixtures of regression models, including mixtures of (generalized) linear models.
Package fpc provides fixed-point methods both for model-based clustering and linear regression. A collection of asymmetric projection methods can be used to plot various aspects of a clustering.
Multigroup mixtures of latent Markov models on mixed categorical and continuous data (including time series) can be fitted using depmix or depmixS4. The parameters are optimized using a general purpose optimization routine given linear and nonlinear constraints on the parameters.
Package mixreg fits mixtures of one-variable regressions and provides the bootstrap test for the number of components.
Package lcmm fits a latent class linear mixed model which is also known as growth mixture model or heterogeneous linear mixed model using a maximum likelihood method.
mixtools provides fitting with the EM algorithm for parametric and non-parametric (multivariate) mixtures. Parametric mixtures include mixtures of multinomials, multivariate normals, normals with repeated measures, Poisson regressions and Gaussian regressions (with random effects). Non-parametric mixtures include the univariate semi-parametric case where symmetry is imposed for identifiability and multivariate non-parametric mixtures with a conditional independence assumption. In addition, fitting mixtures of Gaussian regressions with the Metropolis-Hastings algorithm is available.
mixPHM fits mixtures of proportional hazard models with the EM algorithm.
Package gamlss.mx fits finite mixtures of gamlss family distributions.
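
To illustrate what EM estimation of a mixture of regressions involves, here is a base R sketch for two linear regression components fitted to simulated data. It does not reproduce the implementation of any package listed above; the simulated model, the starting values derived from the sign of the pooled residuals, and all variable names are assumptions for illustration.

```r
# EM for a two-component mixture of simple linear regressions (sketch only).
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
z <- rbinom(n, 1, 0.5)                        # hidden component labels
y <- ifelse(z == 1, 1 + 2 * x, 8 - x) + rnorm(n, sd = 0.5)

# Soft starting memberships from the sign of the pooled residuals.
r    <- residuals(lm(y ~ x))
post <- cbind(ifelse(r > 0, 0.9, 0.1), ifelse(r > 0, 0.1, 0.9))
loglik <- numeric(0)

for (iter in 1:100) {
  # M-step: one weighted least-squares fit per component.
  fits  <- lapply(1:2, function(k) lm(y ~ x, weights = post[, k]))
  sigma <- sapply(1:2, function(k)
    sqrt(sum(post[, k] * residuals(fits[[k]])^2) / sum(post[, k])))
  prior <- colMeans(post)

  # E-step: update the posterior component memberships.
  dens <- sapply(1:2, function(k)
    prior[k] * dnorm(y, fitted(fits[[k]]), sigma[k]))
  loglik <- c(loglik, sum(log(rowSums(dens))))
  post <- dens / rowSums(dens)

  # Stop once the log-likelihood has stabilized.
  m <- length(loglik)
  if (m > 1 && abs(loglik[m] - loglik[m - 1]) < 1e-8) break
}

coefs <- sapply(fits, coef)   # one column (intercept, slope) per component
coefs
```

As with any EM fit, the result depends on the starting values, so running several starts and keeping the solution with the highest log-likelihood is generally advisable.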

Additional Functionality:

Mixtures of univariate normal distributions can be printed and plotted using package nor1mix.
Package gcExplorer allows visualizing the results of clustering algorithms.
Package clusterGeneration contains functions for generating random clusters and random covariance/correlation matrices, calculating a separation index (data and population version) for pairs of clusters or cluster distributions, and 1-D and 2-D projection plots to visualize clusters. Alternatively MixSim generates a finite mixture model with Gaussian components for prespecified levels of maximum and/or average overlaps. This model can be used to simulate data for studying the performance of cluster algorithms.
For cluster validation, package clusterRepro tests the reproducibility of a cluster. Package clv contains popular internal and external cluster validation methods ready to use for most of the outputs produced by functions from package cluster, and clValid calculates several stability measures.
Package clustvarsel provides variable selection for model-based clustering.
Functionality to compare the similarity between two cluster solutions is provided by cluster.stats() in package fpc.
The stability of k-centroid clustering solutions fitted using functions from package flexclust can also be validated via bootFlexclust() using bootstrap methods.
Package MOCCA provides methods to analyze cluster alternatives based on multi-objective optimization of cluster validation indices.
Package seriation provides dissplot() for visualizing dissimilarity matrices using seriation and matrix shading. This also allows inspecting cluster quality by restricting objects belonging to the same cluster to be displayed in consecutive order.
Package sigclust provides a statistical method for testing the significance of clustering results.
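
As an illustration of comparing two cluster solutions, the sketch below computes the Rand index from the contingency table of two partitions using base R only. The helper rand_index is a hypothetical function written for this example and is not taken from fpc or any other package listed here.

```r
# Rand index between two partitions, computed from their contingency table.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab   <- table(a, b)
  pairs <- choose(length(a), 2)
  same_both <- sum(choose(as.vector(tab), 2))   # pairs together in both
  same_a    <- sum(choose(rowSums(tab), 2))     # pairs together in a
  same_b    <- sum(choose(colSums(tab), 2))     # pairs together in b
  # Agreements = pairs together in both + pairs apart in both.
  (pairs + 2 * same_both - same_a - same_b) / pairs
}

x  <- scale(iris[, 1:4])
hc <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 20)$cluster

rand_index(hc, km)   # agreement between the two solutions
rand_index(hc, hc)   # identical partitions give 1
```

The index depends only on which observations are grouped together, so it is unaffected by how the cluster labels are numbered.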

Maintainer:	Friedrich Leisch and Bettina Gruen
Contact:	Bettina.Gruen at jku.at
Version:	2014-02-21
