The TWOSTEP procedure in SPSS Statistics uses a different algorithm and chooses the number of clusters with either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). This procedure, STATS MIXED CLUSTER, uses prediction strength (PS) instead. The key idea (Tibshirani and Walther, 2005) is to view clustering as a supervised classification problem, in which we must also estimate the “true” class labels. The resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well.
PS tends to produce a smaller number of clusters than BIC or AIC. Since clustering is an ad hoc procedure, there isn’t in general a way to determine the best algorithm, so it may be useful to try both this procedure and TWOSTEP. This procedure does not provide test statistics or variable importance measures for the clusters, because test statistics would be biased. Use summary statistics by cluster to see which variables have different distributions across the calculated clusters.
STATS MIXED CLUSTER
VARIABLES = list of variables to cluster*
USEMODELFILE = "file specification"
USEMEMORYMODEL = NO** or YES
IDVAR = variable name*
OUTMODEL = "file specification"
MERGEDVARNAME = variable name*
/OPTIONS
NUMCLUST= number of clusters or minimum number*
NUMCLUSTEND = maximum number of clusters
NUMCLUSTINCR = increment for number of clusters
STANDARDIZE = NO** or YES
NUMINIT = number of initializations*
WEIGHTS = list of importance weights
MAXITER = maximum number of iterations
NUMPREDSTRCVRUN = number of cross validation runs
PREDSTRTHRESH = prediction strength threshold
/HELP
STATS MIXED CLUSTER /HELP displays this information and does nothing else.
* Required
** Default
STATS MIXED CLUSTER VARIABLES = salary salbegin jobtime jobcat gender
/OPTIONS NUMCLUST = 2 NUMCLUSTEND = 6 NUMCLUSTINCR = 2 NUMINIT=5 MAXITER=25 NUMPREDSTRCVRUN=2
PREDSTRTHRESH = .8.
This procedure does not support case weights. They will be ignored with a warning. It also does not support split files.
VARIABLES Specify the list of variables for clustering. Both scale and categorical variables must appear. Cases with missing values are excluded listwise.
USEMODELFILE specifies a model file saved from a previous run. If this or USEMEMORYMODEL is specified, omit the VARIABLES, IDVAR, OUTMODEL, and OPTIONS keywords. If neither this nor USEMEMORYMODEL is specified, a model is estimated and used for classifying the input cases.
USEMEMORYMODEL YES means that no model is estimated; instead, the model in memory is used, as long as it is still present. Ending the SPSS run or clearing the R workspace with a BEGIN PROGRAM R block will discard the model.
IDVAR identifies the ID variable to be used. Omit if using a saved or in-memory model. The IDVAR is needed because, if there is sample selection or there are missing values, the ID variable determines how the calculated clusters are added to the active file.
OUTMODEL specifies a file to hold the estimated model for future use. Omit if the command is reusing a saved or in-memory model.
MERGEDVARNAME specifies a variable to hold the assigned cluster numbers. When applying an existing model to new data, this variable can be different from the cluster variable used at estimation. If this variable already exists in the active dataset, it will be overwritten, provided that it is numeric. Cases excluded by sample selection will retain their previous cluster values.
NUMCLUST, NUMCLUSTEND, and NUMCLUSTINCR specify the number of clusters. If just NUMCLUST is specified, that number of clusters is estimated. If both NUMCLUST and NUMCLUSTEND are specified, all integers in that range are estimated and the best one is chosen. If NUMCLUSTINCR is given a value other than its default of 1, that increment is used to step through the range. For example, with NUMCLUST=2, NUMCLUSTEND=6, and NUMCLUSTINCR=2, solutions with 2, 4, and 6 clusters would be estimated.
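The stepping rule can be illustrated with a short Python sketch. This is not the procedure's own code; the function name `candidate_cluster_counts` and its argument names are hypothetical and simply mirror the keywords above.

```python
def candidate_cluster_counts(numclust, numclustend=None, numclustincr=1):
    """Return the cluster counts that would be estimated.

    Hypothetical helper mirroring NUMCLUST, NUMCLUSTEND, and NUMCLUSTINCR.
    """
    if numclustend is None:
        # Only NUMCLUST was given: a single solution is estimated.
        return [numclust]
    # Step from NUMCLUST to NUMCLUSTEND (inclusive) by the increment.
    return list(range(numclust, numclustend + 1, numclustincr))


print(candidate_cluster_counts(2, 6, 2))  # [2, 4, 6]
```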
If there is more than one cluster number to estimate, the prediction strength is calculated for each and used to determine the best number. NUMPREDSTRCVRUN and PREDSTRTHRESH are required in that case.
NUMPREDSTRCVRUN specifies how many cross-validation runs to use, with a minimum of 2 if more than one number of clusters is to be estimated.
PREDSTRTHRESH specifies the prediction threshold. Typical values would be .8 or .9.
Prediction strength uses all but the last cross validation as a training sample and the last as the test. The optimal number of clusters is the largest number that can be reliably predicted in the dataset. For each test cluster, the procedure computes the proportion of observation pairs in that cluster that are also assigned to the same cluster by the training set centroids. The prediction strength is the minimum of this quantity over the k test clusters. See Section 2 in Tibshirani and Walther for the details.
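As a rough illustration of the pairwise calculation described above, the following Python sketch computes the measure from two label vectors for the same test observations. It is a simplified sketch, not the Kamila implementation: the function name and arguments are hypothetical, and skipping single-member test clusters (which contain no pairs) is an assumption made here.

```python
from itertools import combinations


def prediction_strength(test_labels, predicted_labels):
    """Minimum, over test clusters, of the share of co-assigned pairs.

    test_labels: clusters found by clustering the test sample itself.
    predicted_labels: clusters assigned to the same test observations
        by the training-set centroids.
    """
    proportions = []
    for cluster in set(test_labels):
        members = [i for i, lab in enumerate(test_labels) if lab == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:
            # Assumption: a one-member cluster has no pairs and is skipped.
            continue
        same = sum(predicted_labels[i] == predicted_labels[j] for i, j in pairs)
        proportions.append(same / len(pairs))
    if not proportions:
        raise ValueError("no test cluster contains a pair of observations")
    return min(proportions)


# One observation in test cluster 0 is assigned elsewhere by the training
# centroids, so only 1 of the 3 pairs in that cluster stays together.
ps = prediction_strength([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Only co-membership matters, so the cluster numbers in the two vectors do not need to match.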
If the prediction strength measure is below the threshold for a number of clusters, that number of clusters is not considered in choosing the optimal number, but if all cluster numbers give a value below the threshold, the least bad number of clusters is chosen. Tibshirani and Walther recommend a value of .8 to .9 for PS, but other authors have suggested a lower value if cluster overlap is expected. The lower the threshold, the greater the number of clusters is likely to be. See the references for more information.
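The selection rule can be sketched in Python as follows. The function `choose_num_clusters` is hypothetical, and reading “least bad” as the cluster number with the highest prediction strength is an assumption of this sketch, not a statement of what the procedure does internally.

```python
def choose_num_clusters(strength_by_k, threshold=0.8):
    """Pick a number of clusters from {k: prediction strength}.

    Hypothetical helper: returns the largest k whose prediction strength
    is not below the threshold. If every k falls below the threshold,
    falls back to the k with the highest prediction strength (assumed
    meaning of "least bad").
    """
    passing = [k for k, ps in strength_by_k.items() if ps >= threshold]
    if passing:
        return max(passing)
    return max(strength_by_k, key=strength_by_k.get)


# Illustrative values only, not results from real data.
strengths = {2: 0.95, 4: 0.85, 6: 0.60}
best_at_80 = choose_num_clusters(strengths, threshold=0.8)
best_at_90 = choose_num_clusters(strengths, threshold=0.9)
```

With these illustrative values, lowering the threshold from .9 to .8 moves the choice from 2 clusters to 4, which matches the remark that a lower threshold tends to give more clusters.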
STANDARDIZE Choose YES to standardize the scale variables to mean zero and variance one. An alternative would be to use importance weights.
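For reference, standardizing a scale variable amounts to the following Python sketch. The use of the sample standard deviation (divisor n - 1) is an assumption; the procedure's documentation does not say which divisor is used.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean zero and (sample) variance one."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - m) / s for v in values]


z = standardize([10.0, 20.0, 30.0])
```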
NUMINIT With k-means clustering, the resulting clusters may vary depending on the initial cluster choices. Specify a value greater than one to try more than one set of initializations. For the scale variables, random draws from the observed continuous data vectors are used while for categorical variables, a draw from a Dirichlet distribution with shape parameters all equal to one is used.
WEIGHTS By default, all variables are weighted equally (weight 1) in computing the clusters, but you can specify a different weight for each variable. In that case, as many weights as variables must be specified, and the weights must be entered in the same order as the variables in the VARIABLES list. Weight values must be between 0 and 1, with 0 meaning that the variable is ignored.
Scale variable weights are applied during the calculation of Euclidean distance, and categorical variable weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership. The total log-likelihood for the kth cluster is obtained by weighting the single scale log-likelihood by the mean of all scale weights and adding the log-likelihood for the categorical variables.
MAXITER specifies the maximum number of iterations for the calculation. The default is 25.
This procedure uses the Kamila R package by Alexander H. Foss and Marianthi Markatou. If Kamila changes in significant ways, this procedure might also change. See the Kamila Reference Manual.
Tibshirani R, Walther G (2005). “Cluster Validation by Prediction Strength.” Journal of Computational and Graphical Statistics, 14(3), 511–528. doi:10.1198/106186005x59243.
© Copyright Jon K Peck 2025