ISSN: 2056-3736 (Online Version) | 2056-3728 (Print Version)

Statistical Industry Classification

Zura Kakushadze and Willie Yu

Correspondence: Zura Kakushadze

Quantigic Solutions LLC, USA



We give complete algorithms and source code for constructing (multilevel) statistical industry classifications, including methods for fixing the number of clusters at each level (and the number of levels). Under the hood there are clustering algorithms (e.g., k-means). However, what should we cluster? Correlations? Returns? The answer turns out to be neither, and our backtests suggest that these details make a sizable difference. We also give an algorithm and source code for building "hybrid" industry classifications, which improve off-the-shelf "fundamental" industry classifications by applying our statistical industry classification methods to them. The presentation is intended to be pedagogical and geared toward practical applications in quantitative trading.
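
To make the clustering step concrete, here is a minimal single-level sketch in Python using only NumPy. The abstract does not say which quantity should be clustered, so the sketch uses returns divided by each stock's volatility as a stand-in; that choice, the function names `kmeans` and `classify`, and the synthetic two-factor data are all assumptions made for illustration and are not claimed to be the authors' method or code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means on the rows of X; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct rows of X.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

def classify(returns, k, seed=0):
    """returns: (n_stocks, n_days) array. Hypothetical helper that clusters
    volatility-scaled returns (an illustrative assumption, see lead-in)."""
    vol = returns.std(axis=1, ddof=1, keepdims=True)
    return kmeans(returns / vol, k, seed=seed)[0]

# Synthetic example: 10 stocks, two groups of 5, each group driven by its own factor.
rng = np.random.default_rng(42)
n_days = 250
factors = rng.normal(size=(2, n_days))
returns = np.vstack([
    factors[0] + 0.1 * rng.normal(size=(5, n_days)),
    factors[1] + 0.1 * rng.normal(size=(5, n_days)),
])
labels = classify(returns, k=2)
print(labels)
```

Each distinct label plays the role of one statistical "industry". Because k-means depends on its random initialization, a practical version would typically aggregate over many runs, and a multilevel classification would repeat the clustering within or across levels.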


Industry classification, clustering, cluster numbers, machine learning, statistical risk models, industry risk factors, optimization, regression, mean-reversion, correlation matrix, factor loadings, principal components, hierarchical agglomerative clustering, k-means, statistical methods, multilevel.

