Supervised machine learning techniques have become well established in the study of spectroscopy data. However, the unsupervised learning technique of cluster analysis has not reached the same level of maturity in chemometric analysis. This paper surveys recent studies that apply cluster analysis to NIR and IR spectroscopy data. In addition, we summarize the current practices in cluster analysis of spectroscopy and contrast them with the cluster analysis literature from the machine learning and pattern recognition domains. This includes practices in data pre-processing, feature extraction, clustering distance metrics, clustering algorithms, and validation techniques. Special consideration is given to the specific characteristics of IR and NIR spectroscopy data, which typically include high dimensionality and relatively low sample size. The findings highlight a lack of quantitative analysis and evaluation in current practices for cluster analysis of IR and NIR spectroscopy data. With this in mind, we propose an analysis model or workflow with techniques specifically suited to cluster analysis of IR and NIR spectroscopy data, along with a pragmatic application strategy.

In the study of IR and NIR spectroscopy in the field of chemometrics, there is a well-established range of multivariate analysis techniques based on machine learning that have proved well suited to the chemical spectroscopy data [

Cluster analysis is a technique that offers potential value for scenarios in the analysis of spectroscopy but has not reached the same level of maturity in its application to this domain. Cluster analysis is an unsupervised machine learning technique aimed at generating knowledge from unlabelled data [

While cluster analysis is a well-established domain and widely used across diverse disciplines, it would be wrong to assume its application would be clear-cut and simply procedural. It is a highly subjective domain, with many potential techniques whose success will vary depending on the characteristics of the data and the purpose of the analysis. Clustering is very much a human construct, hence, mathematical definitions are challenging and even the definition of good clustering is subjective [

- What features should be used for clustering?
- How is similarity defined and measured?
- How many clusters are present?
- Which clustering algorithms should be used?
- Does the data actually have any clustering tendency?
- Are the discovered clusters valid?

These challenges and the data-specific characteristics of clustering contribute to the reason why there is no universal “best” clustering algorithm [

In this paper, we quantitatively survey 50 papers in which cluster analysis is applied to IR and NIR spectroscopy data to understand current practice in this form of analysis. In reviewing the current approaches to clustering IR and NIR spectroscopy data, consideration and commentary are given to highlight potential issues in current practice.

We also draw on more than 25 papers and texts we have cited from the machine learning associated domains to identify techniques that could contribute towards an improved future practice in cluster analysis of spectroscopy. Special consideration is given to two important characteristics of the spectroscopy data:

High Dimensionality: A large number of measurements are taken at intervals across a spectrum for each sample. From the data analysis perspective, these form the variables or features. Depending on the type of spectroscopy and the specifics of the instrumentation, the number of features is typically in the hundreds or thousands for each sample. In other cluster analysis literature [

Low sample size: Spectroscopy and the associated instrumentation are typically used in laboratory situations. Collecting and processing samples can be an expensive process from the perspective of cost, time, and expertise. Hence, the number of samples is often relatively small, particularly from a machine learning perspective. This precludes the use of some cutting-edge cluster analysis techniques such as clustering with deep neural networks (deep clustering).

These characteristics present unique challenges that focus and somewhat limit the techniques suitable for cluster analysis of spectroscopy data. Hence, this paper presents a novel perspective specific to the needs of cluster analysis in IR and NIR spectroscopy while drawing on strong practices from the machine learning community. This culminates in a proposed analysis model or workflow to assist practitioners in ensuring rigor and validity in their cluster analysis.

An exhaustive review was conducted to collate 50 papers published between 2002 and 2020 in which a form of cluster analysis is applied to data from IR and NIR spectroscopy. 44 journal papers [

The papers surveyed cover a range of application domains including food and agriculture (30 papers) [

The purpose of the majority of the papers is to demonstrate that an analytical testing technique, such as FTIR spectroscopy, paired with cluster analysis can discriminate between different classes of materials. Examples of these classes include cancerous and non-cancerous cells, provenancing of biological species such as tea varieties, fungi or bacteria, and contaminated materials such as counterfeit drugs or adulterated olive oil. Many of the papers subsequently extended this capability beyond cluster analysis through the application of other techniques, such as supervised learning, to develop models for classifying future unlabelled samples. Thirteen of the papers included comparisons of multiple clustering techniques [

Several of the papers include analysis of additional types of analytical techniques such as Raman spectroscopy or gas chromatography-mass spectrometry (GC-MS), however, we will only include the NIR and IR aspects of those papers in this review.

In reviewing the 50 surveyed papers to understand the current state of cluster analysis of IR and NIR spectroscopy data, we focus on the aspects of analysis that we consider important for successful cluster analysis. Firstly, the traditional chemometrics aspects of pre-processing, feature selection, and principal component analysis are reviewed. Then the cluster analysis aspects are reviewed. In this part, we include techniques covering each of these aspects which are found in the classic and contemporary pattern recognition and machine learning literature but may not have been considered in the chemometrics literature. These include evaluating the data’s tendency to cluster, the similarity measure used for the clustering, the clustering algorithm itself, how the number of clusters was selected, and how the results were evaluated and quantified. For each aspect, the justification for including this step in the analysis is presented, along with the potential pitfalls of omitting it. This is then compared to the analysis within the surveyed papers to understand current practice and to highlight potential shortcomings.

In reflecting on these findings, potential reasons for this current practice are discussed. Finally, a proposed analysis model or workflow is presented for clustering of NIR and IR spectroscopy data that aims to ensure rigor and validity for future practitioners conducting cluster analysis.

Initially, we review the early steps in the analysis process where traditional chemometric techniques are applied before the cluster analysis. The aim of these traditional chemometric analysis stages is to improve the suitability of the data for clustering, hence improving the clustering outcomes. These include data pre-processing, feature selection, and principal component analysis. While the primary focus of this paper is on the cluster analysis, these traditional chemometric analysis components are crucial to the clustering outcomes and warrant investigation.

Data pre-processing methods are used to remove or reduce unwanted signals from data such as instrumental and experimental artefacts prior to cluster analysis. If not performed in the right way, pre-processing can also introduce or emphasize unwanted variation. Hence, proper pre-processing is a critical first step that directly influences the follow-on analysis in the workflow [

In reviewing the surveyed papers (summarized in

Pre-processing technique | Instances |
---|---|
Normalisation (scale & centre) [ | 12 |
Baseline correction [ | 8 |
Unit area normalisation [ | 1 |
Vector normalisation [ | 13 |
Savitzky–Golay smoothing [ | 22 |
1st derivative [ | 15 |
2nd derivative [ | 12 |
Standard normal variate (SNV) [ | 5 |
Multiplicative scatter correction (MSC) [ | 7 |
Extended MSC (EMSC) [ | 2 |
RMieS-EMSC [ | 2 |
No pre-processing [ | 12 |

Lee et al. [

It was also noted that 12 papers did not use any pre-processing, and some explicitly stated that they chose to use no pre-processing, without giving a justified reason. This is generally discouraged, as it forgoes the opportunity to correct the data for variations in equipment and measurement technique that may adversely impact the success of the later cluster analysis.
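To make these pre-processing steps concrete, the sketch below applies Savitzky–Golay smoothing with a first derivative followed by standard normal variate (SNV) correction, a combination that appears frequently in the surveyed papers. It uses NumPy and SciPy on synthetic data standing in for real absorbance spectra; it is an illustrative sketch, not code from any surveyed paper, and the window length and polynomial order are assumed values.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic stand-in for absorbance data: rows = samples, columns = wavenumbers.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(10, 600)).cumsum(axis=1)

# Savitzky-Golay smoothing combined with a 1st derivative.
smoothed_d1 = savgol_filter(spectra, window_length=11, polyorder=2, deriv=1, axis=1)

# SNV to reduce multiplicative scatter effects.
corrected = snv(smoothed_d1)
```

After SNV, each spectrum has zero mean and unit standard deviation, which removes per-sample offset and scale differences before clustering.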

Feature selection, also known as variable selection or variable reduction, refers to the selection of the useful variables that convey the relevant information within the data, and the removal of those that may include noise or non-valuable information. Within NIR and IR spectroscopy data, the wavenumbers (or wavelengths) are the variables (or features). Hence, feature selection works to remove wavenumbers containing irrelevant data or noise from the dataset. This works to reduce the dimensionality of the data and focus on the information of value. In one of the surveyed papers, Gierlinger et al. [

The summary of feature selection approaches from the 50 surveyed papers is shown in

Feature selection approach | Instances |
---|---|
A priori knowledge [ | 15 |
Visual spectra evaluation [ | 9 |
Quantitative selection techniques [ | 6 |
Full spectrum used [ | 24 |

Fifteen of the papers selected windows in the spectra based on a priori knowledge. This was typically knowledge of which “fingerprint” wavenumbers separate the spectra of the materials under investigation.

Nine of the papers selected windows of the spectra through visual evaluation of labelled spectra to see at what wavenumbers there was the maximum separation between the different samples’ spectra.

Only six of the papers used quantitative techniques for feature selection. One used a novel method based on an iterative variable elimination algorithm and a clustering quality index to select variables that maximize clustering quality [

The fact that only six papers used quantitative methods highlights an opportunity to exploit techniques from the machine learning research domain. Within the machine learning community, feature selection is a significant domain of research. However, it is predominantly focused on supervised learning techniques, which may not be applicable to unsupervised cluster analysis. Hence, care must be taken when choosing techniques to implement. The challenges of unsupervised feature selection are well explained by Dy et al. [
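As one simple example of a quantitative, unsupervised feature selection technique, a variance threshold keeps only those wavenumbers that actually vary across samples. The sketch below uses scikit-learn's `VarianceThreshold` on synthetic data; the threshold value and the simulated "informative band" are illustrative assumptions, not a recommendation drawn from the surveyed papers.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
# Synthetic data: 20 samples x 100 "wavenumbers". Most features are near-constant
# noise; a band of 10 features carries real between-sample variation.
X = rng.normal(scale=0.01, size=(20, 100))
X[:, 40:50] += rng.normal(scale=1.0, size=(20, 10))

# Keep only wavenumbers whose variance across samples exceeds the threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)
kept = selector.get_support(indices=True)
```

Here only the informative band survives, reducing 100 variables to 10 without using any class labels.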

PCA is one of the classic dimension reduction techniques of chemometrics and was used in the majority of the surveyed papers. Its dimension reducing capabilities can be used for multiple purposes. One particularly applicable to cluster analysis is to reduce the data to two or three principal components to enable visualization of the data points in two or three dimensions. This enables easy visualization of the clusters that form and visual validation of the clustering. As shown in

PCA usage | Instances |
---|---|
None [ | 15 |
Dimension reduction [ | 19 |
Visualization [ | 21 |
Variable/Feature selection [ | 3 |
Outlier removal [ | 2 |

Nineteen of the papers used PCA for its general dimension-reducing capabilities. Applying PCA can dramatically reduce the number of dimensions in IR or NIR spectroscopy data while still retaining a high percentage of the information. This is effectively a form of feature extraction in which the principal components from the PCA form the new variables. It was commonly observed for the typical 3500 dimensions (variables) in FTIR data to be reduced to 10 to 14 principal components while still retaining more than 99% of the original information. While this can speed analysis times and is an enabler for other analytical processes such as
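This kind of variance-targeted dimension reduction can be sketched with scikit-learn, which accepts a target fraction of retained variance directly. The synthetic data below (a few latent factors plus noise) only mimics the strong collinearity of IR/NIR spectra; it is not drawn from the surveyed datasets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic "spectra": 30 samples x 500 variables driven by 5 latent factors plus
# a little noise, mimicking the strong collinearity of IR/NIR data.
latent = rng.normal(size=(30, 5))
loadings = rng.normal(size=(5, 500))
X = latent @ loadings + rng.normal(scale=0.01, size=(30, 500))

# Request enough principal components to retain 99% of the variance.
pca = PCA(n_components=0.99)
scores = pca.fit_transform(X)
n_kept = scores.shape[1]
retained = pca.explained_variance_ratio_.sum()
```

On this example, the 500 original variables collapse to a handful of principal component scores while the retained variance stays above the 99% target.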

Of note, t-SNE (t-Distributed Stochastic Neighbor Embedding) [

The cluster analysis techniques used in the 50 surveyed IR and NIR analysis papers are now evaluated. In the domain of cluster analysis, there are common steps documented across the cited machine learning references that are typically applied to ensure validity and confidence in the outcomes of the cluster analysis. These form the sections of the following review.

As a starting point before any clustering is conducted, it is prudent to evaluate the data’s

In reviewing the surveyed papers on cluster analysis of IR and NIR spectroscopy data, only one of the papers assessed their data’s clustering tendency. Zhang et al. [

Reasons for the lack of clustering tendency testing within the other papers may include the often simplistic and self-validating nature of the clustering that is being applied within many of the surveyed papers. Typically, the subjects being clustered were known groupings of materials such as different varieties of tea. Hence, clusterability may have been assumed and validated when cluster analysis delivered the expected results and correct clustering.

To have high confidence in the results of the cluster analysis and remove the possibility of delivering correct results by random chance, we recommend that a clustering tendency test is conducted. As with most aspects of clustering, there are multiple potential tests for clustering tendency and their effectiveness can be influenced by the characteristics of the data. Common techniques include the Dip test [
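One common clustering-tendency test, the Hopkins statistic, compares nearest-neighbour distances in the data against those of uniformly distributed reference points: values near 0.5 suggest spatial randomness, while values approaching 1 suggest clusterable structure. A minimal illustrative implementation (using scikit-learn for the neighbour searches, on synthetic data) might look like:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: ~0.5 for spatially random data, -> 1 for clusterable data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)

    # Nearest-neighbour search structure over the full dataset.
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w_i: distance from m sampled real points to their nearest *other* real point.
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]  # [:, 0] is the point itself

    # u_i: distance from m uniform points in the bounding box to the nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

# Two well-separated blobs should give a Hopkins statistic well above 0.5.
rng = np.random.default_rng(3)
blobs = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
h = hopkins(blobs, m=20)
```

For high-dimensional spectra, this would typically be applied after dimension reduction, since nearest-neighbour distances lose contrast in very high dimensions.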

Since the goal of clustering is to identify clusters of objects that are similar, some measure of similarity is required. The similarity measure defines how the similarity of two elements is calculated. A similarity measure may also be referred to as a distance measure, although similarity measures can also include correlation-based metrics.
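The distinction between distance-based and correlation-based measures can be illustrated with SciPy: a spectrum that is scaled and offset (as by multiplicative scatter effects) is far from the original under Euclidean distance but identical under correlation distance. The data here are synthetic and purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
spectra = rng.normal(size=(5, 100))

# Pairwise Euclidean distances between all spectra, as a square matrix.
d_euclid = squareform(pdist(spectra, metric="euclidean"))

# Correlation distance = 1 - Pearson correlation: insensitive to offset and scale.
d_corr = squareform(pdist(spectra, metric="correlation"))

# A scaled-and-offset copy of a spectrum is identical under correlation distance
# but far from the original under Euclidean distance.
pair = np.vstack([spectra[0], 2.0 * spectra[0] + 5.0])
corr_dist = pdist(pair, metric="correlation")[0]
eucl_dist = pdist(pair, metric="euclidean")[0]
```

Which behaviour is desirable depends on whether such offset and scale effects carry information or are artefacts to be ignored, which is one reason the choice of measure deserves explicit justification.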

Within the papers surveyed, Euclidean distance was the most common metric used for comparing similarity, followed by Pearson’s correlation coefficient (

Chosen measure | Instances |
---|---|
Euclidean distance [ | 30 |
Pearson’s correlation coefficient [ | 6 |
Mahalanobis distance [ | 2 |
Weighted inner product induced (fuzzy) distance [ | 1 |
None described [ | 13 |

Numerous clustering algorithms have been proposed in the literature, with new clustering algorithms continuing to appear. However, clustering algorithms can generally be divided into two forms: hierarchical and partitional [

In reviewing the clustering techniques used in the surveyed papers (

Clustering algorithm | Instances |
---|---|
Hierarchical cluster analysis (Ward’s method) [ | 19 |
Hierarchical cluster analysis (average link) [ | 5 |
Hierarchical cluster analysis (single link) [ | 3 |
Hierarchical cluster analysis (complete link) [ | 2 |
Hierarchical cluster analysis (weighted average) [ | 1 |
Hierarchical cluster analysis (median link) [ | 1 |
Hierarchical cluster analysis (centroid link) [ | 1 |
Hierarchical cluster analysis (unspecified) [ | 6 |
K-means [ | 13 |
K-means hybrid particle swarm [ | 1 |
Fuzzy C means [ | 11 |
Allied Gustafson–Kessel [ | 1 |
Gustafson–Kessel [ | 1 |
Possibilistic C-means [ | 2 |
Allied Fuzzy C-means [ | 1 |
Variable string length simulated annealing [ | 1 |
Simulated annealing fuzzy clustering [ | 1 |
Spectral cross correlation analysis [ | 1 |
DBSCAN [ | 1 |
Principal components discriminant function analysis [ | 1 |
Principal component analysis [ | 1 |

Variants of hierarchical clustering algorithms are differentiated by the rules they use to form the links between datapoints and hence, the clusters. Single link, complete link, average link and Ward’s method are four of the most popular [
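A minimal example of hierarchical clustering with Ward's method, the most common choice in the surveyed papers, can be written with SciPy. The two synthetic groups of "spectra" below are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Two synthetic groups of "spectra" (20 samples each, 50 variables).
X = np.vstack([rng.normal(0, 0.2, (20, 50)), rng.normal(3, 0.2, (20, 50))])

# Agglomerative clustering with Ward's linkage on Euclidean distances.
Z = linkage(X, method="ward")

# Cut the dendrogram into two flat clusters (labels are 1-based).
labels = fcluster(Z, t=2, criterion="maxclust")
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` for the dendrogram plots that dominate the surveyed papers; note that Ward's method assumes Euclidean distances, so swapping in a correlation-based measure would require a different linkage.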

The fuzzy clustering techniques of Fuzzy C-Means, Allied Gustafson–Kessel, Possibilistic C-Means, Allied Fuzzy C-Means, Variable String Length Simulated Annealing, and Simulated Annealing Fuzzy Clustering were the next most common technique group. K-Means clustering was also regularly applied within the surveyed papers.

Nine of the papers surveyed made comparisons between various clustering techniques. One paper reviewed the linkage techniques for hierarchical clustering, concluding that Ward’s method gave the best results for their application [

Based on these conflicting findings, it is clear that choosing a clustering algorithm for clustering IR and NIR spectroscopy data is not a simple decision. Yet, in reviewing the justifications provided in the papers for their choice of clustering algorithms (

Justification | Instances |
---|---|
None [ | 31 |
To evaluate/compare [ | 14 |
Commonly used [ | 4 |
Best (no citation) [ | 6 |
Best (with citation) [ | 2 |
Suits data/purpose [ | 3 |

In looking to techniques prominent in other machine learning domains, clustering using deep neural networks (deep clustering) is emerging in prominence. As surveyed by Min et al. [

One of the major challenges in cluster analysis is predicting the number of clusters (

In reviewing the clustering techniques used within the 50 NIR and IR spectroscopy papers (

Method | Instances |
---|---|
Not addressed [ | 30 |
A priori knowledge [ | 13 |
Manual adjustment and judgement (qualitative) [ | 4 |
Quantitative analysis [ | 3 |

Of the remaining seven papers surveyed, four used qualitative analysis and three used a quantitative analysis to predict the number of clusters. The qualitative analysis papers visualized the clustering results for various values of

Three common quantitative techniques for predicting the number of clusters include the “Elbow” method, the Gap Statistic, and the use of internal cluster validation indices (such as the Silhouette score method).

In the elbow method, the total within-cluster sum-of-squares variation is calculated and plotted
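A sketch of the elbow method using scikit-learn's KMeans on synthetic three-blob data (illustrative only; `inertia_` is the total within-cluster sum of squares):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Three well-separated synthetic blobs.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])

# Total within-cluster sum of squares (inertia) for a range of candidate k.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where the curve flattens: the drop from k=2 to k=3 is large,
# while further increases in k give only marginal improvement.
```

Plotting `inertias` against k and locating the bend visually (or via the successive differences) recovers the true number of clusters here.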

The Gap Statistic method aims at providing a statistical procedure to formalize the heuristic of the elbow method [
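A compact, illustrative implementation of the Gap Statistic compares log(W_k) for the data against uniform reference data drawn from the bounding box. This is a simplification of the full procedure (which also uses the reference standard deviation in its selection rule), and the simple argmax rule at the end is an assumption for demonstration purposes.

```python
import numpy as np
from sklearn.cluster import KMeans

def wk(X, k, seed=0):
    """Total within-cluster sum of squares for a k-means clustering with k clusters."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Gap(k) = mean(log W_k over uniform reference sets) - log W_k of the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        ref = [np.log(wk(rng.uniform(lo, hi, X.shape), k)) for _ in range(n_ref)]
        gaps.append(np.mean(ref) - np.log(wk(X, k)))
    return np.array(gaps)  # gaps[k-1] corresponds to Gap(k)

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])
gaps = gap_statistic(X)
k_hat = int(np.argmax(gaps)) + 1  # simplified rule: take the k with the largest gap
```

For well-separated data such as this, the gap rises sharply up to the true number of clusters and then flattens or declines.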

A third technique to predict the number of clusters is through the use of internal cluster validation indices. This was the quantitative approach used in two of the reviewed papers (i.e., Xie-Beni cluster validity measure for fuzzy clustering [

The overall Silhouette score for a set of
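In practice, the silhouette-based approach to choosing the number of clusters amounts to computing the average score for each candidate k and taking the maximum. A sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Three well-separated synthetic blobs.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])

# Average silhouette score for each candidate k; the best k maximizes the score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The same index doubles as an internal validation measure for the final clustering, which makes it a convenient choice in a quantitative workflow.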

Since cluster analysis is an unsupervised learning task, it can be challenging to validate the goodness of the clustering and gain confidence in the clustering results [

There are two main types of validity criteria that can be applied:

External validation was the dominant approach used in the reviewed papers, and it fits the purpose of the majority of papers: demonstrating that IR or NIR testing can correctly separate samples into classes where true labels for the samples are known. As summarized in

Validation method | Instances |
---|---|
Cluster plot visual comparison against true labels [ | 12 |
Dendrogram comparison against true labels [ | 22 |
Image visual comparison against true labels [ | 6 |
Table comparison against true labels [ | 3 |
% Correct against true labels [ | 5 |
Quality Metric (SI, Xie–Beni, etc.) [ | 5 |

When clustering is only partially correct, the task of measuring this level of correctness is less trivial. Concerningly, five of the papers reported their results as a “percentage correct” against known labels, a notion that does not match the concept of clustering. The labels generated by cluster analysis (unsupervised learning) are symbolic and based on similarity, so directly matching them to classification labels ignores a correspondence problem [

As with most aspects of clustering, there are many potential validation indices that have been proposed. Desgraupes et al. [

The V-measure or ‘Validity measure’ is the harmonic mean between the

where a

The Adjusted Rand Index (ARI) [
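Both measures are available in scikit-learn, and the small illustrative example below shows the key property that motivates them: a clustering that is perfect up to a relabelling of the symbolic cluster labels still scores 1.0, whereas a naive "percentage correct" comparison would not.

```python
from sklearn.metrics import adjusted_rand_score, v_measure_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Cluster labels are arbitrary symbols: this partition matches the true classes
# perfectly even though the label numbers differ.
perfect_relabelled = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# The same partition with one sample assigned to the wrong cluster.
one_error = [2, 2, 2, 0, 0, 0, 1, 1, 0]

ari_perfect = adjusted_rand_score(true_labels, perfect_relabelled)  # 1.0
vm_perfect = v_measure_score(true_labels, perfect_relabelled)       # 1.0
ari_err = adjusted_rand_score(true_labels, one_error)
vm_err = v_measure_score(true_labels, one_error)
```

The single misassigned sample degrades both scores gracefully rather than being masked or double-counted by an arbitrary label matching.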

Internal cluster validation is used where the true labels are not known for evaluation or there is a desire to compare the quality of clustering generated by different clustering techniques [
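Several internal indices are likewise available in scikit-learn. The illustrative sketch below contrasts a sensible clustering with random labels on synthetic data; higher silhouette and Calinski–Harabasz scores, and a lower Davies–Bouldin score, indicate better-defined clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(8)
# Two well-separated synthetic blobs.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 6)])

# Labels from a sensible clustering versus random labels, for contrast.
good = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
bad = rng.integers(0, 2, size=len(X))

sil_good, sil_bad = silhouette_score(X, good), silhouette_score(X, bad)
db_good, db_bad = davies_bouldin_score(X, good), davies_bouldin_score(X, bad)
ch_good, ch_bad = calinski_harabasz_score(X, good), calinski_harabasz_score(X, bad)
```

Because these indices need only the data and the labels, they can rank competing pre-processing choices, similarity measures, and algorithms even when no true labels exist.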

In reflecting on the findings, we will primarily focus on the clustering aspects of the analysis presented in the surveyed papers. Here, shortcomings were observed (as previously highlighted) that may indicate a lack of familiarity with some of the complexities of clustering practice by some researchers using spectroscopy. These indicators include a lack of clarity in the explanation of the cluster analysis process, missing details such as the type of linkage used in hierarchical cluster analysis or the distance metric used, and cluster validation indices not being used for validation. The rarity of cluster validation indices is a significant difference from the machine learning literature, where quantified cluster analysis is more prominent.

This is not unexpected: while clustering is certainly not a new field, it is one with challenges, complexities, uncertainties, and ambiguities that may not be appreciated by researchers for whom cluster analysis is not a primary area of research. There is limited conclusive literature available on clustering of spectroscopy data to support practitioners, and the choice of the best techniques can depend on the specific characteristics of the data being analyzed.

An additional potential contributor to the observed shortcomings is the chemometric software that is commonly used in association with IR or NIR data analysis. Many practitioners look for off-the-shelf solutions for their chemometric analysis [

Finally, the applications to which clustering was applied were simplistic in many of the surveyed papers. In an example where the aim of the research is to demonstrate IR or NIR spectroscopy can separate samples into

In order to add rigor to future cluster analysis conducted on IR and NIR spectroscopy data, an analysis model or

At this point, it is worth discussing the depth of analysis conducted at each stage of this analysis model. If full quantitative analysis and evaluation were conducted at each stage of the workflow, it could become a substantial and time-consuming package of analysis: application and evaluation of multiple pre-processing techniques; application and evaluation of multiple variable or feature selection techniques; PCA analysis; testing for tendency to cluster; application and evaluation of multiple similarity measures; application and evaluation of multiple clustering algorithms; application of quantitative clustering indices to predict the number of clusters; and application of clustering indices to evaluate the final results of the cluster analysis.

A pragmatic approach is suggested. Consideration should be given to the purpose of the analysis and its importance, i.e., early exploratory analysis may not warrant as much effort as a conclusive demonstration of a cancer detection technique aimed at widespread publication. Similarly, consideration should be given to the data itself and the challenge it presents to cluster analysis. If the data can be visually seen to be well separated and sufficiently accurate clustering can be easily achieved, then it may not warrant the evaluation of multiple techniques to achieve improved data and clustering characteristics.

A streamlined approach may be to select common or familiar approaches for data pre-processing, variable selection, similarity measure, and clustering algorithm and then evaluate the results. If sufficiently accurate clustering is achieved with these selected approaches, it may not warrant refinement and evaluation in these areas. It is however recommended that clustering tendency is tested, and the number of clusters is predicted as these are valuable indicators in the confidence of the clustering results and its applicability. Additionally, if this streamlined approach is pursued, we encourage the analysts and authors to be explicit about this approach when publishing their results and to detail why those decisions were made.

If sufficiently accurate clustering is not achieved utilizing this streamlined approach, then that is a driver for more detailed analysis and evaluation at each of the stages of the analysis model with the final clustering indices scores as the metric against which results can be assessed. Similarly, if true labelled data is not available for evaluating the results of the cluster analysis, internal clustering indices will be the metric used for assessing the outcome of the overall analysis.
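A streamlined version of such a workflow might be sketched in Python as follows. The specific choices here (SNV pre-processing, PCA retaining 99% of the variance, the number of clusters chosen by silhouette score) and the synthetic demonstration data are illustrative assumptions, not prescriptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def snv(spectra):
    """Standard normal variate pre-processing (per-spectrum centre and scale)."""
    return (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

def cluster_spectra(spectra, k_range=range(2, 8), variance=0.99, seed=0):
    """Streamlined workflow: pre-process -> PCA -> choose k by silhouette -> cluster."""
    X = snv(spectra)
    scores = PCA(n_components=variance).fit_transform(X)

    results = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scores)
        results[k] = (silhouette_score(scores, labels), labels)

    best_k = max(results, key=lambda k: results[k][0])
    return best_k, results[best_k][1]

# Synthetic demonstration: two groups of "spectra" with shifted band positions.
rng = np.random.default_rng(9)
base = np.sin(np.linspace(0, 6, 200))
group_a = base + rng.normal(0, 0.05, (25, 200))
group_b = np.roll(base, 40) + rng.normal(0, 0.05, (25, 200))
k, labels = cluster_spectra(np.vstack([group_a, group_b]))
```

Each stage of this function is a single swappable choice, so extending it to compare multiple pre-processing techniques, similarity measures, or algorithms, as the fuller analysis model recommends, is a matter of looping over alternatives and comparing the resulting index scores.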

Of note, this potentially significant volume of analysis will impose the greatest burden the first time the analysis model is implemented. If the workflow can be implemented in an analysis environment such as MATLAB, R, or Python, the time required for subsequent applications of this analysis model will be significantly less. Hence, if practitioners regularly intend to conduct cluster analysis and desire a rigorous methodology that delivers quantifiable results, establishing an extensive workflow with multiple stages of evaluation is likely to be worthwhile.

We have surveyed and reviewed 50 papers from 2002 to 2020 which apply cluster analysis to IR and NIR spectroscopy data. The analysis process used in these papers was compared to extensive literature from the machine learning domain. The findings highlighted a lack of quantitative analysis and evaluation in the NIR and IR cluster analysis. Of specific concern were a lack of testing for the data’s tendency to cluster and prediction of the number of clusters. These are key tests that can provide increased rigor and confidence, and widen the applicability of the cluster analysis.

In a bid to improve on current practice and support researchers conducting cluster analysis on IR and NIR spectroscopy data, an analysis model has been presented to highlight potential future perspectives for the cluster analysis. The proposed analysis model or workflow incorporates quantitative techniques drawn from machine learning literature to provide rigor and ensure validity of the clustering outcomes when analyzing IR and NIR spectroscopy data.