Effects of data pre-processing methods on classification of ATR-FTIR spectra of pen inks using partial least squares-discriminant analysis (PLS-DA)

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

In response to our review paper [L.C. Lee et al., Chemom. Intell. Lab. Systs. 163 (2017) 64–75], we present a study that explores the practical impacts of data preprocessing (DP) methods on ATR-FTIR spectra. Nine common DP methods, i.e. mean centering (MC), autoscaling (AS), Pareto scaling, robust scaling, multiplicative scatter correction (MSC), normalization to sum (NS), normalization to constant vector length (NV), standard normal variate and asymmetric least squares (AsLS), were chosen for their availability in the R software and their rather simple computation steps. An ATR-FTIR spectral dataset of blue gel pen inks originating from 10 different manufacturers (i.e. brands) was used in this work. The dataset is large (N = 1361), high dimensional (J = 5401), multi-class (C = 10), and imbalanced. To examine the impact of substrate interference, the global spectral region was further divided, arbitrarily, into three mutually exclusive local regions, which were analyzed independently. The resulting four sub-datasets (i.e. one based on the global region and three based on the local regions) were each preprocessed with the DP methods to produce 40 different sub-datasets, including the raw counterparts (4 regions × 10 versions, i.e. nine DP methods plus raw). Partial least squares-discriminant analysis (PLS-DA) was chosen to construct a series of 50 models by including the first 50 PLS components incrementally. The modeling was performed independently for each of the 40 sub-datasets. Each model was evaluated repeatedly using autoprediction, six variants of v-fold cross validation (v = 2, 4, 5, 7, 10, 15) and external testing. As a result, the empirical performance of each DP method is represented by 400 different error rates (8 model validation schemes × 50 models). The performance of each DP method was then compared against its raw counterpart using summary statistics and hypothesis tests.
In addition, principal component analysis and hierarchical clustering analysis were employed to illustrate, respectively, the spatial distribution of and the similarity between the nine DP methods and the raw counterparts. Several important observations were drawn from the comparative analyses. First, owing to the inherent properties of ATR-FTIR spectra, DP methods that handle slope, e.g. MSC and AsLS, appeared to perform best. Second, the normalization methods, NS and NV, ranked second. Third, MC showed no impact on the raw IR spectral dataset. Fourth, outliers in the ATR-FTIR spectra of pen inks could be localized. Finally, removal of irrelevant signals arising from the sample substrate is best achieved via region truncation rather than via PLS or DP methods alone.
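
The abstract names the DP methods without defining them. As a rough illustration only, the sketch below implements seven of the nine in plain Python using their common textbook definitions. It is not the authors' code (the study used R), the function names are hypothetical, and details such as the use of the sample standard deviation, the absolute-value sum in NS, and the mean spectrum as the MSC reference are assumptions. Robust scaling and AsLS are omitted.

```python
import math
from statistics import mean, stdev

# Row-wise methods: each spectrum (row) is corrected using only its own values.

def snv(spectrum):
    """Standard normal variate: subtract the row mean, divide by the row SD."""
    m, s = mean(spectrum), stdev(spectrum)  # assumption: sample SD (n - 1)
    return [(x - m) / s for x in spectrum]

def normalize_sum(spectrum):
    """NS: divide by the sum of absolute intensities (one common convention)."""
    total = sum(abs(x) for x in spectrum)
    return [x / total for x in spectrum]

def normalize_vector(spectrum):
    """NV: divide by the Euclidean norm so the row has unit length."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum]

def msc(X):
    """Multiplicative scatter correction against the mean spectrum.

    Each row is regressed on the reference (row = a + b * ref) and
    corrected as (row - a) / b. Assumes the reference is not constant.
    """
    ref = [mean(col) for col in zip(*X)]
    ref_mean = mean(ref)
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for row in X:
        row_mean = mean(row)
        slope = sum((r - ref_mean) * (x - row_mean)
                    for r, x in zip(ref, row)) / sxx
        intercept = row_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in row])
    return corrected

# Column-wise methods: each variable (wavenumber) is centred and scaled
# using statistics computed across all spectra.

def column_scale(X, method="mc"):
    """Mean centering ("mc"), autoscaling ("as") or Pareto scaling ("pareto").

    Assumes no column is constant, otherwise the SD divisor would be zero.
    """
    cols = list(zip(*X))
    means = [mean(c) for c in cols]
    if method == "mc":
        divisors = [1.0] * len(cols)
    elif method == "as":
        divisors = [stdev(c) for c in cols]
    elif method == "pareto":
        divisors = [math.sqrt(stdev(c)) for c in cols]
    else:
        raise ValueError(f"unknown method: {method}")
    return [[(x - m) / d for x, m, d in zip(row, means, divisors)]
            for row in X]

# Toy data: 3 "spectra" with 4 variables each (made-up numbers).
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [3.0, 5.0, 7.0, 9.0]]

X_snv = [snv(row) for row in X]
X_msc = msc(X)
X_as = column_scale(X, "as")
```

The split into row-wise and column-wise functions reflects a practical difference: row-wise corrections can be applied to a new spectrum on its own, whereas column-wise scaling (and the MSC reference) must be estimated from the training spectra and reused unchanged on validation or test spectra.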

Original language: English
Pages (from-to): 90-100
Number of pages: 11
Journal: Chemometrics and Intelligent Laboratory Systems
Volume: 182
DOI: 10.1016/j.chemolab.2018.09.001
Publication status: Published - 15 Nov 2018

Fingerprint

Discriminant analysis
Ink
Processing
Substrates
Principal component analysis
Spatial distribution
Gels
Statistics
Availability
Testing

Keywords

  • Data preprocessing
  • Forensic science
  • IR spectrum
  • PLS-DA

ASJC Scopus subject areas

  • Analytical Chemistry
  • Software
  • Process Chemistry and Technology
  • Spectroscopy
  • Computer Science Applications

Cite this

@article{de1a371255f143739d5a91f20fc68da7,
title = "Effects of data pre-processing methods on classification of ATR-FTIR spectra of pen inks using partial least squares-discriminant analysis (PLS-DA)",
keywords = "Data preprocessing, Forensic science, IR spectrum, PLS-DA",
author = "Lee, {Loong Chuen} and Liong, {Choong Yeun} and Jemain, {Abdul Aziz}",
year = "2018",
month = "11",
day = "15",
doi = "10.1016/j.chemolab.2018.09.001",
language = "English",
volume = "182",
pages = "90--100",
journal = "Chemometrics and Intelligent Laboratory Systems",
issn = "0169-7439",
publisher = "Elsevier",

}

TY - JOUR

T1 - Effects of data pre-processing methods on classification of ATR-FTIR spectra of pen inks using partial least squares-discriminant analysis (PLS-DA)

AU - Lee, Loong Chuen

AU - Liong, Choong Yeun

AU - Jemain, Abdul Aziz

PY - 2018/11/15

Y1 - 2018/11/15

KW - Data preprocessing

KW - Forensic science

KW - IR spectrum

KW - PLS-DA

UR - http://www.scopus.com/inward/record.url?scp=85053528741&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85053528741&partnerID=8YFLogxK

U2 - 10.1016/j.chemolab.2018.09.001

DO - 10.1016/j.chemolab.2018.09.001

M3 - Article

VL - 182

SP - 90

EP - 100

JO - Chemometrics and Intelligent Laboratory Systems

JF - Chemometrics and Intelligent Laboratory Systems

SN - 0169-7439

ER -