Predictive modelling of colossal ATR-FTIR spectral data using PLS-DA: Empirical differences between PLS1-DA and PLS2-DA algorithms

Loong Chuen Lee, Abdul Aziz Jemain

Research output: Contribution to journalArticle

1 Citation (Scopus)

Abstract

In response to our review paper [L. C. Lee et al., Analyst, 2018, 143, 3526-3539], we present a study that compares empirical differences between PLS1-DA and PLS2-DA algorithms in modelling a colossal ATR-FTIR spectral dataset. Over the past two decades, partial least squares-discriminant analysis (PLS-DA) has gained wide acceptance and huge popularity in the field of applied research, partly due to its dimensionality reduction capability and ability to handle multicollinear and correlated variables. To solve a K-class problem (K > 2) using PLS-DA and high-dimensional data like infrared spectra, one can construct either K one-versus-all PLS1-DA models or only one PLS2-DA model. The aim of this work is to explore empirical differences between the two PLS-DA algorithms in modeling a colossal ATR-FTIR spectral dataset. The practical task is to build a prediction model using the imbalanced, high dimensional, colossal and multi-class ATR-FTIR spectra of blue gel pen inks. Four different sub-datasets were prepared from the principal dataset by considering the raw and asymmetric least squares (AsLS) preprocessed forms: (a) Raw-global region; (b) Raw-local region; (c) AsLS-global region; and (d) AsLS-local region. A series of 50 models which includes the first 50 PLS components incrementally was constructed repeatedly using the four sub-datasets. Each model was evaluated using six different variants of v-fold cross validation, autoprediction and external testing methods. As a result, each PLS-DA algorithm was represented by a number of figures of merit. The differences between PLS1-DA and PLS2-DA algorithms were assessed using hypothesis tests with respect to model accuracy, stability and fitting. On the other hand, confusion matrices of the two PLS-DA algorithms were inspected carefully for assessment of model parsimony. Overall, both the algorithms presented satisfactory model accuracy and stability. Nonetheless, PLS1-DA models showed significantly higher accuracy rates than PLS2-DA models, whereas PLS2-DA models seem to be much more stable compared to PLS1-DA models. Eventually, PLS2-DA also proved to be less prone to overfitting and is more parsimonious than PLS1-DA. In conclusion, the relatively high accuracy of the PLS1-DA algorithm is achieved at the cost of rather low parsimony and stability, and with an increased risk of overfitting.

Original languageEnglish
Pages (from-to)2670-2678
Number of pages9
JournalAnalyst
Volume144
Issue number8
DOIs
Publication statusPublished - 21 Apr 2019

Fingerprint

Discriminant Analysis
Discriminant analysis
Fourier Transform Infrared Spectroscopy
discriminant analysis
Least-Squares Analysis
modeling
Ink
testing method
Gels
Datasets
gel
Research
fold
Infrared radiation

ASJC Scopus subject areas

  • Analytical Chemistry
  • Biochemistry
  • Environmental Chemistry
  • Spectroscopy
  • Electrochemistry

Cite this

Predictive modelling of colossal ATR-FTIR spectral data using PLS-DA : Empirical differences between PLS1-DA and PLS2-DA algorithms. / Lee, Loong Chuen; Jemain, Abdul Aziz.

In: Analyst, Vol. 144, No. 8, 21.04.2019, p. 2670-2678.

Research output: Contribution to journalArticle

@article{989e9d81d255405f9294ea372ac560d1,
title = "Predictive modelling of colossal ATR-FTIR spectral data using PLS-DA: Empirical differences between PLS1-DA and PLS2-DA algorithms",
abstract = "In response to our review paper [L. C. Lee et al., Analyst, 2018, 143, 3526-3539], we present a study that compares empirical differences between PLS1-DA and PLS2-DA algorithms in modelling a colossal ATR-FTIR spectral dataset. Over the past two decades, partial least squares-discriminant analysis (PLS-DA) has gained wide acceptance and huge popularity in the field of applied research, partly due to its dimensionality reduction capability and ability to handle multicollinear and correlated variables. To solve a K-class problem (K > 2) using PLS-DA and high-dimensional data like infrared spectra, one can construct either K one-versus-all PLS1-DA models or only one PLS2-DA model. The aim of this work is to explore empirical differences between the two PLS-DA algorithms in modeling a colossal ATR-FTIR spectral dataset. The practical task is to build a prediction model using the imbalanced, high dimensional, colossal and multi-class ATR-FTIR spectra of blue gel pen inks. Four different sub-datasets were prepared from the principal dataset by considering the raw and asymmetric least squares (AsLS) preprocessed forms: (a) Raw-global region; (b) Raw-local region; (c) AsLS-global region; and (d) AsLS-local region. A series of 50 models which includes the first 50 PLS components incrementally was constructed repeatedly using the four sub-datasets. Each model was evaluated using six different variants of v-fold cross validation, autoprediction and external testing methods. As a result, each PLS-DA algorithm was represented by a number of figures of merit. The differences between PLS1-DA and PLS2-DA algorithms were assessed using hypothesis tests with respect to model accuracy, stability and fitting. On the other hand, confusion matrices of the two PLS-DA algorithms were inspected carefully for assessment of model parsimony. Overall, both the algorithms presented satisfactory model accuracy and stability. Nonetheless, PLS1-DA models showed significantly higher accuracy rates than PLS2-DA models, whereas PLS2-DA models seem to be much more stable compared to PLS1-DA models. Eventually, PLS2-DA also proved to be less prone to overfitting and is more parsimonious than PLS1-DA. In conclusion, the relatively high accuracy of the PLS1-DA algorithm is achieved at the cost of rather low parsimony and stability, and with an increased risk of overfitting.",
author = "Lee, {Loong Chuen} and Jemain, {Abdul Aziz}",
year = "2019",
month = "4",
day = "21",
doi = "10.1039/c8an02074d",
language = "English",
volume = "144",
pages = "2670--2678",
journal = "The Analyst",
issn = "0003-2654",
publisher = "Royal Society of Chemistry",
number = "8",

}

TY - JOUR

T1 - Predictive modelling of colossal ATR-FTIR spectral data using PLS-DA

T2 - Empirical differences between PLS1-DA and PLS2-DA algorithms

AU - Lee, Loong Chuen

AU - Jemain, Abdul Aziz

PY - 2019/4/21

Y1 - 2019/4/21

N2 - In response to our review paper [L. C. Lee et al., Analyst, 2018, 143, 3526-3539], we present a study that compares empirical differences between PLS1-DA and PLS2-DA algorithms in modelling a colossal ATR-FTIR spectral dataset. Over the past two decades, partial least squares-discriminant analysis (PLS-DA) has gained wide acceptance and huge popularity in the field of applied research, partly due to its dimensionality reduction capability and ability to handle multicollinear and correlated variables. To solve a K-class problem (K > 2) using PLS-DA and high-dimensional data like infrared spectra, one can construct either K one-versus-all PLS1-DA models or only one PLS2-DA model. The aim of this work is to explore empirical differences between the two PLS-DA algorithms in modeling a colossal ATR-FTIR spectral dataset. The practical task is to build a prediction model using the imbalanced, high dimensional, colossal and multi-class ATR-FTIR spectra of blue gel pen inks. Four different sub-datasets were prepared from the principal dataset by considering the raw and asymmetric least squares (AsLS) preprocessed forms: (a) Raw-global region; (b) Raw-local region; (c) AsLS-global region; and (d) AsLS-local region. A series of 50 models which includes the first 50 PLS components incrementally was constructed repeatedly using the four sub-datasets. Each model was evaluated using six different variants of v-fold cross validation, autoprediction and external testing methods. As a result, each PLS-DA algorithm was represented by a number of figures of merit. The differences between PLS1-DA and PLS2-DA algorithms were assessed using hypothesis tests with respect to model accuracy, stability and fitting. On the other hand, confusion matrices of the two PLS-DA algorithms were inspected carefully for assessment of model parsimony. Overall, both the algorithms presented satisfactory model accuracy and stability. Nonetheless, PLS1-DA models showed significantly higher accuracy rates than PLS2-DA models, whereas PLS2-DA models seem to be much more stable compared to PLS1-DA models. Eventually, PLS2-DA also proved to be less prone to overfitting and is more parsimonious than PLS1-DA. In conclusion, the relatively high accuracy of the PLS1-DA algorithm is achieved at the cost of rather low parsimony and stability, and with an increased risk of overfitting.

AB - In response to our review paper [L. C. Lee et al., Analyst, 2018, 143, 3526-3539], we present a study that compares empirical differences between PLS1-DA and PLS2-DA algorithms in modelling a colossal ATR-FTIR spectral dataset. Over the past two decades, partial least squares-discriminant analysis (PLS-DA) has gained wide acceptance and huge popularity in the field of applied research, partly due to its dimensionality reduction capability and ability to handle multicollinear and correlated variables. To solve a K-class problem (K > 2) using PLS-DA and high-dimensional data like infrared spectra, one can construct either K one-versus-all PLS1-DA models or only one PLS2-DA model. The aim of this work is to explore empirical differences between the two PLS-DA algorithms in modeling a colossal ATR-FTIR spectral dataset. The practical task is to build a prediction model using the imbalanced, high dimensional, colossal and multi-class ATR-FTIR spectra of blue gel pen inks. Four different sub-datasets were prepared from the principal dataset by considering the raw and asymmetric least squares (AsLS) preprocessed forms: (a) Raw-global region; (b) Raw-local region; (c) AsLS-global region; and (d) AsLS-local region. A series of 50 models which includes the first 50 PLS components incrementally was constructed repeatedly using the four sub-datasets. Each model was evaluated using six different variants of v-fold cross validation, autoprediction and external testing methods. As a result, each PLS-DA algorithm was represented by a number of figures of merit. The differences between PLS1-DA and PLS2-DA algorithms were assessed using hypothesis tests with respect to model accuracy, stability and fitting. On the other hand, confusion matrices of the two PLS-DA algorithms were inspected carefully for assessment of model parsimony. Overall, both the algorithms presented satisfactory model accuracy and stability. Nonetheless, PLS1-DA models showed significantly higher accuracy rates than PLS2-DA models, whereas PLS2-DA models seem to be much more stable compared to PLS1-DA models. Eventually, PLS2-DA also proved to be less prone to overfitting and is more parsimonious than PLS1-DA. In conclusion, the relatively high accuracy of the PLS1-DA algorithm is achieved at the cost of rather low parsimony and stability, and with an increased risk of overfitting.

UR - http://www.scopus.com/inward/record.url?scp=85064179750&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064179750&partnerID=8YFLogxK

U2 - 10.1039/c8an02074d

DO - 10.1039/c8an02074d

M3 - Article

C2 - 30849143

AN - SCOPUS:85064179750

VL - 144

SP - 2670

EP - 2678

JO - The Analyst

JF - The Analyst

SN - 0003-2654

IS - 8

ER -