Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

External testing (ET) is preferred over auto-prediction (AP) or k-fold cross-validation for obtaining a more realistic estimate of the predictive ability of a statistical model. With IR spectra, the Kennard-Stone (KS) sampling algorithm is often used to split the data into training and test sets, used respectively for model construction and model testing. Iterative random sampling (IRS), on the other hand, has not been the favored choice, although it is theoretically more likely to produce a reliable estimate. The aim of this preliminary work is to compare the performance of KS and IRS in sampling a representative training set from an attenuated total reflectance-Fourier transform infrared spectral dataset (of four varieties of blue gel pen inks) for PLS2-DA modeling. The 'best' performance achievable from the dataset was estimated with AP on the full dataset (APF, error). Both IRS (n = 200) and KS were used to split the dataset in a 7:3 ratio. The classic decision rule (i.e. maximum value-based) was employed for new sample prediction via partial least squares-discriminant analysis (PLS2-DA). The error rate of each model was estimated repeatedly via: (a) AP on the full data (APF, error); (b) AP on the training set (APS, error); and (c) ET on the respective test set (ETS, error). A good PLS2-DA model is expected to produce APS, error and ETS, error values that are similar to the APF, error. With that in mind, the similarities between (a) APS, error vs. APF, error; (b) ETS, error vs. APF, error; and (c) APS, error vs. ETS, error were evaluated using correlation tests (i.e. Pearson's correlation and Spearman's rank correlation tests) on the series of PLS2-DA models computed from the KS-set and the IRS-set, respectively. Overall, models constructed from the IRS-set exhibit more similarity between the internal and external error rates than those from the KS-set, i.e. less risk of overfitting. In conclusion, IRS is more reliable than KS in sampling a representative training set.
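The two splitting strategies and the maximum value-based decision rule named in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`kennard_stone`, `random_splits`, `max_value_rule`, `error_rate`) are hypothetical, Euclidean distance is assumed for KS, "n = 200" is assumed to mean 200 random splits, and the random splits here are not stratified by ink variety because the abstract does not say whether they were.

```python
import math
import random


def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule.

    Assumes Euclidean distance; the abstract does not state the metric.
    """
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(X)")
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Start with the two samples that are farthest apart.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_select:
        # Add the candidate whose nearest already-selected sample is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


def random_splits(n_samples, train_fraction=0.7, n_iter=200, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random shuffles.

    Unstratified by assumption; the seed is only there for reproducibility.
    """
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_iter):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def max_value_rule(y_hat_row):
    """Assign the class whose predicted dummy response is the largest."""
    return max(range(len(y_hat_row)), key=lambda c: y_hat_row[c])


def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the true class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy usage: a 7:3 split of ten one-dimensional "spectra".
toy = [[float(v)] for v in (0, 1, 2, 3, 5, 8, 9, 12, 15, 20)]
ks_train = kennard_stone(toy, 7)
ks_test = [i for i in range(len(toy)) if i not in ks_train]
irs_train, irs_test = next(random_splits(len(toy), 0.7, n_iter=1))
```

KS is deterministic, so it yields a single training set that covers the extremes of the data, whereas each random split gives a different training set; repeating the random split many times is what allows the spread of the internal and external error rates to be compared.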

Original language: English
Title of host publication: 2017 UKM FST Postgraduate Colloquium
Subtitle of host publication: Proceedings of the University Kebangsaan Malaysia, Faculty of Science and Technology 2017 Postgraduate Colloquium
Publisher: American Institute of Physics Inc.
Volume: 1940
ISBN (Electronic): 9780735416321
DOI: https://doi.org/10.1063/1.5028031
Publication status: Published - 4 Apr 2018
Event: 2017 UKM FST Postgraduate Colloquium - Selangor, Malaysia
Duration: 12 Jul 2017 to 13 Jul 2017

Other

Other: 2017 UKM FST Postgraduate Colloquium
Country: Malaysia
City: Selangor
Period: 12/7/17 to 13/7/17

Keywords

  • Forensic science
  • IR spectrum
  • Kennard-stone sampling
  • PLS-DA
  • random sampling

ASJC Scopus subject areas

  • Physics and Astronomy (all)

Cite this

Lee, L. C., Liong, C. Y., & Jemain, A. A. (2018). Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA. In 2017 UKM FST Postgraduate Colloquium: Proceedings of the University Kebangsaan Malaysia, Faculty of Science and Technology 2017 Postgraduate Colloquium (Vol. 1940). [020116] American Institute of Physics Inc. https://doi.org/10.1063/1.5028031

@inproceedings{e63c0c89790646a9a7851940d440297f,
title = "Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA",
keywords = "Forensic science, IR spectrum, Kennard-stone sampling, PLS-DA, random sampling",
author = "Lee, {Loong Chuen} and Liong, {Choong Yeun} and Jemain, {Abdul Aziz}",
year = "2018",
month = "4",
day = "4",
doi = "10.1063/1.5028031",
language = "English",
volume = "1940",
booktitle = "2017 UKM FST Postgraduate Colloquium",
publisher = "American Institute of Physics Inc.",

}

TY - GEN

T1 - Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA

AU - Lee, Loong Chuen

AU - Liong, Choong Yeun

AU - Jemain, Abdul Aziz

PY - 2018/4/4

Y1 - 2018/4/4

KW - Forensic science

KW - IR spectrum

KW - Kennard-stone sampling

KW - PLS-DA

KW - random sampling

UR - http://www.scopus.com/inward/record.url?scp=85045658197&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045658197&partnerID=8YFLogxK

U2 - 10.1063/1.5028031

DO - 10.1063/1.5028031

M3 - Conference contribution

VL - 1940

BT - 2017 UKM FST Postgraduate Colloquium

PB - American Institute of Physics Inc.

ER -