Validity of the best practice in splitting data for hold-out validation strategy as performed on the ink strokes in the context of forensic science

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

External testing (ET), also known as hold-out validation, is currently considered one of the most reliable ways to estimate the predictive ability of a statistical model. One safeguard against impermissible peeking in ET is to ensure that all replicates of a particular sample are included in either the training set or the test set, but not both. Suppose a sample X1 consists of two replicates, X1a and X1b. The model is said to benefit from impermissible peeking if X1a and X1b are split between the training and test sets, respectively. The resulting model is then expected to predict the test set easily and so to report over-optimistic performance. In forensic document examination, an individual pen (IP) can be used to produce multiple ink strokes. In practice, pens are manufactured in bulk, so that one large tank of ink is used to fill many IPs. In other words, ink strokes produced by different IPs of the same pen model originate from a single source (i.e. the same tank of ink). With respect to the aforementioned safeguard, how should one treat the ink strokes: as replicates or as independent samples? In this context, the aim of this work is to investigate the validity of the safeguard when splitting a dataset for hold-out validation (i.e. ET) in forensic pen ink analysis. Infrared (IR) spectra of blue gel pen inks were used to demonstrate the practical aspect. The IR spectral data were collected from 1361 ink strokes originating from 273 IPs of 23 pen models and 10 pen brands.
Iterative stratified random sampling was employed to prepare 1000 pairs of training and test sets, split at a ratio of 7:3 according to two different principles: (a) set IP, in which selection was conducted at the IP level so that all ink strokes originating from a particular IP were included in either the training set or the test set only; and (b) set NIP, in which ink strokes of a particular IP were allowed to be spread between the training and test sets. For each dataset, a series of 50 PLS-DA models was constructed by incrementally including the first 50 PLS components, and each model was validated via auto-prediction and ET. The performances of the IP and NIP model series were then compared with respect to (a) model accuracy, (b) model stability, and (c) model fitting. In conclusion, the NIP model series shows no evidence of benefiting from impermissible peeking, since the NIP and IP model series exhibit quite similar performance in all three aspects.
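
The two splitting principles described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the field names ("model", "pen", "stroke") and the helper functions are all hypothetical, stratification is assumed to be by pen model, and the PLS-DA modelling step is left out. It only shows how a split at the IP level differs from a split at the stroke level (NIP).

```python
import random
from collections import defaultdict

# Hypothetical toy data: 3 pen models x 10 individual pens (IPs) x 5 strokes.
# Field names are illustrative assumptions, not taken from the paper.
strokes = [
    {"model": m, "pen": f"{m}-pen{p}", "stroke": s}
    for m in ("A", "B", "C")
    for p in range(10)
    for s in range(5)
]


def split_nip(strokes, test_frac=0.3, seed=0):
    """Set NIP: within each pen model, sample individual strokes.

    Strokes of one pen may end up in both the training and the test set.
    """
    rng = random.Random(seed)
    by_model = defaultdict(list)
    for s in strokes:
        by_model[s["model"]].append(s)
    train, test = [], []
    for model in sorted(by_model):
        items = list(by_model[model])
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def split_ip(strokes, test_frac=0.3, seed=0):
    """Set IP: within each pen model, sample whole pens.

    All strokes of a given pen go to exactly one side of the split.
    """
    rng = random.Random(seed)
    pens_by_model = defaultdict(set)
    for s in strokes:
        pens_by_model[s["model"]].add(s["pen"])
    test_pens = set()
    for model in sorted(pens_by_model):
        pens = sorted(pens_by_model[model])
        rng.shuffle(pens)
        n_test = round(len(pens) * test_frac)
        test_pens.update(pens[:n_test])
    train = [s for s in strokes if s["pen"] not in test_pens]
    test = [s for s in strokes if s["pen"] in test_pens]
    return train, test


def shared_pens(train, test):
    """Pens that contribute strokes to both sets."""
    return {s["pen"] for s in train} & {s["pen"] for s in test}


if __name__ == "__main__":
    # The study repeated the split 1000 times; 5 repeats keep the demo short.
    for seed in range(5):
        ip_train, ip_test = split_ip(strokes, seed=seed)
        nip_train, nip_test = split_nip(strokes, seed=seed)
        print(
            seed,
            "IP shared pens:", len(shared_pens(ip_train, ip_test)),
            "NIP shared pens:", len(shared_pens(nip_train, nip_test)),
        )
```

In a fuller experiment, each pair of sets would feed a series of classifiers (PLS-DA in the study), and the test-set accuracies of the IP and NIP series would be compared across repetitions.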

Original language: English
Pages (from-to): 125-133
Number of pages: 9
Journal: Microchemical Journal
Volume: 139
DOIs: 10.1016/j.microc.2018.02.009
Publication status: Published - 1 Jun 2018

Keywords

  • Data splitting
  • Forensic science
  • IR spectrum
  • Model validation
  • PLS-DA
  • Replicates

ASJC Scopus subject areas

  • Analytical Chemistry
  • Spectroscopy

Cite this

@article{fb62cab241cb4b5ca317a6d239b88653,
title = "Validity of the best practice in splitting data for hold-out validation strategy as performed on the ink strokes in the context of forensic science",
keywords = "Data splitting, Forensic science, IR spectrum, Model validation, PLS-DA, Replicates",
author = "Lee, {Loong Chuen} and Liong, {Choong Yeun} and Jemain, {Abdul Aziz}",
year = "2018",
month = "6",
day = "1",
doi = "10.1016/j.microc.2018.02.009",
language = "English",
volume = "139",
pages = "125--133",
journal = "Microchemical Journal",
issn = "0026-265X",
publisher = "Elsevier Inc.",

}

TY  - JOUR
T1  - Validity of the best practice in splitting data for hold-out validation strategy as performed on the ink strokes in the context of forensic science
AU  - Lee, Loong Chuen
AU  - Liong, Choong Yeun
AU  - Jemain, Abdul Aziz
PY  - 2018/6/1
Y1  - 2018/6/1
KW  - Data splitting
KW  - Forensic science
KW  - IR spectrum
KW  - Model validation
KW  - PLS-DA
KW  - Replicates
UR  - http://www.scopus.com/inward/record.url?scp=85042503383&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85042503383&partnerID=8YFLogxK
U2  - 10.1016/j.microc.2018.02.009
DO  - 10.1016/j.microc.2018.02.009
M3  - Article
VL  - 139
SP  - 125
EP  - 133
JO  - Microchemical Journal
JF  - Microchemical Journal
SN  - 0026-265X
ER  -