A comparison of various imputation methods for missing values in air quality data

Research output: Contribution to journal (Article)

13 Citations (Scopus)

Abstract

This paper presents various imputation methods for air quality data, specifically in Malaysia. The main objectives were to select the best imputation method and to determine whether the performance of the methods differed between stations in Peninsular Malaysia. Missing data were randomly simulated at rates of 5, 10, 15, 20, 25 and 30%. The six methods used were mean substitution, median substitution, expectation-maximization (EM), singular value decomposition (SVD), K-nearest neighbour (KNN) and sequential K-nearest neighbour (SKNN). The performance of the imputations was compared using three performance indicators: the correlation coefficient (R), the index of agreement (d) and the mean absolute error (MAE). Based on the results obtained, EM, KNN and SKNN were the three best methods. The same result was obtained for all eight monitoring stations used in this study.
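The evaluation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the series is synthetic (a hypothetical gamma-distributed pollutant series), only mean and median substitution are shown, the index of agreement is assumed to be Willmott's form, and the indicators are assumed to be computed over the full series after imputation. All function names are hypothetical.

```python
import numpy as np


def mask_at_random(values, fraction, rng):
    """Return a copy of `values` with `fraction` of entries set to NaN at random."""
    values = np.asarray(values, dtype=float)
    n_missing = int(round(fraction * values.size))
    idx = rng.choice(values.size, size=n_missing, replace=False)
    masked = values.copy()
    masked[idx] = np.nan
    return masked, idx


def impute_constant(masked, how="mean"):
    """Mean or median substitution: fill every NaN with one summary statistic."""
    fill = np.nanmean(masked) if how == "mean" else np.nanmedian(masked)
    return np.where(np.isnan(masked), fill, masked)


def mae(observed, predicted):
    """Mean absolute error between the complete and the imputed series."""
    return float(np.mean(np.abs(predicted - observed)))


def correlation(observed, predicted):
    """Pearson correlation coefficient R."""
    return float(np.corrcoef(observed, predicted)[0, 1])


def index_of_agreement(observed, predicted):
    """Index of agreement d (assumed here to be Willmott's definition)."""
    o_bar = observed.mean()
    num = np.sum((predicted - observed) ** 2)
    den = np.sum((np.abs(predicted - o_bar) + np.abs(observed - o_bar)) ** 2)
    return float(1.0 - num / den)


# Synthetic stand-in for one station's series (an assumption, not the paper's data).
rng = np.random.default_rng(0)
series = rng.gamma(shape=2.0, scale=25.0, size=500)

for fraction in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    masked, _ = mask_at_random(series, fraction, rng)
    for how in ("mean", "median"):
        filled = impute_constant(masked, how)
        print(
            f"{int(fraction * 100):>2d}% {how:<6s} "
            f"R={correlation(series, filled):.3f} "
            f"d={index_of_agreement(series, filled):.3f} "
            f"MAE={mae(series, filled):.3f}"
        )
```

Higher R and d and lower MAE indicate a better imputation. EM, SVD, KNN and SKNN would slot into the same loop in place of `impute_constant`.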

Original language: English
Pages (from-to): 449-456
Number of pages: 8
Journal: Sains Malaysiana
Volume: 44
Issue number: 3
Publication status: Published - 1 Mar 2015

Fingerprint

Air quality
Singular value decomposition
Substitution reactions
Monitoring

Keywords

  • Imputation techniques
  • Missing data
  • Performance indicators

ASJC Scopus subject areas

  • General

Cite this

A comparison of various imputation methods for missing values in air quality data. / Ahmat Zainuri, Nuryazmin; Jemain, Abdul Aziz; Muda, Nora.

In: Sains Malaysiana, Vol. 44, No. 3, 01.03.2015, p. 449-456.

@article{2024d4f5f02943cdb956030dbfb4ab16,
title = "A comparison of various imputation methods for missing values in air quality data",
abstract = "This paper presents various imputation methods for air quality data, specifically in Malaysia. The main objectives were to select the best imputation method and to determine whether the performance of the methods differed between stations in Peninsular Malaysia. Missing data were randomly simulated at rates of 5, 10, 15, 20, 25 and 30{\%}. The six methods used were mean substitution, median substitution, expectation-maximization (EM), singular value decomposition (SVD), K-nearest neighbour (KNN) and sequential K-nearest neighbour (SKNN). The performance of the imputations was compared using three performance indicators: the correlation coefficient (R), the index of agreement (d) and the mean absolute error (MAE). Based on the results obtained, EM, KNN and SKNN were the three best methods. The same result was obtained for all eight monitoring stations used in this study.",
keywords = "Imputation techniques, Missing data, Performance indicators",
author = "{Ahmat Zainuri}, Nuryazmin and Jemain, {Abdul Aziz} and Muda, Nora",
year = "2015",
month = "3",
day = "1",
language = "English",
volume = "44",
pages = "449--456",
journal = "Sains Malaysiana",
issn = "0126-6039",
publisher = "Penerbit Universiti Kebangsaan Malaysia",
number = "3",

}

TY - JOUR

T1 - A comparison of various imputation methods for missing values in air quality data

AU - Ahmat Zainuri, Nuryazmin

AU - Jemain, Abdul Aziz

AU - Muda, Nora

PY - 2015/3/1

Y1 - 2015/3/1

AB - This paper presents various imputation methods for air quality data, specifically in Malaysia. The main objectives were to select the best imputation method and to determine whether the performance of the methods differed between stations in Peninsular Malaysia. Missing data were randomly simulated at rates of 5, 10, 15, 20, 25 and 30%. The six methods used were mean substitution, median substitution, expectation-maximization (EM), singular value decomposition (SVD), K-nearest neighbour (KNN) and sequential K-nearest neighbour (SKNN). The performance of the imputations was compared using three performance indicators: the correlation coefficient (R), the index of agreement (d) and the mean absolute error (MAE). Based on the results obtained, EM, KNN and SKNN were the three best methods. The same result was obtained for all eight monitoring stations used in this study.

KW - Imputation techniques

KW - Missing data

KW - Performance indicators

UR - http://www.scopus.com/inward/record.url?scp=84942327539&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942327539&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84942327539

VL - 44

SP - 449

EP - 456

JO - Sains Malaysiana

JF - Sains Malaysiana

SN - 0126-6039

IS - 3

ER -