Outliers detection for Pareto distributed data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This study examines the presence of outliers in the upper tail of the Malaysian income distribution under the assumption that the data follow a Pareto model. For this purpose, three types of boxplot are considered: the standard boxplot, the adjusted boxplot and the generalized boxplot. The performance of these boxplots is assessed by a simulation study in which data were simulated from the Pareto distribution P(1, α) with α = 2, 3, 4 and then contaminated by replacing a proportion ϵ (3%, 5%, 10%) of randomly selected observations. The generalized boxplot is found to give higher power than the standard and adjusted boxplots. Therefore, the generalized boxplot was used to determine the presence of outliers in the upper tail of the income distribution, while the threshold for Pareto tail modelling was determined using Van Kerm's formula. The results showed that the generalized boxplot detected 0.4%, 0.4%, 0.9% and 1.2% outliers in the household income data exceeding the threshold for the years 2007, 2009, 2012 and 2014, respectively.
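
The simulation design described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the contaminating model (clean draws multiplied by a hypothetical `scale_factor`), the sample size, and all function names are assumptions, because the abstract does not specify them. Only the standard boxplot rule with the usual 1.5 × IQR upper fence is shown; the adjusted and generalized boxplots require additional skewness and tail-weight estimates that are not reproduced here.

```python
import random
import statistics


def simulate_contaminated_pareto(n, alpha, eps, scale_factor=10.0, seed=0):
    """Draw n values from Pareto(x_min=1, alpha) and contaminate a share eps.

    ASSUMPTION: each replaced value is a fresh Pareto draw multiplied by
    scale_factor. The abstract does not state the contaminating model.
    """
    rng = random.Random(seed)
    # random.paretovariate(alpha) samples a Pareto law with minimum value 1.
    data = [rng.paretovariate(alpha) for _ in range(n)]
    n_out = round(eps * n)
    planted = set(rng.sample(range(n), n_out))  # indices to be replaced
    for i in planted:
        data[i] = scale_factor * rng.paretovariate(alpha)
    return data, planted


def standard_boxplot_upper_fence(data, k=1.5):
    """Upper fence of the standard boxplot: Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 + k * (q3 - q1)


def detection_power(data, planted, fence):
    """Share of planted outliers that lie above the fence."""
    if not planted:
        return 0.0
    hits = sum(1 for i in planted if data[i] > fence)
    return hits / len(planted)


def false_alarm_rate(data, planted, fence):
    """Share of uncontaminated observations that lie above the fence."""
    clean = [x for i, x in enumerate(data) if i not in planted]
    return sum(1 for x in clean if x > fence) / len(clean)


if __name__ == "__main__":
    # Grid taken from the abstract: alpha in {2, 3, 4}, eps in {3%, 5%, 10%}.
    for alpha in (2, 3, 4):
        for eps in (0.03, 0.05, 0.10):
            data, planted = simulate_contaminated_pareto(1000, alpha, eps)
            fence = standard_boxplot_upper_fence(data)
            print(alpha, eps,
                  round(detection_power(data, planted, fence), 3),
                  round(false_alarm_rate(data, planted, fence), 3))
```

Reporting the false-alarm rate next to the power is a design choice of this sketch: for heavy-tailed data a symmetric fence may flag many legitimate tail observations, which is one motivation for skewness-aware boxplot variants.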

Original language: English
Title of host publication: 2017 UKM FST Postgraduate Colloquium
Subtitle of host publication: Proceedings of the University Kebangsaan Malaysia, Faculty of Science and Technology 2017 Postgraduate Colloquium
Publisher: American Institute of Physics Inc.
Volume: 1940
ISBN (Electronic): 9780735416321
DOIs: https://doi.org/10.1063/1.5028034
Publication status: Published - 4 Apr 2018
Event: 2017 UKM FST Postgraduate Colloquium - Selangor, Malaysia
Duration: 12 Jul 2017 to 13 Jul 2017

Other

Other: 2017 UKM FST Postgraduate Colloquium
Country: Malaysia
City: Selangor
Period: 12/7/17 to 13/7/17

Fingerprint

income
thresholds
proportion
simulation

ASJC Scopus subject areas

  • Physics and Astronomy (all)

Cite this

Mohd Safari, M. A., Masseran, N., & Ibrahim, K. (2018). Outliers detection for Pareto distributed data. In 2017 UKM FST Postgraduate Colloquium: Proceedings of the University Kebangsaan Malaysia, Faculty of Science and Technology 2017 Postgraduate Colloquium (Vol. 1940). [020119] American Institute of Physics Inc. https://doi.org/10.1063/1.5028034

@inproceedings{a58ffdcf3db24f6db447667e3479e0b9,
title = "Outliers detection for Pareto distributed data",
abstract = "This study examines the presence of outliers in the upper tail of the Malaysian income distribution under the assumption that the data follow a Pareto model. For this purpose, three types of boxplot are considered: the standard boxplot, the adjusted boxplot and the generalized boxplot. The performance of these boxplots is assessed by a simulation study in which data were simulated from the Pareto distribution P(1, α) with α = 2, 3, 4 and then contaminated by replacing a proportion ϵ (3{\%}, 5{\%}, 10{\%}) of randomly selected observations. The generalized boxplot is found to give higher power than the standard and adjusted boxplots. Therefore, the generalized boxplot was used to determine the presence of outliers in the upper tail of the income distribution, while the threshold for Pareto tail modelling was determined using Van Kerm's formula. The results showed that the generalized boxplot detected 0.4{\%}, 0.4{\%}, 0.9{\%} and 1.2{\%} outliers in the household income data exceeding the threshold for the years 2007, 2009, 2012 and 2014, respectively.",
author = "{Mohd Safari}, {M. A.} and Nurulkamal Masseran and Kamarulzaman Ibrahim",
year = "2018",
month = "4",
day = "4",
doi = "10.1063/1.5028034",
language = "English",
volume = "1940",
booktitle = "2017 UKM FST Postgraduate Colloquium",
publisher = "American Institute of Physics Inc.",

}

TY  - GEN
T1  - Outliers detection for Pareto distributed data
AU  - Mohd Safari, M. A.
AU  - Masseran, Nurulkamal
AU  - Ibrahim, Kamarulzaman
PY  - 2018/4/4
Y1  - 2018/4/4
AB  - This study examines the presence of outliers in the upper tail of the Malaysian income distribution under the assumption that the data follow a Pareto model. For this purpose, three types of boxplot are considered: the standard boxplot, the adjusted boxplot and the generalized boxplot. The performance of these boxplots is assessed by a simulation study in which data were simulated from the Pareto distribution P(1, α) with α = 2, 3, 4 and then contaminated by replacing a proportion ϵ (3%, 5%, 10%) of randomly selected observations. The generalized boxplot is found to give higher power than the standard and adjusted boxplots. Therefore, the generalized boxplot was used to determine the presence of outliers in the upper tail of the income distribution, while the threshold for Pareto tail modelling was determined using Van Kerm's formula. The results showed that the generalized boxplot detected 0.4%, 0.4%, 0.9% and 1.2% outliers in the household income data exceeding the threshold for the years 2007, 2009, 2012 and 2014, respectively.
UR  - http://www.scopus.com/inward/record.url?scp=85045671984&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85045671984&partnerID=8YFLogxK
U2  - 10.1063/1.5028034
DO  - 10.1063/1.5028034
M3  - Conference contribution
AN  - SCOPUS:85045671984
VL  - 1940
BT  - 2017 UKM FST Postgraduate Colloquium
PB  - American Institute of Physics Inc.
ER  -