Experimental approach based on ensemble and frequent itemset mining for image spam filtering

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Excessive amounts of image spam cause many problems for e-mail users. Since image spam is difficult to detect using conventional text-based spam filtering approaches, various image processing techniques have been proposed. In this paper, we present an ensemble method that uses frequent itemset mining (FIM) for filtering image spam. Although FIM techniques are well established in data mining, they are not commonly used in ensemble methods. To obtain good filtering performance, we use the SIFT descriptor, which is widely known to be an effective image descriptor. K-means clustering is applied to the SIFT keypoints to produce a visual codebook. The bag-of-words (BOW) feature vector for each image is generated using a hard bag-of-features (HBOF) approach. FIM descriptors are obtained from the frequent itemsets of the BOW feature vectors. We combine BOW and FIM with three feature selection methods, namely Information Gain (IG), Symmetrical Uncertainty (SU) and Chi-Square (CS), together with a Spatial Pyramid, in an ensemble method. We performed experiments on the Dredze and SpamArchive datasets. The results show that our ensemble using frequent itemset mining significantly outperforms both the traditional BOW approach and a naive approach that combines all descriptors directly in a single very large input vector.
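
The abstract does not give implementation details, so the following is only a minimal sketch of two steps it names: hard bag-of-features assignment against a codebook, and deriving FIM descriptors from frequent itemsets of the BOW vectors. Everything concrete here is an assumption made for illustration: the function names, the toy 2-D vectors standing in for SIFT descriptors, the rule that a codeword counts as an "item" when it occurs at least once, the brute-force itemset search, and the support threshold. It is not the authors' implementation.

```python
from itertools import combinations


def hard_bof(descriptors, codebook):
    """Hard assignment: each descriptor votes for its single nearest codeword."""
    hist = [0] * len(codebook)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist


def to_transaction(hist):
    """Assumed binarization: a codeword is an item if it occurs at least once."""
    return frozenset(i for i, count in enumerate(hist) if count > 0)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets found in at least min_support transactions."""
    items = sorted(set().union(*transactions))
    result = []
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if t.issuperset(cand))
            if support >= min_support:
                result.append(cand)
    return result


def fim_descriptor(transaction, itemsets):
    """Binary vector: 1 where the image contains every codeword of an itemset."""
    return [1 if transaction.issuperset(s) else 0 for s in itemsets]


# Hypothetical codebook (in practice, k-means centroids of SIFT descriptors).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Three toy "images", each a list of 2-D stand-in descriptors.
images = [
    [(0.1, 0.0), (0.9, 1.1)],
    [(0.0, 0.2), (1.2, 0.9), (5.1, 4.8)],
    [(4.9, 5.2)],
]

hists = [hard_bof(img, codebook) for img in images]
transactions = [to_transaction(h) for h in hists]
itemsets = frequent_itemsets(transactions, min_support=2)
fim_vectors = [fim_descriptor(t, itemsets) for t in transactions]
```

In this toy run, codewords 0 and 1 co-occur in two of the three images, so the pair (0, 1) becomes a frequent itemset and contributes one dimension to each image's FIM vector. A real pipeline would likely replace the brute-force search with Apriori or FP-growth, since exhaustive enumeration does not scale to large codebooks.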

Original language: English
Pages (from-to): 121-126
Number of pages: 6
Journal: Journal of Telecommunication, Electronic and Computer Engineering
Volume: 10
Issue number: 1-5
Publication status: Published - 1 Jan 2018

Fingerprint

Data mining
Feature extraction
Image processing
Experiments
Uncertainty

Keywords

  • Ensemble Methods
  • Frequent Itemset Mining
  • Image Spam
  • SVM

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
  • Electrical and Electronic Engineering

Cite this

@article{bc7ea67f808644d1af729d1d52675f97,
title = "Experimental approach based on ensemble and frequent itemset mining for image spam filtering",
keywords = "Ensemble Methods, Frequent Itemset Mining, Image Spam, SVM",
author = "{Mat Ariff}, {Nor Azman} and Azizi Abdullah and Nasrudin, {Mohammad Faidzul}",
year = "2018",
month = "1",
day = "1",
language = "English",
volume = "10",
pages = "121--126",
journal = "Journal of Telecommunication, Electronic and Computer Engineering",
issn = "2180-1843",
publisher = "Universiti Teknikal Malaysia Melaka",
number = "1-5",

}

TY - JOUR

T1 - Experimental approach based on ensemble and frequent itemset mining for image spam filtering

AU - Mat Ariff, Nor Azman

AU - Abdullah, Azizi

AU - Nasrudin, Mohammad Faidzul

PY - 2018/1/1

Y1 - 2018/1/1

KW - Ensemble Methods

KW - Frequent Itemset Mining

KW - Image Spam

KW - SVM

UR - http://www.scopus.com/inward/record.url?scp=85045194376&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045194376&partnerID=8YFLogxK

M3 - Article

VL - 10

SP - 121

EP - 126

JO - Journal of Telecommunication, Electronic and Computer Engineering

JF - Journal of Telecommunication, Electronic and Computer Engineering

SN - 2180-1843

IS - 1-5

ER -