Single channel speech enhancement using ideal binary mask technique based on computational auditory scene analysis

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Single-channel speech enhancement is needed where multichannel speech enhancement is not feasible because of space constraints in the intended device or cost considerations. However, the limited information available from a single-channel mixture, and the unavailability of the clean speech source signals, make it more difficult to separate the target speech from background maskers when the signal-to-noise ratio is low, when the background noise varies, and when the speech signals are short in duration. To address these problems, computational auditory scene analysis has become popular over the last decade as a new approach to speech enhancement. In this paper, the ideal binary mask, a technique inspired by computational auditory scene analysis, is used to analyze and synthesize the input speech signals and masker signals in the time-frequency domain, where the signals usually overlap. The synthesized signals are evaluated for speech quality in terms of segmental signal-to-noise ratio. This study uses Malay-language speech as the input speech signals; these signals vary in duration because of their word structure. Large-crowd babble and two-talker competing speech are employed as masker signals. The input signal-to-noise ratio is varied from -5 dB to +15 dB in steps of 5 dB to vary the difficulty of the acoustic environment. Results show that the ideal binary mask algorithm efficiently reconstructs the target speech signals from the degraded, noisy speech signals, as indicated by high segmental signal-to-noise ratios even at the lowest input signal-to-noise ratio. Noise reduction of this degree is needed to lessen the listening effort of elderly listeners in noisy environments.
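The abstract names three computational steps: mixing a target with a masker at a set input SNR, computing the ideal binary mask (IBM) in the time-frequency (T-F) domain and resynthesizing the masked mixture, and scoring the result by segmental SNR. The Python sketch that follows illustrates those steps under stated assumptions; it is not the authors' implementation. By definition the IBM is computed from the premixed target and masker, so it is an oracle mask rather than one estimated from the mixture alone. Assumptions not taken from the paper: an STFT front end (16 kHz, 512-point periodic Hann window, 50% overlap) stands in for the T-F analysis, since CASA systems often use a gammatone filterbank and the abstract does not say which was used; a local criterion of 0 dB; SegSNR on 256-sample frames, each frame clamped to [-10, 35] dB (a common convention); and amplitude-modulated tones plus white noise in place of Malay speech and babble. All function names are hypothetical.

```python
import numpy as np

# Hypothetical analysis settings (assumed, not from the paper):
# 16 kHz audio, 512-point periodic Hann window, 50% overlap.
FS = 16000
N_FFT = 512
HOP = 256
WIN = np.hanning(N_FFT + 1)[:-1]


def stft(x):
    """Frame x with the Hann window; return a (frames, bins) complex spectrum."""
    x = np.pad(x, (N_FFT, N_FFT))  # zero-pad so edge samples are fully covered
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(X, length):
    """Weighted overlap-add inverse of stft(); exact for an unmodified spectrum."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WIN
    out = np.zeros((len(frames) - 1) * HOP + N_FFT)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return (out / np.maximum(norm, 1e-12))[N_FFT:N_FFT + length]


def mix_at_snr(target, masker, snr_db):
    """Scale the masker so 10*log10(P_target / P_masker) equals snr_db."""
    p_t = np.mean(target ** 2)
    p_m = np.mean(masker ** 2)
    scaled = masker * np.sqrt(p_t / (p_m * 10 ** (snr_db / 10)))
    return target + scaled, scaled


def ideal_binary_mask(S, N, lc_db=0.0):
    """1 where the local (per T-F unit) SNR exceeds lc_db, else 0 (assumed LC = 0 dB)."""
    eps = 1e-12
    local_snr_db = 10 * np.log10((np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)


def seg_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB, each frame clamped to [lo, hi] (assumed limits)."""
    eps = 1e-12
    vals = []
    for i in range(len(clean) // frame_len):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - processed[i * frame_len:(i + 1) * frame_len]
        snr = 10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        vals.append(np.clip(snr, lo, hi))
    return float(np.mean(vals))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(FS) / FS  # one second of signal
    # Stand-in target: amplitude-modulated tones (NOT Malay speech as in the study).
    target = (1 + 0.5 * np.sin(2 * np.pi * 3 * t)) * sum(
        np.sin(2 * np.pi * f * t) for f in (300, 800, 1500))
    masker = rng.standard_normal(len(t))  # stand-in for babble / competing talkers
    for snr_db in range(-5, 20, 5):  # -5 dB to +15 dB in 5 dB steps
        mixture, scaled = mix_at_snr(target, masker, snr_db)
        mask = ideal_binary_mask(stft(target), stft(scaled))
        enhanced = istft(mask * stft(mixture), len(mixture))
        print(f"{snr_db:+3d} dB input: SegSNR mixture {seg_snr(target, mixture):6.2f} dB,"
              f" IBM {seg_snr(target, enhanced):6.2f} dB")
```

Running the script prints the SegSNR of the unprocessed mixture and of the IBM-processed signal for each input SNR. Because the mask keeps a T-F unit only where the target dominates the masker, most noise-dominated units are discarded; this is also why the IBM is often used as a benchmark or training target for CASA systems rather than as a deployable estimator.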

Original language: English
Pages (from-to): 12-22
Number of pages: 11
Journal: Journal of Theoretical and Applied Information Technology
Volume: 91
Issue number: 1
Publication status: Published - 15 Sep 2016

Fingerprint

Scene Analysis
Speech Enhancement
Computational Analysis
Mask
Binary
Speech Signal
Signal to noise ratio
Acoustics
Vary
Target
Necessary
Cost-effectiveness
Noise Reduction
Noise abatement
Frequency Domain
Speech
Time Domain

Keywords

  • Computational auditory scene analysis
  • Ideal binary mask
  • Speech enhancement
  • Speech quality
  • Time-frequency masking

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

@article{403c9e27c60d4acaa65c7720b40b949e,
title = "Single channel speech enhancement using ideal binary mask technique based on computational auditory scene analysis",
keywords = "Computational auditory scene analysis, Ideal binary mask, Speech enhancement, Speech quality, Time-frequency masking",
author = "Abrar Hussain and Kalaivani Chell and Mukari, {Siti Zamratol Mai Sarah}",
year = "2016",
month = "9",
day = "15",
language = "English",
volume = "91",
pages = "12--22",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "1",

}

TY  - JOUR
T1  - Single channel speech enhancement using ideal binary mask technique based on computational auditory scene analysis
AU  - Hussain, Abrar
AU  - Chell, Kalaivani
AU  - Mukari, Siti Zamratol Mai Sarah
PY  - 2016/9/15
Y1  - 2016/9/15
KW  - Computational auditory scene analysis
KW  - Ideal binary mask
KW  - Speech enhancement
KW  - Speech quality
KW  - Time-frequency masking
UR  - http://www.scopus.com/inward/record.url?scp=84987817711&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84987817711&partnerID=8YFLogxK
M3  - Article
VL  - 91
SP  - 12
EP  - 22
JO  - Journal of Theoretical and Applied Information Technology
JF  - Journal of Theoretical and Applied Information Technology
SN  - 1992-8645
IS  - 1
ER  - 