Data categorization and model weighting approach for language model adaptation in statistical machine translation

Mohammed AbuHamad, Masnizah Mohd

Research output: Contribution to journal (Article)

Abstract

A language model encapsulates semantic, syntactic, and pragmatic information about a specific task. Intelligent systems, especially natural language processing systems, can show different results in terms of performance and precision when moving among genres and domains. Researchers have therefore explored different language model adaptation strategies to overcome this effectiveness issue. Language model adaptation techniques fall into two main categories. The first category comprises techniques based on data selection, where a task-oriented corpus is extracted and used to train models for specific translations. The second category focuses on developing a weighting criterion to assign the test data to a specific model corpus. The purpose of this research is to introduce a language model adaptation approach that combines both categories (data selection and weighting criterion). This approach applies data selection for task-specific translations by dividing the corpus into smaller, topic-related corpora using a clustering process. We investigate the effect of different approaches to clustering the bilingual data on the language model adaptation process in terms of translation quality, using the WMT07 Europarl corpus, which includes bilingual data for English-Spanish, English-German, and English-French. A mixture of language models should assign any given data to the right language model to be used in the translation process using a specific weighting criterion. The proposed language model adaptation achieved better translation quality compared to the baseline model in Statistical Machine Translation (SMT).
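
The abstract does not specify the clustering method or the weighting criterion, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the clustering step has already produced topic clusters (the two toy clusters and their sentences are invented for illustration), trains one add-one-smoothed unigram model per cluster, and uses the normalized likelihood of an input sentence under each model as a hypothetical weighting criterion. The function names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def log_likelihood(model, sentence):
    """Log-probability of a sentence; out-of-vocabulary tokens are skipped."""
    return sum(math.log(model[t]) for t in sentence.split() if t in model)


def mixture_weights(models, sentence):
    """Assumed weighting criterion: posterior of each cluster model
    given the sentence, with a uniform prior over clusters."""
    lls = [log_likelihood(m, sentence) for m in models]
    top = max(lls)  # subtract the maximum for numerical stability
    exps = [math.exp(ll - top) for ll in lls]
    z = sum(exps)
    return [e / z for e in exps]


def mixture_prob(models, weights, word):
    """Word probability under the weighted mixture of cluster models."""
    return sum(w * m[word] for w, m in zip(weights, models))


# Hypothetical output of a clustering step (toy data, not from Europarl).
clusters = {
    "budget": ["the budget vote passed", "parliament approved the budget"],
    "fishing": ["the fishing quota was cut", "fishing fleets need the quota"],
}
vocab = sorted({tok for sents in clusters.values() for s in sents for tok in s.split()})
names = list(clusters)
models = [train_unigram(clusters[n], vocab) for n in names]

# Weight the cluster models for one input sentence.
weights = mixture_weights(models, "the budget was approved")
for name, weight in zip(names, weights):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the sentence shares more vocabulary with the first cluster, so that cluster's model receives the larger weight. A real system would use higher-order n-gram models and a clustering method chosen for the bilingual data.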

Original language: English
Pages (from-to): 135-141
Number of pages: 7
Journal: International Journal of Advanced Computer Science and Applications
Volume: 10
Issue number: 1
DOIs: 10.14569/IJACSA.2019.0100117
Publication status: Published - 1 Jan 2019

Fingerprint

Natural language processing systems
Intelligent systems
Syntactics
Semantics

Keywords

  • Clustering
  • Language model adaptation
  • Statistical machine translation

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

@article{f58bd6fc748e478babd34682182a2eae,
title = "Data categorization and model weighting approach for language model adaptation in statistical machine translation",
abstract = "A language model encapsulates semantic, syntactic, and pragmatic information about a specific task. Intelligent systems, especially natural language processing systems, can show different results in terms of performance and precision when moving among genres and domains. Researchers have therefore explored different language model adaptation strategies to overcome this effectiveness issue. Language model adaptation techniques fall into two main categories. The first category comprises techniques based on data selection, where a task-oriented corpus is extracted and used to train models for specific translations. The second category focuses on developing a weighting criterion to assign the test data to a specific model corpus. The purpose of this research is to introduce a language model adaptation approach that combines both categories (data selection and weighting criterion). This approach applies data selection for task-specific translations by dividing the corpus into smaller, topic-related corpora using a clustering process. We investigate the effect of different approaches to clustering the bilingual data on the language model adaptation process in terms of translation quality, using the WMT07 Europarl corpus, which includes bilingual data for English-Spanish, English-German, and English-French. A mixture of language models should assign any given data to the right language model to be used in the translation process using a specific weighting criterion. The proposed language model adaptation achieved better translation quality compared to the baseline model in Statistical Machine Translation (SMT).",
keywords = "Clustering, Language model adaptation, Statistical machine translation",
author = "Mohammed AbuHamad and Masnizah Mohd",
year = "2019",
month = "1",
day = "1",
doi = "10.14569/IJACSA.2019.0100117",
language = "English",
volume = "10",
pages = "135--141",
journal = "International Journal of Advanced Computer Science and Applications",
issn = "2158-107X",
publisher = "Science and Information Organization",
number = "1",

}

TY - JOUR

T1 - Data categorization and model weighting approach for language model adaptation in statistical machine translation

AU - AbuHamad, Mohammed

AU - Mohd, Masnizah

PY - 2019/1/1

Y1 - 2019/1/1

N2 - A language model encapsulates semantic, syntactic, and pragmatic information about a specific task. Intelligent systems, especially natural language processing systems, can show different results in terms of performance and precision when moving among genres and domains. Researchers have therefore explored different language model adaptation strategies to overcome this effectiveness issue. Language model adaptation techniques fall into two main categories. The first category comprises techniques based on data selection, where a task-oriented corpus is extracted and used to train models for specific translations. The second category focuses on developing a weighting criterion to assign the test data to a specific model corpus. The purpose of this research is to introduce a language model adaptation approach that combines both categories (data selection and weighting criterion). This approach applies data selection for task-specific translations by dividing the corpus into smaller, topic-related corpora using a clustering process. We investigate the effect of different approaches to clustering the bilingual data on the language model adaptation process in terms of translation quality, using the WMT07 Europarl corpus, which includes bilingual data for English-Spanish, English-German, and English-French. A mixture of language models should assign any given data to the right language model to be used in the translation process using a specific weighting criterion. The proposed language model adaptation achieved better translation quality compared to the baseline model in Statistical Machine Translation (SMT).

KW - Clustering

KW - Language model adaptation

KW - Statistical machine translation

UR - http://www.scopus.com/inward/record.url?scp=85062991137&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062991137&partnerID=8YFLogxK

U2 - 10.14569/IJACSA.2019.0100117

DO - 10.14569/IJACSA.2019.0100117

M3 - Article

VL - 10

SP - 135

EP - 141

JO - International Journal of Advanced Computer Science and Applications

JF - International Journal of Advanced Computer Science and Applications

SN - 2158-107X

IS - 1

ER -