A Review on Building Bilingual Comparable Corpora for Resource-limited Languages

Nurul Amelina Nasharuddin, Muhamad Taufik Abdullah, Azreen Azman, Abdul Kadir Rabiah

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Information retrieval tasks for certain Asian languages suffer from limited knowledge resources, such as bilingual and multilingual dictionaries and corpora. Thus, there is a need to create multilingual resources for these languages. One way is to automatically align documents by estimating the likelihood that two documents, not necessarily in the same language, are related to each other. Multilingual corpora can then be developed automatically from these aligned documents. Numerous approaches to document alignment have been developed to date. In this paper, we give an overview of recent progress in bilingual and multilingual document alignment within the last 5 years. In addition, we discuss current progress in developing bilingual comparable corpora, especially for the Malay language, which is one of the resource-limited languages in Asia.
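
The abstract does not commit to a particular alignment method, so the following is only a minimal sketch of one common idea: map the words of a Malay document into English through a bilingual dictionary, then score the pair with cosine similarity over bag-of-words counts. The dictionary `MS_TO_EN`, the function names, and the `threshold` value are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy Malay-to-English dictionary, for illustration only.
MS_TO_EN = {
    "kucing": "cat",
    "makan": "eat",
    "ikan": "fish",
    "kereta": "car",
    "laju": "fast",
}


def translate_tokens(text, dictionary):
    """Lowercase, split on whitespace, and map tokens found in the dictionary."""
    return [dictionary[t] for t in text.lower().split() if t in dictionary]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relatedness(ms_doc, en_doc):
    """Score how related a Malay document is to an English document."""
    ms_vec = Counter(translate_tokens(ms_doc, MS_TO_EN))
    en_vec = Counter(en_doc.lower().split())
    return cosine(ms_vec, en_vec)


def align(ms_docs, en_docs, threshold=0.3):
    """Pair each Malay document with its best-scoring English document.

    Pairs scoring below the (assumed) threshold are dropped.
    Returns a list of (malay_index, english_index, score) tuples.
    """
    pairs = []
    for i, ms in enumerate(ms_docs):
        scores = [relatedness(ms, en) for en in en_docs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    ms_docs = ["kucing makan ikan", "kereta laju"]
    en_docs = ["the car is fast", "the cat will eat fish"]
    for i, j, score in align(ms_docs, en_docs):
        print(f"Malay doc {i} <-> English doc {j} (score {score:.2f})")
```

A realistic pipeline would add tokenisation, stop-word removal, and term weighting, and would need a far larger dictionary; the sketch only shows how a relatedness score can drive the pairing that produces a comparable corpus.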

Original language: English
Title of host publication: Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management
Subtitle of host publication: Diving into Data Sciences, CAMP 2018
Editors: Shyamala Doraisamy, Azreen Azman, Dayang Nurfatimah Awg Iskandar, Muthukkaruppan Annamalai, Stefan Ruger, Fakhrul Hazman Yusoff, Nurazzah Abd. Rahman, Alistair Moffat, Shahrul Azman Mohd Noah
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 113-118
Number of pages: 6
ISBN (Print): 9781538638125
DOIs: https://doi.org/10.1109/INFRKM.2018.8464798
Publication status: Published - 13 Sep 2018
Event: 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018 - Kota Kinabalu, Sabah, Malaysia
Duration: 26 Mar 2018 → 28 Mar 2018

Other

Other: 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018
Country: Malaysia
City: Kota Kinabalu, Sabah
Period: 26/3/18 → 28/3/18

Keywords

  • comparable corpus
  • cross-lingual information retrieval
  • document alignment
  • Malay language

ASJC Scopus subject areas

  • Library and Information Sciences
  • Artificial Intelligence
  • Information Systems
  • Decision Sciences (miscellaneous)
  • Information Systems and Management

Cite this

Nasharuddin, N. A., Abdullah, M. T., Azman, A., & Rabiah, A. K. (2018). A Review on Building Bilingual Comparable Corpora for Resource-limited Languages. In S. Doraisamy, A. Azman, D. N. A. Iskandar, M. Annamalai, S. Ruger, F. H. Yusoff, N. Abd. Rahman, A. Moffat, ... S. A. M. Noah (Eds.), Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management: Diving into Data Sciences, CAMP 2018 (pp. 113-118). [8464798] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/INFRKM.2018.8464798

@inproceedings{96e49be4148c459ab4b0d0e32f9702ae,
title = "A Review on Building Bilingual Comparable Corpora for Resource-limited Languages",
abstract = "Information retrieval tasks for certain Asian languages suffer from limited knowledge resources, such as bilingual and multilingual dictionaries and corpora. Thus, there is a need to create multilingual resources for these languages. One way is to automatically align documents by estimating the likelihood that two documents, not necessarily in the same language, are related to each other. Multilingual corpora can then be developed automatically from these aligned documents. Numerous approaches to document alignment have been developed to date. In this paper, we give an overview of recent progress in bilingual and multilingual document alignment within the last 5 years. In addition, we discuss current progress in developing bilingual comparable corpora, especially for the Malay language, which is one of the resource-limited languages in Asia.",
keywords = "comparable corpus, cross-lingual information retrieval, document alignment, Malay language",
author = "Nasharuddin, {Nurul Amelina} and Abdullah, {Muhamad Taufik} and Azreen Azman and Rabiah, {Abdul Kadir}",
year = "2018",
month = "9",
day = "13",
doi = "10.1109/INFRKM.2018.8464798",
language = "English",
isbn = "9781538638125",
pages = "113--118",
editor = "Shyamala Doraisamy and Azreen Azman and Iskandar, {Dayang Nurfatimah Awg} and Muthukkaruppan Annamalai and Stefan Ruger and Yusoff, {Fakhrul Hazman} and {Abd. Rahman}, Nurazzah and Alistair Moffat and Noah, {Shahrul Azman Mohd}",
booktitle = "Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - A Review on Building Bilingual Comparable Corpora for Resource-limited Languages

AU - Nasharuddin, Nurul Amelina

AU - Abdullah, Muhamad Taufik

AU - Azman, Azreen

AU - Rabiah, Abdul Kadir

PY - 2018/9/13

Y1 - 2018/9/13

AB - Information retrieval tasks for certain Asian languages suffer from limited knowledge resources, such as bilingual and multilingual dictionaries and corpora. Thus, there is a need to create multilingual resources for these languages. One way is to automatically align documents by estimating the likelihood that two documents, not necessarily in the same language, are related to each other. Multilingual corpora can then be developed automatically from these aligned documents. Numerous approaches to document alignment have been developed to date. In this paper, we give an overview of recent progress in bilingual and multilingual document alignment within the last 5 years. In addition, we discuss current progress in developing bilingual comparable corpora, especially for the Malay language, which is one of the resource-limited languages in Asia.

KW - comparable corpus

KW - cross-lingual information retrieval

KW - document alignment

KW - Malay language

UR - http://www.scopus.com/inward/record.url?scp=85054367852&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054367852&partnerID=8YFLogxK

U2 - 10.1109/INFRKM.2018.8464798

DO - 10.1109/INFRKM.2018.8464798

M3 - Conference contribution

SN - 9781538638125

SP - 113

EP - 118

BT - Proceedings - 2018 4th International Conference on Information Retrieval and Knowledge Management

A2 - Doraisamy, Shyamala

A2 - Azman, Azreen

A2 - Iskandar, Dayang Nurfatimah Awg

A2 - Annamalai, Muthukkaruppan

A2 - Ruger, Stefan

A2 - Yusoff, Fakhrul Hazman

A2 - Abd. Rahman, Nurazzah

A2 - Moffat, Alistair

A2 - Noah, Shahrul Azman Mohd

PB - Institute of Electrical and Electronics Engineers Inc.

ER -