Experiments in Malay information retrieval

Tengku Mohd T Sembok, Zainab Abu Bakar, Fatimah Ahmad

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    4 Citations (Scopus)

    Abstract

    Few studies have examined conflation algorithms for indexing and retrieving Malay documents. The two main classes of conflation algorithms are string-similarity algorithms and stemming algorithms. Stemming is used in information retrieval systems to reduce variant word forms to common roots in order to improve retrieval effectiveness. As in other languages, an effective stemming algorithm is needed for indexing and retrieving Malay documents. Likewise, little research has been done on n-gram string-similarity measures for Malay. We applied stemming and string-similarity matching to the retrieval of verses from the Al-Quran in order to evaluate their effectiveness. Before retrieval effectiveness can be evaluated, an experimental data set needs to be developed, comprising a collection of documents, a set of queries, and their relevance judgements. In this paper we describe the development of the experimental data set and the application of stemming and similarity-matching algorithms to retrieving verses from the Al-Quran. We discuss inherent characteristics of n-grams and several variations of the experiments performed on the queries and documents. The variations are: non-stemmed queries and non-stemmed documents; stemmed queries and non-stemmed documents; and stemmed queries and stemmed documents. Further experiments are then carried out by removing the most frequently occurring n-gram. The Dice coefficient is used both as a threshold and as a weight in ranking the retrieved documents. Besides Dice coefficients, inverse document frequency weights are also used to rank documents. Interpolation techniques and standard recall-precision functions are used to evaluate retrieval effectiveness.
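The two measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of unpadded character bigrams, the use of unique n-gram sets, the function names, and the sample Malay words are all assumptions made for illustration. The first part computes a Dice coefficient between two words from their n-gram sets, and the second computes interpolated precision at the eleven standard recall levels for one ranked result list.

```python
from typing import List, Sequence, Set


def ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of unique character n-grams of a word (no padding; an assumption)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) over the unique n-gram sets A and B."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # both words shorter than n: no n-grams to compare
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def interpolated_precision(ranking: Sequence[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query."""
    points = []  # (recall, precision) recorded at each relevant document in the ranking
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision observed at any recall >= the given level.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


if __name__ == "__main__":
    # Hypothetical query term and index terms, chosen only to illustrate the measure.
    query = "makan"
    terms = ["makanan", "minum", "makan"]
    threshold = 0.5  # illustrative cut-off, not a value taken from the paper
    scored = sorted(((dice(query, t), t) for t in terms), reverse=True)
    print([(t, round(s, 3)) for s, t in scored if s >= threshold])
    print(interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

Here the Dice score serves both as a filter (the threshold) and as the sort key, mirroring the dual role described in the abstract. An inverse-document-frequency weight could replace or scale the score at the sorting step.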

    Original language: English
    Title of host publication: Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011
    DOIs: https://doi.org/10.1109/ICEEI.2011.6021578
    Publication status: Published - 2011
    Event: 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011 - Bandung
    Duration: 17 Jul 2011 → 19 Jul 2011

    Keywords

    • Information Retrieval
    • n-gram
    • Natural Language Processing
    • Stemming

    ASJC Scopus subject areas

    • Information Systems
    • Electrical and Electronic Engineering

    Cite this

    Sembok, T. M. T., Bakar, Z. A., & Ahmad, F. (2011). Experiments in Malay information retrieval. In Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011 [6021578]. https://doi.org/10.1109/ICEEI.2011.6021578

    @inproceedings{16622303d2d14aa98e53a30c0ff87918,
    title = "Experiments in {M}alay information retrieval",
    keywords = "Information Retrieval, n-gram, Natural Language Processing, Stemming",
    author = "Sembok, {Tengku Mohd T} and Bakar, {Zainab Abu} and Ahmad, Fatimah",
    year = "2011",
    doi = "10.1109/ICEEI.2011.6021578",
    language = "English",
    isbn = "9781457707520",
    booktitle = "Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011",

    }

    TY - GEN

    T1 - Experiments in Malay information retrieval

    AU - Sembok, Tengku Mohd T

    AU - Bakar, Zainab Abu

    AU - Ahmad, Fatimah

    PY - 2011

    Y1 - 2011

    KW - Information Retrieval

    KW - n-gram

    KW - Natural Language Processing

    KW - Stemming

    UR - http://www.scopus.com/inward/record.url?scp=80054023181&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=80054023181&partnerID=8YFLogxK

    U2 - 10.1109/ICEEI.2011.6021578

    DO - 10.1109/ICEEI.2011.6021578

    M3 - Conference contribution

    AN - SCOPUS:80054023181

    SN - 9781457707520

    BT - Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, ICEEI 2011

    ER -