Web crawler and back-end for news aggregator system (Noox project)

Raymond Bahana, Rahadian Adinugroho, Ford Lumban Gaol, Agung Trisetyarso, Bahtiar Saleh Abbas, Wayan Suparta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The aim of this paper is to develop a web crawler, Content Management System (CMS), and Application Programming Interface (API) for a news aggregator system dubbed the Noox project. A news aggregator requires a crawler to populate its content; however, different websites may have different page layouts. This project aims to create a back-end for Noox and an open-source, scalable web crawler capable of extracting data from different page layouts. The scope of the project includes the web crawler, the CMS, the API, a scheduler for the web crawler, and the implementation of a socket server. The development process uses PHP as the primary back-end language, with Laravel as the web application framework. The PHP back-end hosts the CMS and the API. The API implements Representational State Transfer (REST) as its architectural style, uses JavaScript Object Notation (JSON) to communicate with clients, and implements JSON Web Token (JWT) for client authentication. The web crawler is developed in Python and uses BeautifulSoup as its web extraction utility. The crawler can be adapted to extract data from different page layouts through a user-created configuration file, and it can be configured to export the extracted data to a JSON file or a database system. The result of this project satisfies the requirements of the Noox project.
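The abstract describes a crawler that adapts to different page layouts through a user-created configuration file and extracts content with BeautifulSoup. A minimal sketch of that idea, assuming the configuration maps field names to CSS selectors (the names `SITE_CONFIG` and `extract_article` and the sample HTML are illustrative, not taken from the paper):

```python
# Sketch of config-driven extraction: one per-site configuration maps field
# names to CSS selectors, so the same crawler code handles different layouts.
from bs4 import BeautifulSoup

# Hypothetical per-site configuration (in the paper this would come from a
# user-created configuration file).
SITE_CONFIG = {
    "title": "h1.article-title",
    "body": "div.article-content",
}

# Stand-in for a fetched news page.
HTML = """
<html><body>
  <h1 class="article-title">Sample headline</h1>
  <div class="article-content">Sample body text.</div>
</body></html>
"""

def extract_article(html, config):
    """Apply each configured selector and collect the matched text."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in config.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node else None
    return record

print(extract_article(HTML, SITE_CONFIG))
```

Because the extraction logic never hard-codes a layout, supporting a new news site would only require writing a new selector configuration; the resulting dictionary can then be serialized to JSON or inserted into a database, matching the export options the abstract mentions.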

Original language: English
Title of host publication: 2017 IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCOM 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 56-61
Number of pages: 6
Volume: 2017-November
ISBN (Electronic): 9781538607831
DOI: 10.1109/CYBERNETICSCOM.2017.8311684
Publication status: Published - 9 Mar 2018
Externally published: Yes
Event: 2017 IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCOM 2017 - Phuket, Thailand
Duration: 20 Nov 2017 - 22 Nov 2017



Keywords

  • Application Programming Interface
  • CMS
  • HTML Extraction
  • Web Crawler
  • Web Scraper

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Bahana, R., Adinugroho, R., Gaol, F. L., Trisetyarso, A., Abbas, B. S., & Suparta, W. (2018). Web crawler and back-end for news aggregator system (Noox project). In 2017 IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCOM 2017 - Proceedings (Vol. 2017-November, pp. 56-61). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CYBERNETICSCOM.2017.8311684
