Abstract | The expansion of the World Wide Web has led to a chaotic state in which internet
users face the major problem of discovering information. To address this problem, many
mechanisms have been created based on crawlers that browse the WWW and download pages.
In this paper we describe the “advaRSS” crawling mechanism, which is intended to be the base
utility for systems that offer collections of news in real time to internet users. In contrast to
common crawling mechanisms, our system focuses on
fetching the latest news from the major and minor portals worldwide by utilizing their RSS
feeds. News is published at unpredictable times throughout the day, and thus the freshness of the
offline collection can be measured in minutes. This means that the system has to be
updated with news items as soon as they appear. To achieve this we utilize the
communication channels that exist in the modern architecture of the WWW, and more
specifically in the architecture of Web 2.0. Because RSS feeds are used by every major and
minor portal, it is possible to keep our crawler up to date and retain a high freshness of the
“offline content” that is maintained in our system’s database.
|