Open Web Crawling is a distributed web crawling mechanism carried out jointly by multiple legal entities to build an integrated Open Web Repository of World Wide Web content. The main idea of Open Web Crawling is to let internet users, internet companies, webmasters, website publishers and text analysis service providers cooperate in aggregating content for the Open Web Repository through repeated crawling of public web pages.
The World Wide Web is huge today and continues to grow day by day, so it has become impossible for a single party to crawl the whole WWW alone. Open Web Crawling offers an approach that uses synchronization protocols to make web crawling decentralized, parallelized and intelligent.
Open Web Crawling is carried out through a client/server system in which the Crawling Manager acts as controller and dispatcher for a large number of open-source web crawlers installed on volunteers’ computers (see figure below).

The Crawling Manager selects web crawlers by member-ID and crawler-ID and sends each one a list of hosts to crawl. Host lists can be periodically reshuffled to prevent possible spoofing of the results.
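The dispatch step described above can be illustrated with a minimal sketch. It assumes that a crawler is identified by a (member-ID, crawler-ID) pair and that hosts are dealt out round-robin after a shuffle. The function name `assign_hosts`, the data shapes and the round-robin policy are hypothetical choices made for illustration and are not taken from the OMFICA specification.

```python
import random


def assign_hosts(hosts, crawlers, rng):
    """Shuffle the host list and deal it round-robin to the crawlers.

    hosts    -- iterable of host names to be crawled
    crawlers -- list of (member_id, crawler_id) tuples (assumed identity format)
    rng      -- a random.Random instance, so that a run can be reproduced
    Returns a dict mapping each crawler to its list of hosts.
    """
    if not crawlers:
        raise ValueError("at least one crawler is required")
    shuffled = list(hosts)
    rng.shuffle(shuffled)  # reshuffle so no crawler can predict its hosts
    assignment = {crawler: [] for crawler in crawlers}
    for index, host in enumerate(shuffled):
        crawler = crawlers[index % len(crawlers)]
        assignment[crawler].append(host)
    return assignment


# Example: three hypothetical crawlers share seven hosts.
example_crawlers = [("member-1", "crawler-a"), ("member-1", "crawler-b"), ("member-2", "crawler-a")]
example_hosts = [f"host{n}.example" for n in range(7)]
example_assignment = assign_hosts(example_hosts, example_crawlers, random.Random())
```

Calling `assign_hosts` again at the start of each crawling round produces a fresh distribution, which is one plausible way to implement the periodic reshuffling.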
The OMFICA Crawler uses only the computer’s free CPU resources. It visits websites sequentially and parses their content based on Website Parse Templates. The Crawling Manager double-checks the crawling results and then stores them in the Open Web Repository.
Based on the double-check results, the Crawling Manager generates a blacklist of volunteers who try to cheat the system.
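The text does not say how the double check works. One possible reading, sketched below purely as an assumption, is that the same host is crawled by several volunteers, the results are compared by content digest, and a volunteer who repeatedly disagrees with the majority is blacklisted. The names `find_dissenters` and `update_blacklist` and the strike threshold of 3 are hypothetical. A real system would also have to tolerate pages that legitimately change between two fetches.

```python
import hashlib
from collections import Counter


def digest(content):
    """Return a SHA-256 hex digest of the crawled content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_dissenters(reports):
    """Return member IDs whose result differs from the strict majority.

    reports -- dict mapping member_id to the content reported for ONE host.
    If no digest is reported by more than half of the members, nobody is
    flagged, because the check cannot tell who is right.
    """
    if not reports:
        return set()
    digests = {member: digest(content) for member, content in reports.items()}
    top_digest, top_count = Counter(digests.values()).most_common(1)[0]
    if top_count * 2 <= len(reports):
        return set()
    return {member for member, d in digests.items() if d != top_digest}


def update_blacklist(strikes, dissenters, threshold=3):
    """Add one strike per dissenter and return the members at or over threshold.

    strikes is mutated in place; threshold=3 is an arbitrary illustrative value.
    """
    for member in dissenters:
        strikes[member] = strikes.get(member, 0) + 1
    return {member for member, count in strikes.items() if count >= threshold}
```

Requiring a strict majority keeps a two-way disagreement from penalizing either volunteer, at the cost of needing at least three reports per host before anyone can be flagged.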
The Open Web Repository is scanned periodically, and the stored data is written to archive files categorized as hourly, daily, weekly and monthly updates. The folders holding those files have access permissions that restrict them to eligible companies and individuals, the OMFICA Beneficiaries.
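As a rough illustration of the hourly, daily, weekly and monthly categorization, the sketch below maps the timestamp of a stored record to one archive key per period. The folder names and key formats are hypothetical, and weeks are numbered by the ISO 8601 calendar as an assumption. The actual repository layout is not described in the text.

```python
from datetime import datetime


def archive_keys(timestamp):
    """Return the archive key for each update period covering the timestamp.

    The "hourly/", "daily/", "weekly/" and "monthly/" prefixes are
    illustrative folder names, not the real OMFICA layout.
    """
    iso_year, iso_week, _ = timestamp.isocalendar()  # ISO 8601 week numbering
    return {
        "hourly": timestamp.strftime("hourly/%Y-%m-%dT%H"),
        "daily": timestamp.strftime("daily/%Y-%m-%d"),
        "weekly": f"weekly/{iso_year}-W{iso_week:02d}",
        "monthly": timestamp.strftime("monthly/%Y-%m"),
    }


# Example: a record stored on 5 March 2024 at 14:30.
example_keys = archive_keys(datetime(2024, 3, 5, 14, 30))
# example_keys["weekly"] is "weekly/2024-W10"
```

Access permissions for beneficiaries could then be applied per folder prefix, since every record of a given period shares the same key.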