
Website Parse Template

Website Parse Template (WPT) is an XML-based open format that describes the HTML structure of a website's pages. The WPT format allows web crawlers to generate Semantic Web RDF descriptions for web pages.

A Website Parse Template consists of the following sections:

  • Ontology, where the publisher defines the concepts and relations used in the website.
  • Templates, where the publisher provides templates for groups of web pages that are similar in content category and structure. The publisher gives the XPath or TagID of each HTML element and links it to a concept from the website Ontology.
  • URLs, where the publisher provides the URL patterns that select a group of web pages and link them to a parse template. In the URLs section the publisher can also mark part of a URL as a concept and link it to the website Ontology.

A website parse template begins with an opening <icdl> tag and ends with a closing </icdl> tag. Each website parse template refers to a single host, while a single host may have several website parse templates describing its HTML structure. The host must be specified in the opening <icdl> tag:

<icdl host="http://www.anyhost.com">
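As an illustration of how a crawler might begin processing a template, the following minimal sketch reads the host and collects the three kinds of section with Python's standard xml.etree.ElementTree module. The skeleton document is assembled from fragments shown on this page, the function name load_wpt is hypothetical, and the sketch assumes that the sections are direct children of <icdl>.

```python
import xml.etree.ElementTree as ET

# Skeleton assembled from fragments on this page (an assumption, not a
# complete real-world template).
WPT_SKELETON = """
<icdl host="http://www.anyhost.com">
  <ontology name="general" language="icdl:ontology"/>
  <template name="Artist Bio" language="icdl:template"/>
  <urls name="Artist biography" template="Artist Bio"/>
</icdl>
"""


def load_wpt(xml_text):
    """Parse a WPT document and return its host and section elements."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "icdl":
        raise ValueError("a website parse template must be enclosed in <icdl>")
    host = root.get("host")
    if not host:
        raise ValueError("the <icdl> tag must specify a host")
    # Assumption: sections are direct children of <icdl>.
    return {
        "host": host,
        "ontologies": root.findall("ontology"),
        "templates": root.findall("template"),
        "urls": root.findall("urls"),
    }


wpt = load_wpt(WPT_SKELETON)
print(wpt["host"], len(wpt["templates"]))
```

Rejecting a document without a host mirrors the requirement stated above; a real crawler may prefer to log the problem and skip the template instead.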

A detailed description of the website parse template, based on the example of the Yahoo! Music website, is provided below.

Example 1. Website Parse Template (visual representation)

Ontology

The Ontology section enumerates and defines all concepts used in the website. This section must be enclosed within <ontology> and </ontology> tags. The <ontology> tag must specify the ontology name and the language used to define the concepts. Any string can be chosen as the ontology name, but the language must be a specific language type, e.g. "icdl:ontology", "owl", "unl:uws".

Example 2. Ontology section
<ontology name="general" language="icdl:ontology">
  <concept name="artist">
    <inherit concept="person"></inherit>
    <has object="name"></has>
    <has object="track"></has>
    <has object="image"></has>
    <has object="bio"></has>
    <has object="id"></has>
    <has object="fullname"></has>
  </concept>
  <concept name="Logo"></concept>
  <concept name="Menu"></concept>
  <concept name="Advertisement"></concept>
</ontology>

Each concept definition starts with a <concept> tag and ends with a </concept> tag. The <inherit> tag expresses an inheritance relation between two concepts, and the <has> tag expresses an attribute relation.

Every defined concept has a default attribute, the object identifier (id), which web crawlers/extractors use to correlate the attributes of the same object across different pages of a given website.
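The concept, inherit and has relations described above map naturally onto a small dictionary. The sketch below is one possible way to read an ontology section with Python's standard library; the function name parse_ontology and the shape of the returned dictionary are assumptions made for illustration, and the XML is a shortened copy of the ontology example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the ontology example on this page.
ONTOLOGY_XML = """
<ontology name="general" language="icdl:ontology">
  <concept name="artist">
    <inherit concept="person"></inherit>
    <has object="name"></has>
    <has object="id"></has>
    <has object="fullname"></has>
  </concept>
  <concept name="Logo"></concept>
</ontology>
"""


def parse_ontology(xml_text):
    """Return (ontology name, language, {concept name: relations})."""
    root = ET.fromstring(xml_text.strip())
    concepts = {}
    for concept in root.findall("concept"):
        concepts[concept.get("name")] = {
            # Parent concepts named by <inherit concept="...">.
            "inherits": [i.get("concept") for i in concept.findall("inherit")],
            # Attributes named by <has object="...">.
            "has": [h.get("object") for h in concept.findall("has")],
        }
    return root.get("name"), root.get("language"), concepts


name, language, concepts = parse_ontology(ONTOLOGY_XML)
print(language, concepts["artist"])
```

A concept such as "Logo" that declares no relations simply yields two empty lists.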

There are several predefined concepts that are common to all kinds of websites:

"Menu" - navigation bar/menu
"Logo" - design element/logo
"Content" - element that contains main textual content of the page
"Advertisement" - advertisement/banner
"External Link" - element that contains external links

Templates

The Templates section contains a number of templates, each of which describes a single group of similarly structured web pages. The XPath references or TagIDs of HTML elements link structured content to the defined concepts. A template description starts with an opening <template> tag and ends with a closing </template> tag. The <template> tag must specify the template name and the language used for the template description. Any string can be chosen as the template name, but the language must be a supported language type, e.g. "icdl:template", "rdf", "unl:expression".

Example 3. Templates section
<template name="Artist page on Yahoo! Music" language="icdl:template">
  <html_tag tagid="yent-uhdr" content="Menu"/>
  <html_tag xpath="/html/body/div[2]/div/div/div[3]/div/a/span" content="Logo"/>
  <html_tag xpath="/html/body/div/div" content="Advertisement"/>
  <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/div/h1" content="artist.name"/>
  <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/img" content="artist.image"/>
  <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[7]/td" content="artist.bio" reference="Artist Bio"/>
  <container container_xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table">
    <repeatable_block block_xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table/tbody/tr/td">
      <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table/tbody/tr/td" content="artist.track"/>
    </repeatable_block>
  </container>
</template>
<template name="Artist Bio" language="icdl:template">
  <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[2]/td/div/h1" content="artist.name"/>
  <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[7]/td" content="artist.bio"/>
</template>

A web page may contain structured repeatable content (<repeatable_block>) enclosed in one main HTML element (<container>). If a complex HTML element is already described by another template, the reference attribute can be used to point to that template, as follows:

<html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[7]/td" content="artist.bio" reference="Artist Bio"/>

This makes it possible to create hierarchical relations between template blocks, so that web crawlers can use the specified references to identify the same object on different pages of a given website.
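To make the linking between HTML elements, concepts and referenced templates more concrete, the sketch below collects every html_tag binding of each template, records whether it sits inside a repeatable_block, and checks that each reference names an existing template. It uses only Python's standard library. The function names parse_templates and dangling_references are hypothetical, and the <icdl> wrapper with its host value is an assumption added so that the shortened fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the templates example; the <icdl> wrapper and its host
# value are assumptions, not taken from the original example.
TEMPLATES_XML = """
<icdl host="http://music.yahoo.com">
  <template name="Artist page on Yahoo! Music" language="icdl:template">
    <html_tag tagid="yent-uhdr" content="Menu"/>
    <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[7]/td" content="artist.bio" reference="Artist Bio"/>
    <container container_xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table">
      <repeatable_block block_xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table/tbody/tr/td">
        <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[10]/td/table/tbody/tr/td" content="artist.track"/>
      </repeatable_block>
    </container>
  </template>
  <template name="Artist Bio" language="icdl:template">
    <html_tag xpath="/html/body/div[3]/table/tbody/tr/td[2]/table/tbody/tr[7]/td" content="artist.bio"/>
  </template>
</icdl>
"""


def parse_templates(xml_text):
    """Return {template name: [binding, ...]} for every <template> block."""
    root = ET.fromstring(xml_text.strip())
    templates = {}
    for template in root.iter("template"):
        # Identify the <html_tag> elements nested in a <repeatable_block>.
        repeated = {
            id(tag)
            for block in template.iter("repeatable_block")
            for tag in block.iter("html_tag")
        }
        bindings = []
        for tag in template.iter("html_tag"):
            kind = "xpath" if tag.get("xpath") is not None else "tagid"
            bindings.append({
                "locator": (kind, tag.get(kind)),
                "content": tag.get("content"),
                "reference": tag.get("reference"),
                "repeatable": id(tag) in repeated,
            })
        templates[template.get("name")] = bindings
    return templates


def dangling_references(templates):
    """List (template, reference) pairs that point at no known template."""
    return [
        (name, b["reference"])
        for name, bindings in templates.items()
        for b in bindings
        if b["reference"] and b["reference"] not in templates
    ]


templates = parse_templates(TEMPLATES_XML)
page = templates["Artist page on Yahoo! Music"]
for b in page:
    print(b["content"], b["locator"][0], b["repeatable"], b["reference"])
print(dangling_references(templates))
```

Evaluating the XPath expressions against a fetched page is left out on purpose: the standard library supports only a subset of XPath, so a real extractor would need a full XPath engine.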

URLs

This section defines the URLs or URL patterns that correspond to the groups of similarly structured web pages described in the Templates section. Like the Templates section, the URLs section may consist of several blocks, each of which starts with a <urls> tag and ends with a </urls> tag.

Example 4. URLs section
<urls name="Artist page on Yahoo! Music" template="Artist page on Yahoo! Music">
  <url url="http://music.yahoo.com/ar-8206256---Amy-Winehouse"/>
  <url url="http://music.yahoo.com/ar-(artist.id[[0-9]*])---(artist.fullname[[a-z,A-Z,-,0-9]*])"/>
</urls>
<urls name="Artist biography" template="Artist Bio">
  <url url="http://music.yahoo.com/ar-8206256-bio--Amy-Winehouse"/>
  <url url="http://music.yahoo.com/ar-(artist.id[[0-9]*])-bio--(artist.fullname[[a-z,A-Z,-,0-9]*])"/>
</urls>

Any string can be chosen as the name of a URLs block, but the template attribute must name a template described in the Templates section. Each block in Example 4 lists a real URL together with the URL pattern that covers it. Regular expressions are used to describe URL patterns. The concepts needed for a URL pattern definition (such as "id" and "fullname") must already be defined in the Ontology section.
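The URL patterns embed concept captures such as (artist.id[[0-9]*]) in an otherwise literal URL. The sketch below shows one way a crawler might translate such a pattern into a Python regular expression with named groups. It rests on several assumptions that the format description does not spell out: that commas separate the items of a character class, that a three-character item such as a-z is a range, and that the inner quantifier is one of *, + or ?. The pattern below spells the second range as A-Z so that it covers the mixed-case example URL. The function name wpt_url_to_regex is hypothetical.

```python
import re

# One capture in the URL pattern dialect, e.g. "(artist.id[[0-9]*])".
_CAPTURE = re.compile(
    r"\((?P<concept>\w+)\.(?P<attr>\w+)"   # "(artist.id"
    r"\[\[(?P<items>[^\]]*)\]"             # "[[0-9]"
    r"(?P<quant>[*+?]?)\]\)"               # "*])"
)


def _class_item(item):
    """Translate one comma-separated item into character-class syntax."""
    if len(item) == 3 and item[1] == "-":   # a range such as "a-z"
        return re.escape(item[0]) + "-" + re.escape(item[2])
    return re.escape(item)                  # a literal such as "-"


def wpt_url_to_regex(pattern):
    """Compile a WPT URL pattern into a regex with one group per capture."""
    parts, pos = [], 0
    for m in _CAPTURE.finditer(pattern):
        # Text between captures is matched literally.
        parts.append(re.escape(pattern[pos:m.start()]))
        char_class = "".join(_class_item(i) for i in m.group("items").split(","))
        # Group names cannot contain ".", so "artist.id" becomes "artist_id".
        group = m.group("concept") + "_" + m.group("attr")
        parts.append("(?P<%s>[%s]%s)" % (group, char_class, m.group("quant")))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


ARTIST_PAGE = (
    "http://music.yahoo.com/ar-(artist.id[[0-9]*])"
    "---(artist.fullname[[a-z,A-Z,-,0-9]*])"
)
regex = wpt_url_to_regex(ARTIST_PAGE)
match = regex.fullmatch("http://music.yahoo.com/ar-8206256---Amy-Winehouse")
print(match.groupdict())
```

Under these assumptions the example URL yields artist_id "8206256" and artist_fullname "Amy-Winehouse", while the biography URL does not match the artist page pattern, so each URL selects exactly one template.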