Internal:Speakers/IM1

IM1: Ina O' Murchu, et al

Title: Free Knowledge (ID IM1)

Language: English or German | License: ?
Room size: small
Category: technical
type: presentation

Author[s]: Ina O' Murchu, Hannes Gassert, Andreas Harth.

Contact: Ina O' Murchu <ina.omurchu@deri.org>, Andreas Harth <andreas.harth@deri.org>, Wikimania program team <cfp@wikimedia.org> [OTRS]
Contacted by: JV
(From: Ireland, Switzerland? | available days: any)

Abstract

Wikipedia, the free-content online encyclopedia, contains approximately 1.5 million articles, more than 500,000 of which are published in English, and receives around 50 million hits a day. It has become one of the most important single knowledge sources on the Web. Wikipedia is currently used mainly by humans who search and browse through its HTML user interface, which is optimized for onscreen display. Web crawlers try to work with this rich body of content as well.

In contrast to web sites targeting human users, data offered in a machine-understandable format is free from such constraints: it can be processed, integrated, combined and mapped to different systems and vocabularies with ease. Unlike HTML, such data is much more useful to software than it is to humans, but it multiplies the potential of the information it encodes.

What is available at the moment is a documented database schema, which is suboptimal for information exchange across sites. Semi-structured data (RDF, XML) can be self-describing and can carry its schema implicitly in the data, which facilitates data exchange and integration. The current data set in Wikipedia is not generally machine-processable. Making it machine-processable could open Wikipedia to a broad range of use cases and data-consuming agents; one such use case is adding Wikipedia articles to search results, a goal that big players like Yahoo and Google are pursuing as well.

One means of making Wikipedia machine-understandable is through the use of a formal ontology.

An ontology is a formal specification of how to represent the entities in a specific area and the interrelations among them. In the Semantic Web, ontologies can be used to share and reuse knowledge via the Web, and they can be seen as a means for knowledge management on a global scale. A specific Wikipedia ontology can be built to integrate Wikipedia into the Semantic Web framework and make Wikipedia machine-processable and machine-understandable. Through the use of RDF (Resource Description Framework, a W3C Recommendation) and URIs, Wikipedia content could be identified, described, linked and combined with other data sources.

For example, Wikipedia URIs can be used to denote the subject of a document or to annotate photos. In fact, Wikipedia URLs can become general URIs identifying concepts in the Semantic Web, enabling the Semantic Web community to leverage the structured knowledge collected and maintained by Wikipedia. In that sense, ontologizing and "RDFizing" Wikipedia can build a bridge between these two highly productive communities and allow for all sorts of "cross-pollination" between them.
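
As a rough illustration of the annotation idea, the following Python sketch writes a single N-Triples statement that uses a Wikipedia URL as the subject of a photo. The photo URL and the helper name ntriple are hypothetical, and the Dublin Core "subject" property is only one plausible choice of predicate, not necessarily the one the proposed ontology would use.

```python
# Sketch only: the photo URL and the helper name are illustrative assumptions.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one statement whose three terms are all URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


photo = "http://example.org/photos/matterhorn.jpg"   # hypothetical photo
concept = "http://en.wikipedia.org/wiki/Matterhorn"  # Wikipedia URL used as a concept URI

statement = ntriple(photo, DC_SUBJECT, concept)
print(statement)
```

Any application that recognizes the Wikipedia URL can then tell that the photo and the article are about the same concept.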

RDF is a language for representing information about resources on the Web. It is intended in particular for representing metadata about Web resources, such as title, creator, and date, but as the border between data and metadata is blurring, expressing both the content and the structure of the entire encyclopedia becomes workable - and desirable. RDF is intended primarily for processing by software applications rather than for display to people, and it provides a common graph-based data model so that information can be exchanged between applications without loss of meaning.
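
To make the graph-based data model more concrete, here is a minimal Python sketch that treats a graph as a set of (subject, predicate, object) triples and merges two graphs by set union. All URIs, the locatedIn property and the sample data are assumptions made for illustration. Real RDF also has blank nodes and typed literals, which this sketch ignores.

```python
from typing import List, Set, Tuple

# Sketch only: all URIs and the example data are illustrative assumptions.
Triple = Tuple[str, str, str]

WIKI = "http://en.wikipedia.org/wiki/"
DC = "http://purl.org/dc/elements/1.1/"

# Statements derived from an encyclopedia article (hypothetical data).
wikipedia_graph: Set[Triple] = {
    (WIKI + "Zurich", DC + "title", "Zurich"),
    (WIKI + "Zurich", "http://example.org/ont#locatedIn", WIKI + "Switzerland"),
}

# Statements from an unrelated source that reuses the same URI.
photo_graph: Set[Triple] = {
    ("http://example.org/photos/1.jpg", DC + "subject", WIKI + "Zurich"),
    (WIKI + "Zurich", DC + "title", "Zurich"),
}

# Merging two graphs is a set union; identical statements collapse into one.
merged = wikipedia_graph | photo_graph


def about(graph: Set[Triple], uri: str) -> List[Triple]:
    """Return every statement that has uri as its subject or object, sorted."""
    return sorted(t for t in graph if uri in (t[0], t[2]))


for triple in about(merged, WIKI + "Zurich"):
    print(triple)
```

Because both sources name the city with the same URI, their statements connect in the merged graph without any extra mapping step.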

With a Wikipedia ontology that is extensible, non-proprietary and interoperable across the Internet, people could easily access Wikipedia data from a variety of software programs and reuse it in different application scenarios.

We propose to use an ontology, expressed in the Web Ontology Language (OWL), to describe the schema of the Wikipedia dataset. In this paper, we describe the main concepts and relations in our proposed ontology, derived from both the HTML rendering and the original relational data. We present a method for converting the instances into a format adhering to the ontology, along with a PHP5 implementation of such a converter that relies heavily on regular expressions. After sketching how the converted data can be integrated with other datasets such as WordNet or the world of FOAF, we discuss how to display and browse the integrated dataset and share our practical experiences and lessons learned in developing "RDFizers" for large-scale interconnected knowledge bases.
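
The following Python sketch hints at what a regular-expression-based conversion step might look like. It is not the PHP5 converter described in this abstract. The linksTo property, the ontology namespace and the function names are hypothetical, and the pattern only handles simple internal links in wiki markup. Namespace links such as categories or images are not treated specially, and titles are not percent-encoded.

```python
import re
from typing import List

# Sketch only: the base URIs and the linksTo property are illustrative assumptions.
WIKI = "http://en.wikipedia.org/wiki/"
LINKS_TO = "http://example.org/wikipedia-ontology#linksTo"

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# the capture group holds the target title only.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def title_to_uri(title: str) -> str:
    """Turn an article title into a URI by replacing spaces with underscores."""
    return WIKI + title.strip().replace(" ", "_")


def links_to_ntriples(article: str, wikitext: str) -> List[str]:
    """Emit one N-Triples line per distinct internal link, in order of first appearance."""
    subject = title_to_uri(article)
    seen: List[str] = []
    for target in WIKILINK.findall(wikitext):
        uri = title_to_uri(target)
        if uri not in seen:
            seen.append(uri)
    return [f"<{subject}> <{LINKS_TO}> <{uri}> ." for uri in seen]


sample = (
    "The [[Semantic Web]] builds on [[Resource Description Framework|RDF]] "
    "and [[Semantic Web#History|more]]."
)
for line in links_to_ntriples("Web Ontology Language", sample):
    print(line)
```

A production converter would more likely work from the relational data where it is available and fall back to pattern matching only for structure that exists solely in the markup.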

About the author[s]:

O' Murchu, Harth: Digital Enterprise Research Institute, http://deri.org/
Gassert: Mediagonal AG, http://www.mediagonal.ch/

Status information in the templates is not up to date. Please see Internal:Speakers/Categories for final status information.

accept: JV (good combination with MK2)
reject: EM (one semantic web talk is enough, this one doesn't look particularly interesting), AB (doesn't seem relevant to this audience)
status: