Internal:Speakers/CY1

From Wikimania

CY1: Wikipedia as Multilingual Comparable Corpora

Title: Wikipedia as Multilingual Comparable Corpora (ID CY1)

  • Language: English | License: ? (ask them for GFDL/CC?)
  • Room size:
  • Category: Languages
  • type: paper only
  • Budget requirements: {{{budgetrequire}}}
  • Budget priority: {{{budget}}}

Author[s]: Changhua Yang, Hsin-Hsi Chen.

  • Contact: {d91013, hhchen}@csie.ntu.edu.tw [OTRS]
  • Contacted by: ?
  • (From: Taiwan | available days: )
Abstract

This paper explores the possibility of employing Wikipedia as multilingual comparable corpora for the computational linguistics and linguistics communities. Compared with attempts to use the Web as a language corpus, Wikipedia offers higher-quality language content while maintaining the same up-to-date availability and real-time accessibility. While multilingual phenomena are only implicitly embedded in the Web, Wikipedia explicitly separates its languages into sub-sites. Articles in different languages, such as English, German, Japanese, and Chinese, can be associated pairwise by tracing their interlanguage links, and further grouped into cross-lingual clusters by merging those links.
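The merging of pairwise links into clusters can be illustrated with a minimal sketch. It assumes the links are already available as pairs of (language, title) nodes and uses a union-find structure to merge them; the function name, the data layout, and the toy links are illustrative assumptions, not the authors' implementation or data.

```python
from collections import defaultdict


def cluster_interlanguage_links(links):
    """Group (lang, title) nodes into clusters by merging pairwise links.

    `links` is an iterable of ((lang, title), (lang, title)) pairs.
    This is a generic union-find sketch, not the authors' method.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for source, target in links:
        union(source, target)

    # Collect every node under its root to form the clusters.
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Toy links for illustration only (hypothetical input, not from the paper).
toy_links = [
    (("en", "Taipei"), ("de", "Taipeh")),
    (("de", "Taipeh"), ("ja", "台北市")),
    (("zh", "臺北市"), ("en", "Taipei")),
    (("en", "Berlin"), ("de", "Berlin")),
]
clusters = cluster_interlanguage_links(toy_links)
```

Because links are merged transitively, the Japanese and Chinese titles end up in the same cluster even though no direct link between them appears in the toy input.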

In our work, titles and articles are extracted from four Wikipedia sub-sites ({en, de, ja, zh}.wikipedia.org). Clusters of titles and articles, as shown in Table 1, form two kinds of comparable language resources: multilingual title corpora and multilingual text corpora. Drawing on these corpora, we focus on named entities, that is, names of people, locations, and organizations. In a human language processing setting, such named entities are usually not listed in traditional dictionaries, and special treatments such as Named Entity Recognition (NER) methods have been proposed to overcome this coverage problem. Unfortunately, recognizing a translated or transliterated foreign name is even harder than recognizing a native one. Wikipedia appears not to suffer from this problem, because all names are translated or transliterated manually into the other languages. We take this case to examine the possibility of extracting knowledge from Wikipedia to improve cross-lingual NER performance.
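As a rough illustration of how a multilingual title corpus could support cross-lingual name lookup, the sketch below turns title clusters into a bilingual name lexicon. It assumes each cluster is a set of (language, title) pairs with at most one title per language; the function name and the toy clusters are hypothetical and do not reproduce the paper's data or procedure.

```python
def build_name_lexicon(clusters, source_lang, target_lang):
    """Map source-language titles to target-language titles.

    Assumes each cluster holds at most one title per language;
    clusters missing either language are skipped.
    """
    lexicon = {}
    for cluster in clusters:
        titles = dict(cluster)  # {lang: title} for this cluster
        if source_lang in titles and target_lang in titles:
            lexicon[titles[source_lang]] = titles[target_lang]
    return lexicon


# Toy clusters for illustration only (hypothetical, not from Table 1).
toy_clusters = [
    {("en", "Taipei"), ("de", "Taipeh"), ("zh", "臺北市")},
    {("en", "Berlin"), ("de", "Berlin")},
]
en_to_zh = build_name_lexicon(toy_clusters, "en", "zh")
en_to_de = build_name_lexicon(toy_clusters, "en", "de")
```

A lexicon of this kind could serve as a gazetteer: a name found in source-language text is looked up directly instead of being recognized and transliterated from scratch.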

About the author[s]: Department of Computer Science and Information Engineering


Status information in the templates is not up to date. Please see Internal:Speakers/Categories for final status information.


  • accept: JV (very interesting), AB, PD
  • reject:
  • status: