Tutorial "Language Resources and Linked Data" at EKAW'14
The "Language Resources and Linked Data" tutorial will be held on November 25th, 2014, during EKAW'14 in Linköping, Sweden.
Linked Data (LD) is a set of best practices for exposing, sharing, and connecting data on the Web. Over the last year, researchers working on linguistic resources have shown great interest in publishing their data as LD. There are now many good examples, involving important organizations and initiatives, that stress the opportunities offered by LD and foster the aggregation of multilingual open resources into the Linked Open Data cloud. By interlinking multilingual and open language resources, we foresee a Linguistic Linked Open Data (LLOD) cloud: a new linguistic ecosystem built on the Linked Data principles that will allow the open exploitation of such data at a global scale. In particular, these are some key benefits of Linguistic LD:
1. Provide enhanced and more sophisticated navigation through multi-language data sets,
2. Increase the visibility of linguistic data,
3. Support easier integration of linguistic information into research documents and other digital objects,
4. Support easier integration of linguistic information with (LOD) datasets, enhancing the natural language description of those datasets,
5. Facilitate re-use across linguistic datasets, thus enriching the description of materials with information coming from outside the organization’s local domain of expertise,
6. Describe language resources in RDF and make them indexable by standard semantic search engines, and
7. Allow developers and vendors to avoid being tied to domain-specific data formats and dedicated APIs.
In this tutorial, we aim to establish the basis for exposing language resources in the LLOD cloud. In particular, the tutorial will tackle the following questions:
- How to represent rich multilingual lexical information (beyond rdfs:label) and associate it with ontologies and linked data?
- How to represent and publish multilingual texts, annotations and corpora as linked data?
- How to generate multilingual linked data from data silos?
- How to perform word sense disambiguation and entity linking of multilingual linked data?
We will try to answer them in a practical way, by means of examples and hands-on exercises. The tutorial will be organized in five sections, each covering one of these topics. Each section will be divided into a theoretical introduction and a practical session. The practical work will consist of completing some short guided examples proposed by the speakers. All the instructional material, data and software required to follow the session will be available online beforehand on the tutorial webpage.
Attendees should be familiar with the basic notions of RDF and OWL. No previous experience with linked data publication is required. No prior knowledge of NLP techniques or computational linguistics is required.
The intended audience consists of PhD students, post-docs, and industry people working with language resources and NLP applications who aim at:
- Exposing linguistic resources as Linguistic Linked Open Data or Linguistic Licensed Linked Data
- Exploring how the LOD cloud represents a valuable resource for multilingual NLP.
We do not assume that the audience has any further experience with ontology languages.
Duration of the tutorial:
Full day (November 25th, 2014)
9:00 - 9:10 Welcome and introduction
9:10 - 9:30 Session 1. Foundations (slides)
10:30 - 11:00 coffee break
11:00 - 11:45 Session 2. Modelling lexical resources on the Web of Data: the lemon model (hands-on)
11:45 - 12:30 Session 3. Methodology and tools for Multilingual Linked Data generation (intro + hands-on) (slides)
12:30 - 13:30 lunch break
13:30 - 14:00 Session 3. Methodology and tools for Multilingual Linked Data generation (hands-on) (slides)
15:00 - 15:30 coffee break
15:30 - 15:45 Session 4. Integrating NLP with Linked Data and RDF: the NIF format (hands-on)
Here is an overview of the content, aims and methodology of the different sections of the tutorial:
1. Foundations
This session will provide the necessary foundations in Linked Data (LD) and knowledge representation on the Web (ontologies, RDF, SPARQL, etc.) in order to undertake practical work in the rest of the tutorial.
2. Modelling lexical resources on the Web of Data: the lemon model
lemon (Lexicon Model for Ontologies) is a widely used model for representing lexical information (words, part-of-speech, grammatical properties, etc.) relative to ontologies on the Web. We will present the usage of the model and show some of its practical applications. After presenting the lemon model, we will do a hands-on exercise in modelling a small lexicon for a popular vocabulary, e.g. FOAF.
In addition, ATOLL (A framework for the automatic induction of multilingual ontology lexica) will be presented. We will show how it can be configured to induce lexica in three different languages (German, English, Spanish). In a hands-on session we will provide participants with the opportunity to create a lexicon for their favourite ontology in three languages.
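As a rough illustration of this kind of exercise, the following Python sketch builds a one-entry lemon lexicon for the FOAF class foaf:Person and prints it as N-Triples. The lemon and FOAF namespaces are the published ones; the base URI http://example.org/lexicon/, the way entry, form and sense URIs are minted, and the helper functions are assumptions made for this example and are not taken from the tutorial material.

```python
LEMON = "http://lemon-model.net/lemon#"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical base URI for this sketch


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"@' + lang


def lemon_entry(word, lang, reference):
    """Return N-Triples lines for one lexical entry that denotes `reference`.

    `word` is assumed to be URI-safe (no spaces or special characters).
    """
    lexicon = BASE + lang
    entry = lexicon + "/" + word
    form = entry + "#form"
    sense = entry + "#sense"
    triples = [
        (uri(lexicon), uri(RDF_TYPE), uri(LEMON + "Lexicon")),
        (uri(lexicon), uri(LEMON + "language"), '"' + lang + '"'),
        (uri(lexicon), uri(LEMON + "entry"), uri(entry)),
        (uri(entry), uri(RDF_TYPE), uri(LEMON + "LexicalEntry")),
        (uri(entry), uri(LEMON + "canonicalForm"), uri(form)),
        (uri(form), uri(LEMON + "writtenRep"), literal(word, lang)),
        (uri(entry), uri(LEMON + "sense"), uri(sense)),
        (uri(sense), uri(LEMON + "reference"), uri(reference)),
    ]
    return [" ".join(triple) + " ." for triple in triples]


lines = lemon_entry("person", "en", FOAF + "Person")
print("\n".join(lines))
```

The point of the structure is that the word form lives in the lexicon, while the ontology entity is only reached through the sense and its reference, which is what allows richer lexical information than a plain rdfs:label.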
3. Methodology and tools for Multilingual Linked Data generation
In this session we will focus on the process of generating and publishing LD from a domain-specific multilingual resource. The goal is to present the main activities of the LD generation process and some tools that can be used for multilingual LD generation and linking. An extension of the lemon model to support the representation of linguistic translations will be shown.
In the hands-on session, we will guide the participants in creating a linked data version of a multilingual resource (e.g., a bilingual dictionary). The session will provide a series of simple steps and best practices and will deal with: (1) the use of a generic tool such as OpenRefine to analyze, clean and transform the source data into RDF, mapping it to existing vocabularies such as lemon, (2) the discovery of links to other Linked Data resources using a reconciliation service through SPARQL, (3) the main tasks and mechanisms for publishing the produced data on the Web using a Linked Data front-end and a SPARQL endpoint, and (4) the use of SPARQL to query the data that has been made available.
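The clean-and-transform step (1) can be pictured with a small Python sketch. It does not use OpenRefine, which is operated through its own interface; it only imitates the idea in plain Python. The sample rows, the column names, the base URI http://example.org/dict/ and the helper functions are all invented for illustration. Terms are assumed to contain no quotes or backslashes. Steps (2) to (4) need running services and are not covered here.

```python
import csv
import io
from urllib.parse import quote

LEMON = "http://lemon-model.net/lemon#"
BASE = "http://example.org/dict/"  # hypothetical base URI for this sketch

# Invented sample data standing in for a bilingual source resource.
SOURCE = """concept,en,es
http://xmlns.com/foaf/0.1/Person, person ,persona
http://xmlns.com/foaf/0.1/Document,document,  documento
"""


def clean(cell):
    """Trim the cell and collapse internal runs of whitespace."""
    return " ".join(cell.split())


def row_to_triples(row):
    """Yield N-Triples lines: one lemon entry per language, same reference."""
    concept = row["concept"].strip()
    for lang in ("en", "es"):
        term = clean(row[lang])
        entry = BASE + lang + "/" + quote(term)  # mint a URI-safe entry name
        yield f"<{entry}> <{LEMON}canonicalForm> <{entry}#form> ."
        yield f'<{entry}#form> <{LEMON}writtenRep> "{term}"@{lang} .'
        yield f"<{entry}> <{LEMON}sense> <{entry}#sense> ."
        yield f"<{entry}#sense> <{LEMON}reference> <{concept}> ."


rows = csv.DictReader(io.StringIO(SOURCE))
triples = [line for row in rows for line in row_to_triples(row)]
print("\n".join(triples))
```

In this sketch the English and Spanish entries are related only because their senses share the same reference; it does not use the translation extension of lemon mentioned in this session.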
4. Integrating NLP with Linked Data and RDF: the NIF format
The NLP Interchange Format (NIF) is an RDF/OWL-based format that provides an LD-enabled URI scheme and an ecosystem of tools to represent and publish texts, annotations and corpora on the Web of Data in an interoperable way, using ontologies such as OLiA, NERD, MARL, lemon, ITS 2.0 and DBpedia. We will present existing NIF corpora and their benefits for the training of NER systems. We will also present existing NLP tools and services with support for NIF.
In the hands-on session, we will show participants how to convert to NIF, how to build NIF services and use the converted RDF resources for validation, linking, querying and reasoning.
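To give an idea of what a conversion to NIF produces, the sketch below mints offset-based URIs of the form #char=begin,end for a text and for one mention in it, and prints the annotation as N-Triples. The NIF core and ITS RDF namespaces and the property names used are the published ones as far as we are aware; the document URI http://example.org/doc1, the sample sentence and the functions nif_uri and annotate are assumptions made for this example. Offsets are counted the way Python indexes strings, and the text is assumed to contain no quotes or backslashes.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def nif_uri(doc, begin, end):
    """Mint an offset-based URI for the span [begin, end) of a document."""
    return f"{doc}#char={begin},{end}"


def annotate(doc, text, mention, link):
    """Return N-Triples lines describing the first occurrence of `mention`."""
    begin = text.index(mention)  # raises ValueError if the mention is absent
    end = begin + len(mention)
    context = nif_uri(doc, 0, len(text))  # the whole text
    span = nif_uri(doc, begin, end)       # the annotated substring
    return [
        f"<{context}> <{RDF_TYPE}> <{NIF}Context> .",
        f'<{context}> <{NIF}isString> "{text}" .',
        f"<{span}> <{RDF_TYPE}> <{NIF}RFC5147String> .",
        f"<{span}> <{NIF}referenceContext> <{context}> .",
        f'<{span}> <{NIF}anchorOf> "{mention}" .',
        f'<{span}> <{NIF}beginIndex> "{begin}"^^<{XSD}nonNegativeInteger> .',
        f'<{span}> <{NIF}endIndex> "{end}"^^<{XSD}nonNegativeInteger> .',
        f"<{span}> <{ITSRDF}taIdentRef> <{link}> .",
    ]


text = "EKAW 2014 takes place in Sweden."
lines = annotate(
    "http://example.org/doc1",  # hypothetical document URI
    text,
    "Sweden",
    "http://dbpedia.org/resource/Sweden",
)
print("\n".join(lines))
```

Because the span URI is derived from the document URI and the offsets, two tools annotating the same text arrive at the same identifiers, which is what makes their output easy to merge and query.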
5. Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet (lexical and encyclopedic language resource) (90 minutes)
The current language explosion on the Web requires the ability to automatically analyze and understand text written in any language. This task, however, is affected by the lexical ambiguity of language, an issue addressed by two key tasks: Multilingual Word Sense Disambiguation (WSD), aimed at assigning meanings to word occurrences within text, and Entity Linking (EL), a recent task focused on finding mentions of entities within text and linking them to a knowledge base. The main differences between WSD and EL lie in the inventory used (dictionary vs. encyclopedia) and in the assumption that the mention is complete or potentially incomplete, respectively. In this session we will present the most recent work on the topic.
In a hands-on session, we will present Babelfy, a unified, multilingual WSD and EL system based on BabelNet, and show participants how to disambiguate and link text written in different languages, also producing multilingual linked data as output.
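The sketch below is not Babelfy and does not use BabelNet; it only illustrates the disambiguation task itself with a simplified Lesk-style heuristic, in which each candidate meaning of a word is scored by how many words of its gloss also occur in the sentence. The inventory, the sense identifiers and the glosses are invented, and the example is monolingual for brevity.

```python
# Hypothetical mini inventory: surface form -> {sense id: gloss}.
INVENTORY = {
    "bank": {
        "bank#finance": "institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a river or lake",
    },
}


def tokenize(text):
    """Lower-case the text and split it into words, dropping punctuation."""
    return [word.strip(".,;:!?").lower() for word in text.split()]


def disambiguate(sentence, inventory=INVENTORY):
    """Map each ambiguous word in the sentence to its best-scoring sense."""
    tokens = tokenize(sentence)
    context = set(tokens)
    result = {}
    for token in tokens:
        senses = inventory.get(token)
        if not senses:
            continue  # word is not in the inventory, nothing to decide
        # Score each sense by the overlap between its gloss and the sentence.
        scores = {
            sense_id: len(context & set(tokenize(gloss)))
            for sense_id, gloss in senses.items()
        }
        # On a tie, the sense listed first in the inventory wins.
        result[token] = max(scores, key=scores.get)
    return result


print(disambiguate("She sat on the bank of the river."))
print(disambiguate("The bank lends money to companies."))
```

An entity linker differs mainly in that its inventory is an encyclopedic knowledge base and in that a mention may be only part of the entity's full name, so candidate lookup cannot rely on exact matches as this sketch does.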
Jorge Gracia, postdoctoral researcher in the Artificial Intelligence Department of Universidad Politécnica de Madrid. His research interests include ontology matching, semantic measures, and multilinguality on the Semantic Web. He co-organised the tutorials “Enriching the Web with Ontology-lexica” at LREC’12 and “Building the Multilingual Web of Data: A Hands-on tutorial” at ISWC’14.
Daniel Vila-Suero, PhD candidate at Universidad Politécnica de Madrid. His main research areas are: Linked Data, Digital Libraries and multilingual data management and integration on the Web.
John McCrae, postdoctoral researcher at the Semantic Computing Group at CITEC in the University of Bielefeld. He was involved in the creation of the lemon model and through activities in the W3C Ontolex CG and OKFN Working Group on Open Linguistics has been involved in the creation of the LLOD cloud.
Tiziano Flati is a Ph.D. student in the Department of Computer Science at the Sapienza University of Rome. His research interests lie in the fields of Word Sense Disambiguation and Induction, taxonomy induction and large-scale data mining.
Milan Dojchinovski, research assistant, lecturer and PhD candidate at the Czech Technical University (CTU) in Prague. His main research interests include Linked Data and Web Services enhanced Linked Data aspects, Linked Data based Recommenders, NLP and Semantic Web technologies in general. He has been involved in the LinkedTV and LOD2 projects funded by the European FP7 programme, and several other projects funded by the CTU.