“Building the Multilingual Web of Data: A Hands-on tutorial” at ISWC14
"Building the Multilingual Web of Data: A Hands-on tutorial" will be held on October 20th during ISWC 2014 in Riva del Garda, Trentino, Italy.
The multilingual Web of Data can be realized as a layer of services and resources on top of the current Linked Data cloud which adds: linguistic information in different languages, mappings between data with labels in different languages, and services to dynamically access and traverse Linked Data across different languages.
Contributing to this vision, this tutorial provides an overview of the ingredients that are crucial to making the multilingual Web of Data a reality. In particular, the tutorial will tackle the following questions:
How to represent rich multilingual lexical information (beyond rdfs:label) and associate it with ontologies and linked data?
How to represent and publish multilingual texts, annotations and corpora as linked data?
How to generate multilingual linked data from data silos?
How to perform word sense disambiguation and entity linking of multilingual linked data?
How to apply these techniques to a real use case?
We will answer these questions in a practical way, by means of examples and hands-on exercises. The tutorial is organised into five sections, each covering one of these topics. Each section is divided into a theoretical introduction and a practical session. The practical work will consist of completing short guided examples proposed by the speakers. All the instructional material required to follow the session will be available beforehand on the tutorial website. In addition, every participant will receive a USB pendrive containing the data and software needed for the hands-on exercises.
Attendees should be familiar with the basic notions of RDF and OWL. No previous experience in linked data publication is required, nor is prior knowledge of NLP techniques or computational linguistics.
9:00 - 9:15 Opening: introduction and goals of the tutorial (slides)
9:15 - 10:30 Session 1: Modelling lexical resources on the Web of Data: the lemon model
10:30 - 11:00 Coffee break
11:00 - 11:30 Session 2 (1st part): Integrating NLP with Linked Data and RDF: the NIF format (introduction) (slides)
11:30 - 12:45 Session 3: Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet (lexical and encyclopedic language resource) (slides)
12:45 - 14:00 Lunch break
14:00 - 14:45 Session 2 (2nd part): Integrating NLP with Linked Data and RDF: the NIF format (hands on) (slides)
14:45 - 15:30 Session 4 (1st part): Methodology and tools for Multilingual Linked Data generation (introduction) (slides)
15:30 - 16:00 Coffee break
16:00 - 16:30 Session 4 (2nd part): Methodology and tools for Multilingual Linked Data generation (hands on) (slides)
16:30 - 17:30 Session 5: Generating multilingual variants for automatically extracted sentiment lexicons: the Eurosentiment use case (slides)
Here is an overview of the content, aims and methodology of the different sections of the tutorial:
Modelling lexical resources on the Web of Data: the lemon model (75 minutes)
lemon (Lexicon Model for Ontologies) is a widely used model for representing lexical information (words, parts of speech, grammatical properties, etc.) relative to ontologies on the Web. We will present how the model is used and show some of its practical applications. After presenting the lemon model, we will do a hands-on exercise modelling a small lexicon for DBpedia.
In addition, ATOLL (A framework for the automatic induction of multilingual ontology lexica) will be presented. We will show how it can be configured to induce lexica in three different languages (German, English, Spanish).
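To give a flavour of what such a lexicon looks like, here is a minimal sketch that emits a lemon-style lexical entry as Turtle. The ex: namespace, the entry identifiers, and the choice of DBpedia class are illustrative assumptions, not the tutorial's own exercise material; lemon's core properties (LexicalEntry, canonicalForm, writtenRep, sense, reference) are used as documented by the model.

```python
# Sketch only: a tiny lemon-style entry for a DBpedia class, as Turtle.
# The ex: names below are hypothetical; lemon terms follow the core model.

LEMON = "http://lemon-model.net/lemon#"
DBPEDIA = "http://dbpedia.org/resource/"

def lemon_entry(entry_id, written_rep, lang, reference):
    """Return Turtle for one lemon LexicalEntry with a canonical
    form and a sense pointing at an ontology reference."""
    return "\n".join([
        "@prefix lemon: <%s> ." % LEMON,
        "@prefix ex: <http://example.org/lexicon#> .",
        "",
        "ex:%s a lemon:LexicalEntry ;" % entry_id,
        '    lemon:canonicalForm [ lemon:writtenRep "%s"@%s ] ;'
        % (written_rep, lang),
        "    lemon:sense [ lemon:reference <%s> ] ." % reference,
    ])

print(lemon_entry("capital_en", "capital", "en", DBPEDIA + "Capital_city"))
```

The same entry could carry further forms (plurals, inflections) or multiple senses, which is where lemon goes beyond a flat rdfs:label.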
Integrating NLP with Linked Data and RDF: the NIF format (75 minutes)
The NLP Interchange Format (NIF) is an RDF/OWL-based format that provides a Linked-Data-enabled URI scheme and an ecosystem of tools for representing and publishing texts, annotations and corpora on the Web of Data in an interoperable way, using ontologies such as OLiA, NERD, MARL, lemon, ITS 2.0 and DBpedia. We will also present existing NIF corpora and their benefits for training NER and other NLP tools, as well as for special tasks like ontology learning.
In the hands-on session, we will show participants how to convert to and from common formats such as CoNLL, and how to use the converted RDF resources for validation, linking, querying and reasoning.
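The core of NIF's URI scheme is that text fragments are addressed by character offsets (fragments of the form #char=begin,end anchored in a context resource). The sketch below illustrates this, assuming a hypothetical document URI and using the itsrdf:taIdentRef property for the identity link; it is a simplified hand-rolled serialisation, not output of the NIF tooling itself.

```python
# Sketch of NIF's offset-based URI scheme: the substring text[begin:end]
# becomes a URI <base#char=begin,end> tied back to its nif:Context.
# The document URI and example sentence are hypothetical.

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"

def nif_annotation(base, text, begin, end, ident_ref):
    """Emit Turtle for one annotated fragment of `text`."""
    context = "<%s#char=0,%d>" % (base, len(text))
    phrase = "<%s#char=%d,%d>" % (base, begin, end)
    return "\n".join([
        "@prefix nif: <%s> ." % NIF,
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "",
        "%s a nif:Context ;" % context,
        '    nif:isString "%s" .' % text,
        "",
        "%s a nif:String ;" % phrase,
        '    nif:anchorOf "%s" ;' % text[begin:end],
        "    nif:beginIndex %d ; nif:endIndex %d ;" % (begin, end),
        "    nif:referenceContext %s ;" % context,
        "    itsrdf:taIdentRef <%s> ." % ident_ref,
    ])

doc = "Riva del Garda hosts ISWC 2014."
print(nif_annotation("http://example.org/doc1", doc, 0, 14,
                     "http://dbpedia.org/resource/Riva_del_Garda"))
```

Because every annotation is plain RDF anchored to stable offset URIs, annotations from different tools over the same text can be merged, queried and validated together.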
Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet (lexical and encyclopedic language resource) (75 minutes)
The current language explosion on the Web requires the ability to automatically analyze and understand text written in any language. This task, however, is affected by the lexical ambiguity of language, an issue addressed by two key tasks: Multilingual Word Sense Disambiguation (WSD), aimed at assigning meanings to word occurrences within text, and Entity Linking (EL), a recent task focused on finding mentions of entities within text and linking them to a knowledge base. The main differences between WSD and EL are in the inventory used (dictionary vs. encyclopedia) and in the assumption that the mention is complete or potentially incomplete, respectively. In this session we will present the most recent work on these topics.
In a hands-on session, we will present Babelfy, a unified, multilingual WSD and EL system based on BabelNet, and show participants how to disambiguate and link text written in different languages, also producing multilingual linked data as output.
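The "linked data as output" part can be pictured as follows. This sketch does not use the real Babelfy API or its response format; the record layout ("start", "end", "dbpedia") is an assumed, simplified shape for a disambiguation result, converted into N-Triples that link each mention to a knowledge-base entry.

```python
# Illustrative only: turn simplified disambiguation results into
# N-Triples. The mention record shape and document URI are assumptions,
# not the actual Babelfy output format.

def links_to_ntriples(doc_uri, text, mentions):
    """Link each (start, end) text fragment to its target entity."""
    triples = []
    for m in mentions:
        frag = "<%s#char=%d,%d>" % (doc_uri, m["start"], m["end"])
        triples.append(
            "%s <http://www.w3.org/2005/11/its/rdf#taIdentRef> <%s> ."
            % (frag, m["dbpedia"]))
    return "\n".join(triples)

text = "Thomas and Mario are strikers playing in Munich."
mentions = [
    {"start": 41, "end": 47,
     "dbpedia": "http://dbpedia.org/resource/Munich"},
]
print(links_to_ntriples("http://example.org/doc2", text, mentions))
```

A multilingual system like Babelfy can produce such links for text in many languages against one shared inventory (BabelNet), which is what makes the output usable as multilingual linked data.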
Methodology and tools for Multilingual Linked Data generation (75 minutes)
In this session we will focus on the process of generating and publishing LD from a domain specific multilingual resource. The goal is to present the main activities of the LD generation process and some tools that can be used for multilingual LD generation and linking. An extension of the lemon model to support the representation of linguistic translations will be shown.
In the hands-on session, we will guide the participants in creating a linked data version of a multilingual resource (e.g., a bilingual dictionary). The session will proceed through a series of simple steps and best practices, dealing with: (1) using tools (e.g., OpenRefine) to transform the source data into RDF, (2) discovering links to other Linked Data resources, and (3) the main tasks and mechanisms for publishing the resulting data on the Web.
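Step (1) can be sketched in a few lines. The example below converts rows of a toy bilingual dictionary into Turtle; the ex: and trans: vocabularies stand in for whatever the session actually uses (e.g. the lemon translation extension mentioned above) and are assumptions for illustration only.

```python
# Sketch of step (1): rows of a bilingual dictionary -> RDF (Turtle).
# The ex:/trans: names are hypothetical stand-in vocabularies.

import csv
import io

def dictionary_to_turtle(csv_text):
    """Convert 'en,es' CSV rows into labelled entries plus a
    translation link between each pair."""
    out = ["@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
           "@prefix ex: <http://example.org/dict#> .",
           "@prefix trans: <http://example.org/translation#> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        en, es = row["en"], row["es"]
        out.append('ex:%s_en rdfs:label "%s"@en .' % (en, en))
        out.append('ex:%s_es rdfs:label "%s"@es .' % (es, es))
        out.append("ex:%s_en trans:translation ex:%s_es ." % (en, es))
    return "\n".join(out)

sample = "en,es\nhouse,casa\nriver,río\n"
print(dictionary_to_turtle(sample))
```

Steps (2) and (3) would then link these entries to external datasets (e.g. DBpedia) and publish the result with dereferenceable URIs and a SPARQL endpoint.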
Generating multilingual variants for automatically extracted sentiment lexicons: the Eurosentiment use case (60 minutes)
We present a configurable pipeline for legacy language resource adaptation that generates entity-level, domain-specific sentiment lexicons covering sentiment words that occur in the context of these entities. We explain the components of the pipeline and how it can be configured for different use cases, including but not limited to sentiment analysis. The pipeline also includes a component for generating translations, based on a statistical machine translation approach and multilingual links from BabelNet. The resulting lexicons are modelled as Linked Data resources using established formats for Linguistic Linked Data and linked sentiment expressions (lemon, Marl).
In the hands-on session, attendees will be able to explore the generated RDF lexicons at http://www.eurosentiment.eu/dataset and try out a set of sample queries on this data at http://184.108.40.206/eurosentiment/sparql-demo/.
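As a rough picture of what a lemon-plus-Marl sentiment entry might look like, here is a hedged sketch. The namespaces, property choices and scores below are assumptions for illustration, not the exact EuroSentiment modelling.

```python
# Hedged sketch: one plausible shape for a sentiment lexicon entry
# combining a lemon-style lexical entry with Marl sentiment properties.
# All names and values here are illustrative assumptions.

def sentiment_entry(word, lang, polarity_value):
    """Emit Turtle for a word with a polarity class and score."""
    polarity = "marl:Positive" if polarity_value > 0 else "marl:Negative"
    return "\n".join([
        "@prefix marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#> .",
        "@prefix lemon: <http://lemon-model.net/lemon#> .",
        "@prefix ex: <http://example.org/sentiment#> .",
        "",
        "ex:%s_%s a lemon:LexicalEntry ;" % (word, lang),
        '    lemon:canonicalForm [ lemon:writtenRep "%s"@%s ] ;'
        % (word, lang),
        "    marl:hasPolarity %s ;" % polarity,
        "    marl:polarityValue %.1f ." % polarity_value,
    ])

print(sentiment_entry("excellent", "en", 0.9))
```

Keeping the lexical side in lemon means the same entry can carry translations generated by the pipeline, so one lexicon serves several languages.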
Jorge Gracia [primary contact] is a postdoctoral researcher in the Artificial Intelligence Department of Universidad Politécnica de Madrid. His research interests include ontology matching, semantic measures, and the multilingual Semantic Web. He co-chairs the W3C community group on “Best Practices for Multilingual Linked Open Data”. He co-organised the tutorial on “Enriching the Web with Ontology-lexica” at LREC’12 and participates in the tutorial on “Linked Data for Language Technologies” at LREC’14.
Daniel Vila-Suero is a PhD candidate at Universidad Politécnica de Madrid. His main research areas are Linked Data, digital libraries, and multilingual data management and integration on the Web.
John McCrae is a research associate at the Cognitive Interaction Technology Excellence Center at Bielefeld University, where his work has mainly focussed on the ontology-lexicon interface and multilinguality. He received an MSci from Imperial College London in 2006 and a PhD from the National Institute of Informatics in 2009. Since then, he has been involved in the Monnet Project on Multilingual Ontologies and in the LIDER project.
Sebastian Walter is a doctoral candidate at the Semantic Computing Group at Bielefeld University. He is currently working under the supervision of Philipp Cimiano on the topic of automatically inducing lemon lexica from corpora in multiple languages. He is the main developer of the ATOLL system that will be presented in this tutorial.
Ciro Baron Neto started his Ph.D. at the University of Leipzig in 2014 with the AKSW research group. He is part of the NLP2RDF group and is involved in several Semantic Web projects such as DBpedia DataID and the NLP Interchange Format (NIF).
Martin Brümmer (AKSW, Universität Leipzig, Germany, firstname.lastname@example.org) started as a researcher at the AKSW technology lab in December 2013. He is a contributor to the NLP2RDF and DBpedia projects and was co-chair of the Multilingual Linked Data for Enterprises (MLODE) 2012 workshop. He contributed to the development of the Linguistic Linked Open Data Cloud (http://linguistics.okfn.org/resources/llod/). His research focuses on Linguistic Linked Open Data, NLP in the Semantic Web, and Open Government Data.
Roberto Navigli is an associate professor in the Department of Computer Science at the Sapienza University of Rome. He is the recipient of an ERC Starting Grant in computer science and informatics on multilingual word sense disambiguation (2011-2016) and a co-PI of a Google Focused Research Award on Natural Language Understanding. His research interests lie in the field of Word Sense Disambiguation and Induction, multilingual knowledge acquisition, Open Information Extraction and applications of lexical semantics.
Tiziano Flati is a Ph.D. student in the Department of Computer Science at the Sapienza University of Rome. His research interests lie in the fields of Word Sense Disambiguation and Induction, taxonomy induction and large-scale data mining.
Gabriela Vulcu is a Research Associate at the INSIGHT Centre for Data Analytics, National University of Ireland, Galway. Her research interests include big data modelling, linguistic linked data and semantic technologies.