ISWC 2019 Tutorial

Semantic Web and Provenance for Scientific Reproducibility (SPSR)

Tutorial co-located with ISWC 2019

Scientific reproducibility is in crisis in multiple disciplines, including the social sciences, the natural sciences, and biomedical research. This crisis has been highlighted by a growing number of community and government initiatives, for example the European Union Open Research Data, the US National Institutes of Health (NIH) “Rigor and Reproducibility” guidelines, Research Data Alliance (RDA), Force11, and DataONE projects. The semantic web, together with provenance, software, and method metadata, plays a central role in facilitating scientific reproducibility.

The objective of this tutorial is to provide a landscape of the tools, standards and guidelines that an author should follow to make their work (i.e., data, software, methods, provenance and context) reproducible.

The first part of the tutorial will introduce a framework for different levels of reproducibility desired in scientific research and underline the challenges involved in achieving each of them.

In the second part, we will describe the role of semantic web standards, including the W3C PROV specifications, in scientific reproducibility.

Finally, in the third part of the tutorial, we will present real-world examples of provenance-enabled scientific reproducibility projects: the ProvCaRe project, which hosts the largest repository of semantic provenance extracted from biomedical literature for evaluating reproducibility; OntoSoft for tracking software metadata; Research Objects for packaging and annotating scientific research outputs; and W2Share for converting scripts into reproducible workflows.

The SPSR tutorial will conclude with a discussion of open challenges in scientific reproducibility and the potential role of semantic web research in addressing these challenges.

Tutorial Schedule

This 3-hour tutorial will be split into three parts:

Part I. Reproducibility and Provenance Primer

We start our tutorial by introducing a succinct terminology to characterize reproducibility, based on a review of the various definitions provided in the literature. Specifically, we retain four terms that cover the different levels of reproducibility that can be achieved or desired, described below.

Repeat. An experiment is said to be repeated when it is performed in the same lab (or computational environment) as the original experiment, that is, in the same scientific environment. The main goal of repeating an experiment is to check whether the initial experiment was correct and can be performed again.
Replicate. An experiment is said to be replicated when it is performed in a different lab (or computational environment) than the original experiment. When replicated, a result has a high level of robustness compared with the repeat case: the result remains valid in a similar (even though different) setting.
Reproduce. Reproduce is defined in the broadest possible sense of the term and denotes the situation where an experiment is performed within a different set-up but with the aim to validate the same scientific hypothesis. In other words, completely different approaches can be designed, completely different data sets can be used, as long as both experiments converge to the same scientific conclusion.
Reuse. A last, very important concept related to reproducibility is reuse, which denotes the case where a different experiment is performed that has similarities with an original experiment. A specific kind of reuse occurs when a single experiment is reused in a new context (and thus adapted to new needs); the experiment is then said to be repurposed.
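
The four terms above can be read as a small decision rule. The following Python sketch is one simplified interpretation and not a definition given in the tutorial: the helper `classify` and its three boolean parameters are hypothetical, and real studies are rarely this clear-cut.

```python
from enum import Enum


class Level(Enum):
    REPEAT = "repeat"
    REPLICATE = "replicate"
    REPRODUCE = "reproduce"
    REUSE = "reuse"


def classify(same_experiment: bool, same_environment: bool, same_hypothesis: bool) -> Level:
    """Map a new study onto one of the four terms (hypothetical helper).

    Assumes the new study has at least some similarity with the original.
    """
    if same_experiment:
        # Same protocol: the environment separates repeat from replicate.
        return Level.REPEAT if same_environment else Level.REPLICATE
    if same_hypothesis:
        # Different set-up, but aiming to validate the same hypothesis.
        return Level.REPRODUCE
    # A different experiment that builds on the original one.
    return Level.REUSE


if __name__ == "__main__":
    print(classify(same_experiment=True, same_environment=False, same_hypothesis=True))
```

Ordering the checks from the strictest condition (same experiment, same environment) to the loosest mirrors the way the four terms progressively relax what must stay fixed.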

In addition, we will introduce notions of provenance using the W7 model, which involves Why, Who, When, Where, Which, What, and How. We will conclude by presenting the W3C PROV model in detail to demonstrate the critical role of provenance-based solutions in scientific reproducibility.
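
To give a flavor of what a W3C PROV description looks like, here is a minimal sketch in Python that builds a handful of PROV-O statements as plain triples and prints them in N-Triples syntax. The class and property names (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:wasAssociatedWith`, `prov:wasAttributedTo`) come from the PROV-O vocabulary; the `http://example.org/` namespace, the resource names, and the helper functions are assumptions made purely for illustration.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for this example


def build_trace():
    """Return PROV-O triples describing one (made-up) analysis run."""
    raw = EX + "raw_data.csv"
    result = EX + "result.csv"
    run = EX + "analysis_run_1"
    alice = EX + "alice"
    return [
        # What kind of thing each resource is.
        (raw, RDF_TYPE, PROV + "Entity"),
        (result, RDF_TYPE, PROV + "Entity"),
        (run, RDF_TYPE, PROV + "Activity"),
        (alice, RDF_TYPE, PROV + "Agent"),
        # How the resources relate to each other.
        (run, PROV + "used", raw),
        (result, PROV + "wasGeneratedBy", run),
        (result, PROV + "wasDerivedFrom", raw),
        (run, PROV + "wasAssociatedWith", alice),
        (result, PROV + "wasAttributedTo", alice),
    ]


def to_ntriples(triples):
    """Serialize IRI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(build_trace()))
```

In practice, an RDF or PROV library would typically be used instead of hand-built strings; the sketch only shows the shape of the statements.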

Part II. W3C PROV-Based Solutions for Facilitating Reproducibility

In the second part, we present provenance-based solutions that have been developed by the semantic web community for fostering the reproducibility of scientific research. Such solutions cover a large spectrum of uses that facilitate scientific reproducibility. For example, provenance can be used for partially automating the annotation, summarization and reporting of experiment results. Indeed, provenance has been used as a means for inferring annotations on the artifacts (data) used and generated by experiments, as well as for partially automating the reporting of experiment results.

Provenance can also be used for comparing experiments. In particular, provenance has been used to compare the executions of experiments that are instances of the same pipeline (workflow). This can be used to check (at least) the repeatability of experiments. Provenance can also be utilized for debugging experiments by helping users pinpoint the steps (modules) and/or datasets that are responsible for errors in computational experiments. Provenance has also been used in versioning experiments, in reducing the time and resources required for experiment re-execution using smart reruns, and in reusing experiments or parts thereof.
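
As a rough illustration of provenance-based comparison, the Python sketch below compares two executions of the same workflow and reports the steps where they diverge. It assumes, for the sake of the example, that each run's provenance has been reduced to a mapping from step name to content hashes of what the step used and generated; the function names, record layout, and sample data are all hypothetical and do not correspond to any specific tool presented in the tutorial.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash standing in for the identity of a data artifact."""
    return hashlib.sha256(data).hexdigest()


def record(inputs: bytes, outputs: bytes) -> dict:
    """Minimal per-step provenance record (hypothetical layout)."""
    return {"used": digest(inputs), "generated": digest(outputs)}


def diverging_steps(run_a: dict, run_b: dict) -> dict:
    """Report the steps on which two runs of the same workflow differ."""
    report = {}
    for step in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(step), run_b.get(step)
        if a is None or b is None:
            report[step] = "missing in one run"
        elif a["used"] != b["used"]:
            report[step] = "different inputs"
        elif a["generated"] != b["generated"]:
            # Same inputs but different outputs: a candidate culprit step.
            report[step] = "same inputs, different outputs"
    return report


# Two made-up runs: only the "fit" step produces a different output.
run_a = {"clean": record(b"raw-v1", b"clean-v1"), "fit": record(b"clean-v1", b"model-v1")}
run_b = {"clean": record(b"raw-v1", b"clean-v1"), "fit": record(b"clean-v1", b"model-v2")}

if __name__ == "__main__":
    print(diverging_steps(run_a, run_b))
```

An empty report suggests that the second run repeated the first. A step flagged with identical inputs but different outputs is a natural place to start debugging, and steps that are not flagged are candidates for skipping in a smart rerun.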

We will present concrete examples in the tutorial that showcase the above uses of provenance and discuss how they promote scientific reproducibility. In addition to presenting the use of provenance for facilitating reproducibility, we will discuss proposals and initiatives that seek to automatically record provenance for reproducibility purposes, e.g., ReproZip and Research Objects.

Part III. Where Are We Today? Challenges in Scientific Reproducibility

The last part of the tutorial will focus on drawing a map of the state of the art (presented in the second part), highlighting how these solutions cater for specific aspects of reproducibility. We will also discuss the role of provenance in international reproducibility-related efforts, with a particular focus on FAIR, RDA, and DataONE. Finally, we will discuss open issues (opportunities and challenges) that need to be tackled and the instrumental role that the semantic web community can play in this critical aspect of scientific research.

What will you learn?

The three learning outcomes of the tutorial are:

Gain an understanding of the role of semantic web standards and in particular provenance metadata in enabling scientific reproducibility through the paradigm of Repeat, Replicate, Reproduce, Reuse.

Gain an understanding of the role of W3C PROV specifications in supporting scientific reproducibility. In particular, the attendees will gain insight into the ongoing projects in scientific reproducibility using provenance and semantic web standards.

Appreciate the challenges facing scientific reproducibility and the open issues that are opportunities for the semantic web community to lead the development of solutions for reproducibility.

Tutorial Organizers

Satya Sahoo

Case Western Reserve University

Satya S. Sahoo is an Associate Professor at Case Western Reserve University (CWRU) in Cleveland, OH, USA. Satya's research focuses on the Semantic Web, including (1) ontology engineering (from upper-level reference ontologies to application- and domain-specific ontologies), (2) ontology-driven data integration and query optimization, and (3) provenance metadata management. His current research projects include the Provenance for Clinical and Health Research (ProvCaRe) project, which has developed an ontology-driven Natural Language Processing workflow to extract provenance metadata from all 1.6 million full-text articles in the PubMed repository. The ProvCaRe project currently hosts the largest repository of biomedical provenance metadata, consisting of 166 million provenance triples for query and analysis. Satya has served as a member of the W3C working group that defined the new provenance standard called PROV and is co-editor of the PROV-O specifications. His research has been funded by the US National Institutes of Health (NIH) Big Data to Knowledge Provenance initiative.

Khalid Belhajjame

University Paris-Dauphine

Khalid Belhajjame is an Associate Professor at the University Paris-Dauphine. Before moving to Paris, he was a researcher for several years at the University of Manchester, and prior to that a Ph.D. student at the University of Grenoble. His research interests lie in the areas of information and knowledge management. He has made key contributions in the areas of pay-as-you-go data integration, e-Science, scientific workflow management, provenance tracking and exploitation, and semantic web services. He has published over 60 papers on these topics. Most of his research proposals were validated against real-world applications from the fields of astronomy, biodiversity and life sciences. He is a member of the editorial boards of the Elsevier journals Data in Brief and XMethod, has participated in multiple European-, French- and UK-funded projects, and has been an active member of the W3C Provenance working group and the NSF-funded DataONE working group on scientific workflows and provenance.