Scientific reproducibility is in crisis across multiple disciplines, including the social sciences, natural sciences, and biomedical research. This crisis has been highlighted by a growing number of community and government initiatives, for example the European Union Open Research Data initiative, the US National Institutes of Health (NIH) “Rigor and Reproducibility” guidelines, and the Research Data Alliance (RDA), Force11, and DataONE projects. Semantic web technologies, together with provenance, software, and method metadata, play a central role in facilitating scientific reproducibility.
The objective of this tutorial is to provide a landscape of the tools, standards and guidelines that an author should follow to make their work (i.e., data, software, methods, provenance and context) reproducible.
The SPSR tutorial will conclude with a discussion of open challenges in scientific reproducibility and the potential role of semantic web research in addressing these challenges.
This 3-hour tutorial will be split into three parts:
We start our tutorial by introducing a succinct terminology to characterize reproducibility, based on a review of the various definitions provided in the literature. Specifically, we retain four terms that cover the different levels of reproducibility that can be achieved and/or desired, which we describe hereafter.
In addition, we will introduce notions of provenance using the W7 model, which characterizes provenance along seven dimensions: What, When, Where, How, Who, Which, and Why. We will conclude by presenting the W3C PROV model in detail to demonstrate the critical role of provenance-based solutions in scientific reproducibility.
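To illustrate the PROV model, the following Turtle snippet sketches a minimal PROV-O description of a computational experiment; the `ex:` namespace and the resource names are hypothetical, introduced only for this example. An activity uses an input dataset and generates a result, and both are linked to the responsible researcher.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .    # hypothetical namespace

ex:rawData  a prov:Entity .              # the input dataset

ex:analysis a prov:Activity ;            # the experiment run
    prov:used              ex:rawData ;
    prov:wasAssociatedWith ex:alice ;
    prov:startedAtTime     "2019-05-01T09:00:00Z"^^xsd:dateTime .

ex:result   a prov:Entity ;              # the output of the run
    prov:wasGeneratedBy    ex:analysis ;
    prov:wasDerivedFrom    ex:rawData ;
    prov:wasAttributedTo   ex:alice .

ex:alice    a prov:Agent .               # the responsible researcher
```

Answering questions such as “which inputs and which agent produced this result?” then reduces to traversing these relations.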
In the second part, we present provenance-based solutions that have been developed by the semantic web community to foster the reproducibility of scientific research. Such solutions cover a broad spectrum of uses. For example, provenance can be used to partially automate the annotation, summarization, and reporting of experiment results: it has served as a means for inferring annotations on the artifacts (data) used and generated by experiments, as well as for partially automating the reporting of the results of those experiments.
Provenance can also be used to compare experiments. In particular, it has been used to compare executions of experiments that are instances of the same pipeline (workflow), which makes it possible to check (at least) the repeatability of those experiments. Provenance can also be utilized to debug experiments, by helping users pinpoint the steps (modules) and/or datasets responsible for errors in computational experiments. Finally, provenance has been used for versioning experiments, for reducing the time and resources required for experiment re-execution through smart reruns, and for reusing experiments or parts thereof.
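For instance, assuming the provenance of an experiment is recorded as PROV-O triples, a debugging query of the kind discussed above can be expressed in SPARQL by traversing derivation and usage relations backwards from an erroneous output (the `ex:faultyResult` resource and `ex:` namespace below are hypothetical):

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ex:   <http://example.org/>

# Retrieve every upstream artifact that the faulty result was
# (transitively) derived from, together with the activity (step)
# that generated it.
SELECT ?upstream ?step
WHERE {
  ex:faultyResult prov:wasDerivedFrom+ ?upstream .
  OPTIONAL { ?upstream prov:wasGeneratedBy ?step . }
}
```

The `wasDerivedFrom+` property path follows the derivation chain to any depth, so the query returns the full lineage of the faulty result rather than only its immediate inputs.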
In the tutorial, we will present concrete examples that showcase the above uses of provenance and discuss how they promote scientific reproducibility. As well as covering the use of provenance for facilitating reproducibility, we will also discuss proposals and initiatives that seek to automatically record provenance for reproducibility purposes, e.g., ReproZip and Research Objects.
The last part of the tutorial will map out the state of the art (presented in the second part), highlighting how these solutions cater to specific aspects of reproducibility. We will also discuss the role of provenance in international reproducibility-related efforts, with a particular focus on FAIR, RDA, and DataONE. Finally, we will discuss open issues (opportunities and challenges) that remain to be tackled and the instrumental role that the semantic web community can play in this critical aspect of scientific research.
The three learning outcomes of the tutorial are:
Case Western Reserve University
Satya S. Sahoo is an Associate Professor at Case Western Reserve University (CWRU) in Cleveland, OH, USA. Satya's research focuses on the semantic web, including: (1) ontology engineering (from upper-level reference ontologies to application/domain-specific ontologies), (2) ontology-driven data integration and query optimization, and (3) provenance metadata management. His current research projects include the Provenance for Clinical and Health Research (ProvCaRe) project, which has developed an ontology-driven natural language processing workflow to extract provenance metadata from all 1.6 million full-text articles in the PubMed repository. The ProvCaRe project currently hosts the largest repository of biomedical provenance metadata, consisting of 166 million provenance triples available for query and analysis. Satya served as a member of the W3C working group that defined the PROV provenance standard and is a co-editor of the PROV-O specification. His research has been funded by the US National Institutes of Health (NIH) Big Data to Knowledge Provenance initiative.
University Paris-Dauphine
Khalid Belhajjame is an Associate Professor at the University Paris-Dauphine. Before moving to Paris, he was a researcher for several years at the University of Manchester and, prior to that, a Ph.D. student at the University of Grenoble. His research interests lie in the areas of information and knowledge management. He has made key contributions in the areas of pay-as-you-go data integration, e-Science, scientific workflow management, provenance tracking and exploitation, and semantic web services, and has published over 60 papers on these topics. Most of his research proposals were validated against real-world applications from the fields of astronomy, biodiversity, and the life sciences. He is a member of the editorial boards of the Data in Brief and XMethod Elsevier journals, has participated in multiple European-, French-, and UK-funded projects, and has been an active member of the W3C Provenance Working Group and the NSF-funded DataONE working group on scientific workflows and provenance.