Scientific reproducibility is in crisis in multiple disciplines, including the social sciences, the natural sciences and biomedical research. This crisis has been highlighted by a growing number of community and government initiatives, for example the European Union Open Research Data, the US National Institutes of Health (NIH) “Rigor and Reproducibility” guidelines, Research Data Alliance (RDA), Force11, and DataONE projects. Semantic web technologies, together with provenance, software and method metadata, play a central role in facilitating scientific reproducibility.
The objective of this tutorial is to provide a landscape of the tools, standards and guidelines that an author should follow to make their work (i.e., data, software, methods, provenance and context) reproducible.
The SPSR tutorial will conclude with a discussion of open challenges in scientific reproducibility and the potential role of semantic web research in addressing these challenges.
This 3-hour tutorial will be split into three parts:
We start our tutorial by introducing a succinct terminology to characterize reproducibility based on a review of the various definitions provided in the literature. Specifically, we retained four terms that cover the different levels of reproducibility that can be achieved and/or desired, which we describe hereafter.
In addition, we will introduce notions of provenance using the W7 model, which involves What, When, Where, How, Who, Which and Why. We will conclude this part by presenting the W3C PROV model in detail to demonstrate the critical role of provenance-based solutions in scientific reproducibility.
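To give a flavour of what PROV statements express, the following is a minimal sketch in plain Python rather than an implementation of the W3C PROV model. It encodes a few PROV-DM relations (`used`, `wasGeneratedBy`, `wasAssociatedWith`) as triples and walks them to recover the lineage of a result. All `ex:` identifiers and the `upstream_entities` helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Relation names taken from PROV-DM; everything else here is illustrative.
USED = "used"
WAS_GENERATED_BY = "wasGeneratedBy"
WAS_ASSOCIATED_WITH = "wasAssociatedWith"

# Each statement is a (subject, relation, object) triple.
# The "ex:" identifiers describe a hypothetical two-step pipeline.
statements = [
    ("ex:align", USED, "ex:raw_reads"),
    ("ex:aligned_reads", WAS_GENERATED_BY, "ex:align"),
    ("ex:call_variants", USED, "ex:aligned_reads"),
    ("ex:variants", WAS_GENERATED_BY, "ex:call_variants"),
    ("ex:align", WAS_ASSOCIATED_WITH, "ex:alice"),
]


def upstream_entities(entity, statements):
    """Return every entity that `entity` transitively depends on."""
    # entity -> activity that generated it
    generated_by = {s: o for s, r, o in statements if r == WAS_GENERATED_BY}
    # activity -> set of entities it used
    used = defaultdict(set)
    for s, r, o in statements:
        if r == USED:
            used[s].add(o)

    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        activity = generated_by.get(current)
        if activity is None:
            continue  # a source entity: nothing recorded upstream of it
        for source in used[activity]:
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


lineage = upstream_entities("ex:variants", statements)
print(sorted(lineage))
```

In this sketch, asking for the lineage of `ex:variants` returns both the intermediate and the raw input entities, which is the kind of question a provenance record is meant to answer when checking how a result was obtained.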
In the second part, we present provenance-based solutions that have been developed by the semantic web community for fostering the reproducibility of scientific research. Such solutions cover a large spectrum of uses that facilitate scientific reproducibility. For example, provenance can be used for partially automating the annotation, summarization and reporting of experiment results. Indeed, provenance has been used as a means for inferring annotations on the artifacts (data) used and generated by experiments, as well as for partially automating the reporting of experiment results.
Provenance can also be used for comparing experiments. In particular, provenance has been used to compare the executions of experiments that are instances of the same pipeline (workflow). This can be used to check (at least) the repeatability of experiments. Provenance can also be utilized for debugging experiments by helping users pinpoint the steps (modules) and/or datasets that are responsible for errors in computational experiments. Provenance has also been used for versioning experiments, for reducing the time and resources required for experiment re-execution using smart reruns, and for reusing experiments or parts thereof.
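As a rough illustration of comparing two executions of the same pipeline, the sketch below assumes that each run's provenance trace records content digests of the inputs and outputs of every step. The trace layout, the step names and the `diverging_steps` helper are assumptions made for this example and do not correspond to any specific tool. A step whose inputs match across runs but whose outputs differ is a candidate cause of a repeatability failure.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest used to compare artifacts without storing them."""
    return hashlib.sha256(data).hexdigest()


def diverging_steps(trace_a, trace_b):
    """Steps in both traces whose inputs match but whose outputs differ."""
    result = []
    for step, record_a in trace_a.items():
        record_b = trace_b.get(step)
        if record_b is None:
            continue  # step missing from the second run: not comparable
        same_inputs = record_a["inputs"] == record_b["inputs"]
        same_outputs = record_a["outputs"] == record_b["outputs"]
        if same_inputs and not same_outputs:
            result.append(step)
    return result


# Two hypothetical runs of a two-step pipeline (normalize, then fit).
normalize_record = {
    "inputs": {"raw": digest(b"1,2,3")},
    "outputs": {"norm": digest(b"0.1,0.2,0.3")},
}
run_1 = {
    "normalize": normalize_record,
    "fit": {
        "inputs": {"norm": digest(b"0.1,0.2,0.3")},
        "outputs": {"model": digest(b"w=0.5")},
    },
}
run_2 = {
    "normalize": normalize_record,
    "fit": {
        "inputs": {"norm": digest(b"0.1,0.2,0.3")},
        "outputs": {"model": digest(b"w=0.7")},  # same input, different model
    },
}

print(diverging_steps(run_1, run_2))
```

Here the `fit` step is flagged because it received identical input in both runs yet produced a different model, which narrows the debugging effort to that step.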
In the tutorial, we will present concrete examples that showcase the above uses of provenance and discuss how they promote scientific reproducibility. In addition to presenting the use of provenance for facilitating reproducibility, we will discuss proposals and initiatives that seek to automatically record provenance for reproducibility purposes, e.g., ReproZip and ResearchObjects.
The last part of the tutorial will focus on drawing a map of the state of the art (presented in the second part), highlighting how these solutions cater for specific aspects of reproducibility. We will also discuss the role of provenance in international reproducibility-related efforts, with a particular focus on FAIR, RDA and DataONE. Finally, we will discuss open issues (opportunities and challenges) that need to be tackled and the instrumental role that the semantic web community can play in this critical aspect of scientific research.
The three learning outcomes of the tutorial are: