Over the last few decades, many scientific fields such as biomedicine or climate research have been confronted with vast and continuously growing amounts of data. The grand challenge, however, is no longer gathering data but analyzing it: both the sheer amount of data and its complexity pose significant problems. Local solutions are no longer feasible, and large-scale experiments are carried out on powerful server infrastructure as scientific workflows consisting of data transformation and analysis operations. Running such workflows can take hours, days, or even weeks. Misconfiguration, erroneous scripts, and non-converging operations are highly problematic in this respect, as re-running a workflow is costly in both time and money. Moreover, these workflows are created, administered, and changed by potentially large and spatially distributed consortia of researchers. Due to this complexity, it becomes increasingly hard to gain an overview of all processing steps involved and to trace who changed what, where, and with which effects on (intermediate) results. In many contexts, reproducibility of results generated by complex scientific workflows is crucial. Yet a recent study found that the findings of almost 90% of over 50 cancer genomics studies could not be confirmed. Developing novel approaches that provide traceability and reproducibility is therefore of utmost importance.
The key to traceability and reproducibility lies in collecting information about the processed data, the applied operations, and their parameters over time. Modern scientific workflow tools provide analytical provenance, but are mostly restricted to scenarios where a single static input dataset results in a single output dataset. With changes occurring at the level of the input data, the workflow itself, and its parameterization, it is hard and tedious, if possible at all, to determine with current technology which changes actually caused variations in the output.
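At its core, such provenance collection amounts to recording, for every executed workflow step, which input data was consumed, which operation and parameters were applied, and when. The following Python sketch illustrates this idea only; all names are hypothetical and it is not part of the project's tooling:

```python
import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """One entry per executed workflow step."""
    operation: str                 # name of the applied operation
    parameters: dict               # parameter values used for this run
    input_digest: str              # fingerprint of the consumed input data
    timestamp: float = field(default_factory=time.time)

def digest(data: bytes) -> str:
    """Content fingerprint, so identical inputs are recognizable across runs."""
    return hashlib.sha256(data).hexdigest()

def record_step(log: list, operation: str, parameters: dict, data: bytes) -> None:
    """Append a provenance record for one workflow step to the log."""
    log.append(ProvenanceRecord(operation, parameters, digest(data)))

# Two runs of the same operation on the same data, with one parameter changed:
log = []
record_step(log, "normalize", {"method": "zscore"}, b"sample data")
record_step(log, "normalize", {"method": "minmax"}, b"sample data")

same_input = log[0].input_digest == log[1].input_digest
param_changed = log[0].parameters != log[1].parameters
print(same_input, param_changed)  # True True
```

Because both the input fingerprint and the parameters are logged, a variation in the output of the second run can be attributed to the parameter change rather than to a change in the data, which is exactly the kind of question current workflow tools make hard to answer.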
The primary goal of our project is to realize provenance at all levels, allowing analysts to gain a deeper understanding of the workflow, changes applied to it, and how they influence the results. This will be achieved by developing a visual forensic tool for scientific workflows, which includes novel visual analysis methods that allow for a scalable visualization of the workflow and its changes, a visual comparison of complex data structures, and novel change metrics needed to quantify changes in complex data structures.
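To give a flavor of what a change metric for complex data structures might quantify, the following sketch computes the fraction of differing leaf values between two versions of a nested result structure. This is a deliberately simplistic stand-in, not one of the project's actual metrics, and all names and the example data are invented:

```python
def leaves(obj, path=()):
    """Flatten a nested dict/list structure into (path, value) leaf pairs."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaves(value, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from leaves(value, path + (index,))
    else:
        yield path, obj

def change_ratio(old, new):
    """Fraction of leaf paths that were added, removed, or modified."""
    a, b = dict(leaves(old)), dict(leaves(new))
    all_paths = set(a) | set(b)
    differing = sum(1 for p in all_paths if a.get(p) != b.get(p))
    return differing / len(all_paths) if all_paths else 0.0

# Hypothetical intermediate results from two runs of an analysis step:
old = {"genes": [{"id": "A", "score": 0.9}, {"id": "B", "score": 0.4}]}
new = {"genes": [{"id": "A", "score": 0.9}, {"id": "B", "score": 0.7}]}
print(change_ratio(old, new))  # 0.25: one of four leaf values changed
```

A scalar like this can drive a visual comparison, for instance by coloring workflow nodes by how strongly their intermediate results diverged between runs; the metrics the project targets would need to handle far richer data structures than nested dictionaries.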
The methods we develop will help address the issue of reproducibility in published results, which has plagued many scientific communities. Investigators can use our methods to make all or parts of their analyses public, traceable, and reproducible. The provenance visualization and query tools will make it straightforward for scientists to offer a comprehensive description of the analyses performed to obtain their results.
The project (P 27975-NBL) is funded by the Austrian Science Fund (FWF).