In a study, Sibylle Hermann, Data and Software Steward at SimTech, found that although scientists document data, they often do not know exactly what they have to document and how. In this interview, she talks about why research data management is so important, which issues are relevant, and why many legal aspects still need to be clarified. The interview has been edited for clarity and length.
Scientific findings can be found in research papers. So why do we need research data management?
When you publish a paper and describe the results, the data, software, and scripts that were used are not part of it. You might get them on request. Research data management means that the underlying data is also published, so that research financed by public funds can be reused. There have been cases where scientists wanted to understand what colleagues had done and then realized that the result was not correct. To comply with good scientific practice, it is important that results can be reproduced. Until a few years ago, research data only had to be archived for ten years, in any place and in any format. This requirement has now been extended to include the documentation and publication of data and software. Without that, almost no result found in a scientific paper would be reproducible.
Research results cannot always be made publicly available. Sometimes there are legal reasons against this, for example, copyright or confidentiality obligations. However, if the results are made publicly available, then the underlying research data, materials and information, the methods applied, and the software used should also be made available. The work processes should be comprehensively documented, including the source code of software. This is stated in Guideline 13 of the "Guidelines for Safeguarding Good Scientific Practice" of the German Research Foundation (DFG). All higher education institutions and scientific research institutions must have implemented this code in a legally binding manner by July 31, 2023, if they wish to continue receiving funding from the DFG.
That sounds alarming. Why are so many results not reproducible?
Methods and findings are described in a scientific paper, but the data itself is not published. In the case of a simulation, however, you have to be able to re-implement what someone did before. The idea of reproducibility is that you can download the data, receive it on request, or find it published in full. Reproducibility then means that you can use the data to run the computation again and check whether you arrive at the same result. At SimTech, we go one step further with the concept of reusability: the aim is that partial knowledge, too, can be reused, so that you can compute with it in your own environment.
The GO FAIR initiative is committed to ensuring that research data is processed in such a way that it can be reused by humans and machines. To this end, the data must be findable, accessible, interoperable, and reusable. Restrictions may apply, for example, due to patent applications or software licenses.
So reproducibility means I just reproduce what someone else has done. With reusability, you want to be able to apply to your own research what someone else did before.
And how do you get this data?
More and more research data is now available in digital form. The idea is to describe this research data in such a way that it can be found in the first place. This is the first of the four FAIR principles and concerns infrastructure, archives, and repositories. But then the question is: how can I publish the data so that it can be found and also reused, both technically and legally? This is an aspect of research data management, and good scientific practice actually requires it.
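Describing data "so that it can be found" in practice means attaching a structured metadata record. As a minimal sketch, loosely modeled on the mandatory fields of the DataCite metadata schema, such a record might look like this in Python; all names and values here are hypothetical placeholders:

    # Minimal descriptive metadata for a dataset, loosely following the
    # mandatory fields of the DataCite schema. All values are placeholders.
    dataset_metadata = {
        "identifier": {"identifier": "10.1234/example", "identifierType": "DOI"},
        "creators": [{"name": "Doe, Jane"}],
        "titles": [{"title": "Example simulation dataset"}],
        "publisher": "Example University",
        "publicationYear": "2023",
        "resourceType": {"resourceTypeGeneral": "Dataset"},
        # Optional, but crucial for legal reusability:
        "rightsList": [{
            "rights": "CC BY 4.0",
            "rightsURI": "https://creativecommons.org/licenses/by/4.0/",
        }],
    }

A repository indexes exactly these fields, which is what makes a dataset findable by humans and machines alike.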
Does this mean that anyone can use the data freely?
If it was published under a free license, anyone can use it. You have to understand it, of course, that's the basic requirement (laughs). At SimTech in particular, the use cases are very specific. It's not as if you can take a hugely complex simulation model and, as a layperson, understand what was done there. This data is intended more for scientific purposes. But sometimes there are other uses that scientists themselves don't even see at first.
The idea behind this is also that research is paid for with taxpayers' money and, therefore, the results should be available to everyone, while barriers such as possible confidentiality obligations must be taken into account. Especially in projects financed by public funds, for example with DFG funding, it's a great advantage that you don't have to start from scratch every time: new research can build on old research, and the data can be reused.
How does copyright apply if the data is used for commercial purposes?
This depends on the underlying license agreement for this data. There are Creative Commons licenses that are supposed to simplify and standardize everything. The DFG, for example, requires the CC BY license for data.
Creative Commons (CC) is a non-profit organization that offers ready-made license agreements. They are intended to help authors release legally protected content and indicate the conditions under which works or data may be reused. The agreements range from simple licenses such as CC BY, which only requires attribution of the author, to licenses that prohibit derivative works (ND) or commercial use (NC).
And what about software?
For software, there are completely different licenses. With a CC BY-NC (non-commercial) license, for example, you are not allowed to use the data in industry. There are also licenses that are even more restrictive. Alternatively, you can publish your data on request only. In this case, you can decide for yourself and, for example, pass data on only to scientists and not to industry. Sometimes, on the other hand, you have collaborations with industrial companies and don't want everything to be publicly reusable for reasons of industrial property rights.
Software, for example, is often under copyleft licenses. These allow editing or further development, but only under the original license. This prevents commercial use from imposing restrictions on downstream users. Software makes the situation even more complex because a program is usually built from many different software packages. Before you can assign a license to the whole construct, you first have to check that the licenses of all these components are compatible with one another.
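A first practical step is simply taking stock of what licenses the components declare. The following is a minimal sketch in Python using only the standard library; it lists the license metadata of every installed package, which is the raw input for the actual compatibility check. Judging whether the licenses are compatible still requires human, and often legal, review:

    # Inventory of the licenses declared by installed Python packages.
    # This only collects declared metadata; it does not decide compatibility.
    from importlib.metadata import distributions

    def license_inventory():
        inventory = {}
        for dist in distributions():
            name = dist.metadata.get("Name", "unknown")
            # Packages declare licenses in the License field or as classifiers.
            license_field = dist.metadata.get("License") or ""
            classifiers = [
                c for c in (dist.metadata.get_all("Classifier") or [])
                if c.startswith("License ::")
            ]
            inventory[name] = license_field or "; ".join(classifiers) or "undeclared"
        return inventory

    if __name__ == "__main__":
        for package, info in sorted(license_inventory().items()):
            print(f"{package}: {info}")

A copyleft entry such as the GPL in this list signals that the combined work may have to be released under that same license, which is exactly the compatibility question raised above.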
Who then owns the data created with a particular software?
Actually, in research data management, the focus is always on the threshold of originality: the data must contain a certain degree of creativity and originality in order to be protected by copyright at all. If you write software and it spits out some data, then you have no copyright on this data. And if you use software from a company and generate data with it, then the question is: who owns this data? This is still unclear and very difficult to answer. It also depends on the role in which the data was generated or the software was written. Does it belong to you, the university, or the head of the institute? This is very complex from a legal point of view, and there is still a lot to be clarified.
You could say: I'll give it a try as long as nobody complains. This is actually an approach that people sometimes take. I just had such a case in the field of digital humanities. It involved copyright-protected works that had been annotated by hand at great expense. These annotations can then be used for machine learning. They are ideal candidates for reuse, but there is legal uncertainty, as the underlying work is protected by copyright and the annotations only make sense in conjunction with the text. It can then happen that scientists don't do any research on these works at all because the legal clarification is so time-consuming. If you can't use the data afterwards, the research is not worthwhile.
We investigated this in a project. There was one approach from us and one from the University of Trier on how to deal with data mining on such copyrighted works. Then the law changed in the middle of the project, and everything became even more complex. Under the old copyright law, memory institutions were allowed to archive this data. That passage has since been deleted, and it is unclear who is allowed to archive it. The researcher is allowed to, but the lawyers are still arguing about where and how the data can be archived and made available.
What does this mean for working with data for simulations?
At SimTech, we work with data-integrated simulations, which means that a lot of experimental data is processed. We don't just produce data as output; we also use a lot of data as input for the simulation. That is what makes it interesting. For machine learning, for example, you need a lot of data to compute with. But someone has to provide it first. For a crash simulation, for example, we used data from a university in Washington. The scientists there measured the geometry of a car and published all these data points after years of work. We were only able to perform the crash simulations on the basis of this information. That's how you have to imagine it.
So what is special about research data management at SimTech?
The special thing is that our team also conducts research itself and doesn't just provide infrastructure. The other thing is that we develop infrastructure specifically geared to the needs of the scientists, for example, in joint research projects. We have therefore issued a policy throughout the University of Stuttgart stating that research data management begins with the research itself, not only at publication. This means that the data is documented as it is created. We have several projects where we offer the option of documenting the workflow, for example with EnzymeML.
What’s also special is that, using the infrastructures of the University Library and the IT Department, we actually curate the data that is published in our repository. Scientists who upload the data will receive feedback from us and can then improve their data.
Persistent identifiers are like ID cards for digital data and other digital objects, as well as for persons. The objects can be uniquely identified and retrieved using the numbers and/or alphanumeric characters assigned to them. For data and digital objects, this is, for example, the DOI (Digital Object Identifier); for persons, it is the ORCID iD.
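To illustrate what "retrievable" means in practice: a DOI can be resolved by machines as well as by humans. As a small sketch, assuming the third-party requests library is installed, the following asks the doi.org resolver for machine-readable citation metadata via HTTP content negotiation, using the DOI of the paper listed under "Read more" below:

    # Resolve a DOI to machine-readable citation metadata (CSL JSON)
    # via HTTP content negotiation against the doi.org resolver.
    import requests

    DOI = "10.1038/s41598-022-10376-9"

    response = requests.get(
        f"https://doi.org/{DOI}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    response.raise_for_status()
    record = response.json()

    # Citation metadata, independent of where the publisher hosts the article.
    print(record["title"])
    print([author.get("family") for author in record.get("author", [])])

Because the identifier is persistent, this lookup keeps working even if the article's URL at the publisher changes.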
We also enrich the data. If you publish on a free platform, you only have to enter a basic set of metadata and are provided with a persistent identifier. However, we want to delve deeper into reusability and, as a first step, ask the scientists for a content review. My team colleague Jan Range has developed the Easy Review Tool for this purpose. The workflow is then as follows: you submit the data, an admin looks at it before it is released, and it is also formally curated.
This is not yet happening on a large scale across the entire university, but we are already doing it in SimTech. After our release step, the data is formally curated in the University Library. In the future, this will become increasingly automated; we are working on this and developing further tools. We have already published many data sets in SimTech, and our aim is to publish the data for every scientific paper, in the best case with a free license.
What would be your vision for research data management?
My vision would be to go toward open simulation. The aim is not only to publish research software and results, but also to make the methods used and thus our research more transparent. We want to achieve sustainable, fully reproducible research.
Read more
Hermann, Sibylle; Fehr, Jörg. Documenting research software in engineering science. Scientific Reports, vol. 12, no. 1, 2022, article 6567. https://doi.org/10.1038/s41598-022-10376-9
About the scientist
Sibylle Hermann is a graduate engineer and studied engineering sciences at the University of Stuttgart. She has been working as a research data manager at the Stuttgart University Library since 2015 and, as a software and data steward, has also been responsible for the quality and provision of research data within the SimTech Cluster of Excellence since 2019. In her own research and her dissertation, she deals with the documentation of data and software in the engineering sciences. In particular, she examines how the images in scientific papers are created and what data underlies them, and she addresses questions of philosophy of science, for example, how knowledge can be generated from simulations.