At the 2025 Dataverse Community Meeting in Chapel Hill, Jan Range – doctoral researcher and member of the Management Team – presented several key developments at the intersection of research and data infrastructure. Drawing on his dual perspective as both researcher and infrastructure developer, he addressed challenges around metadata automation, scalable data transfer, and real-time dataset integration – topics highly relevant to research data management (RDM) in simulation science.
pyDataverse: Community Session and Strategic Refactor
Range led a two-hour session focused on the current state and future direction of pyDataverse, the Python interface for interacting with Dataverse repositories. The session included a technical walkthrough of pyDataverse’s existing capabilities and highlighted practical integration points with automated workflows.
More significantly, Range diagnosed the limitations of the current implementation and initiated a complete refactor of the library. The goals include:
- Modularizing the codebase for better maintainability.
- Enabling support for upcoming Dataverse features.
- Making the library more extensible for domain-specific use cases, including simulation data pipelines.
This initiative received broad support from the community, and a working group has been formed to co-develop and test the refactored version over the coming months.
EasyDataverse: LLM Interface for Metadata Ingestion
Range also introduced EasyDataverse, a novel interface powered by large language models. It allows researchers to ingest unstructured content – PDFs, handwritten notes, images – directly into Dataverse. The system extracts relevant metadata and maps it to the target schema automatically, producing high-quality, schema-compliant dataset documentation with minimal manual effort.
This is particularly relevant for research environments where:
- Metadata is often incomplete or inconsistently recorded.
- Scientific content exists in mixed formats (handwritten notes, experimental plots, image-based documentation).
- Deposits must comply with curation or publication standards, but the resources for manual review are lacking.
The interface also supports automated pre-review, offering feedback on missing fields or guideline violations before formal submission. This reduces manual validation overhead and supports FAIR-compliant data publication.
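The idea of a pre-review can be illustrated with a minimal sketch. It is not EasyDataverse's implementation: the function name `pre_review` and the flat dictionary layout are hypothetical, and the list of required fields assumes the citation fields that a default Dataverse installation marks as mandatory. A real installation may require a different set.

```python
from typing import Any

# Assumption: citation fields required by a default Dataverse installation.
# Individual installations can configure a different set.
REQUIRED_FIELDS = ("title", "author", "datasetContact", "dsDescription", "subject")


def pre_review(metadata: dict[str, Any]) -> list[str]:
    """Return a list of findings; an empty list means nothing was flagged.

    `metadata` is a hypothetical flat mapping from field name to value,
    for example the output of an extraction step.
    """
    findings: list[str] = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if isinstance(value, str):
            value = value.strip()  # whitespace-only strings count as empty
        # None, "", [] and {} all count as "not provided".
        if value is None or value == "" or value == [] or value == {}:
            findings.append(f"missing required field: {field}")
    return findings
```

Calling `pre_review({"title": "Channel flow DNS", "author": ["Doe, Jane"], "subject": []})` would report the contact, the description, and the empty subject list as missing, which is the kind of feedback a depositor could act on before formal submission.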
Rust-Dataverse: High-Performance Uploads for Large Datasets
As a third development, Range introduced Rust-Dataverse, a new high-performance library for data uploads. Written in Rust and callable from Python, the library targets large-scale, fault-tolerant data transfers, which are central to simulation-heavy domains like those at SimTech.
Key features include:
- Resumable uploads, with persistent state tracking for interrupted transfers.
- Parallelized data ingestion, improving throughput for large datasets.
- Strict type safety, enhancing robustness in production environments.
The library is designed to complement pyDataverse, enabling hybrid workflows where Python-based tools initiate and manage data transfer via Rust for maximum performance.
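How resumable uploads with persistent state tracking can work is sketched below in plain Python. This is an illustration of the general technique only and does not reflect the Rust-Dataverse API. The function `resumable_transfer`, the JSON state file, and the `send_chunk` callback (standing in for the actual network call) are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def resumable_transfer(
    source: Path,
    state_file: Path,
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = 1024 * 1024,
) -> int:
    """Send `source` in chunks, resuming at the offset stored in `state_file`.

    `send_chunk(offset, data)` is a hypothetical callback that performs the
    actual transfer and raises on failure. Returns the bytes sent by this call.
    """
    offset = 0
    if state_file.exists():
        # A previous run was interrupted: continue where it stopped.
        offset = json.loads(state_file.read_text())["offset"]

    sent = 0
    with source.open("rb") as handle:
        handle.seek(offset)
        while chunk := handle.read(chunk_size):
            send_chunk(offset, chunk)  # may raise; stored offset stays valid
            offset += len(chunk)
            sent += len(chunk)
            # Persist progress only after the chunk was accepted.
            state_file.write_text(json.dumps({"offset": offset}))

    # Transfer complete: the state file is no longer needed.
    state_file.unlink(missing_ok=True)
    return sent
```

If `send_chunk` raises halfway through, the state file still points at the last confirmed offset, so calling the function again with the same arguments sends only the remaining bytes. A production library would additionally verify checksums and could transfer several chunks in parallel.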
Real-Time Data Streaming: Towards Simulation-to-Repository Workflows
One of the most forward-looking parts of Range’s contribution focused on real-time data streaming from sensors and simulation outputs into Dataverse. The concept involves Rust-based socket servers that can accept incoming data via UNIX or TCP sockets, with optional TLS encryption.
This design allows data sources written in any language with socket support – including Fortran, C++, or Python – to stream data directly to Dataverse repositories without needing a domain-specific ingestion library.
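From the perspective of a data source, such a setup needs little more than a socket and an agreed framing. The sketch below assumes a simple length-prefixed framing (a 4-byte big-endian length followed by the payload). The actual wire format of the planned socket servers is not described here, so the framing and the helper names `send_frame`, `recv_exact`, and `recv_frame` are hypothetical.

```python
import socket
import struct


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send one frame: 4-byte big-endian length, then the payload.

    Assumption: length-prefixed framing; the real protocol may differ.
    """
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly `n` bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed the connection mid-frame")
        buf.extend(part)
    return bytes(buf)


def recv_frame(sock: socket.socket) -> bytes:
    """Read one frame as written by send_frame (the receiving side)."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

A client would obtain `sock` from `socket.create_connection((host, port))` for TCP, or from `socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)` followed by `connect(path)` for a UNIX socket. For TLS, the TCP socket can be wrapped with `ssl.create_default_context().wrap_socket(sock, server_hostname=host)`. The same few lines are straightforward to reproduce in Fortran or C++, which is the point of a socket-based design.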
A concrete example is the planned integration with FLEXI, a Fortran-based CFD simulation framework. In the proposed setup, FLEXI would transmit simulation results directly to Dataverse as they are generated, bypassing intermediate storage and manual uploads.
A first live proof-of-concept demonstration is scheduled for the Dataverse Community Call in August 2025, under the title "Streaming Sensoric Data and Simulations to Dataverse."
Implications for Research Data Management at SimTech
Range’s contributions reflect a clear alignment between SimTech’s research demands and its infrastructure goals. His dual role enables direct feedback loops: real-world research requirements inform software development, while infrastructure strategy is shaped by actual scientific workflows.
This ensures that tools like EasyDataverse and Rust-Dataverse are not only technically sound but also aligned with how simulation scientists work – locally, at scale, and under real-world constraints.