1 Signatories

1.1 Project team

Martin Schobben, Department of Geodesy and Geoinformation, Technical University of Vienna, Austria
Ton Smeele, Online Research Labs, Data Management, Utrecht University, Netherlands
Terrell Russell, iRODS Consortium, Chapel Hill, United States of America

A detailed account of the individual team members’ interests and skills can be found in Section 5.1.

2 The Problem

One of the biggest problems in research is the inadvertent destruction of data and the inaccessibility of data due to poor labeling and description. This loss of data means that studies cannot be replicated, combined or re-used in different settings (Briney 2015; Wilkinson et al. 2016). Data management becomes more important, but also more challenging, in the age of rapid digital data production. The Integrated Rule-Oriented Data System (iRODS) (Rajasekar et al. 2010, 2015) is an open-source data management software suite that offers a solution to this demand and encompasses the whole data life cycle, from data generation through storage to re-use. The loosely constructed and highly configurable architecture of iRODS frees the user from strict formatting constraints and single-vendor solutions. Furthermore, the open-source and community-supported nature of iRODS safeguards the longevity of data storage and the re-usability of data, independently of future technological innovations.

Nonetheless, the learning curve for implementing iRODS effectively for day-to-day data management can be steep for the average scientist. Hence there have been multiple initiatives to lower this barrier by hiding parts of the behind-the-scenes business logic, which requires considerable knowledge of command line tools. A notable example is the YODA front-end developed at Utrecht University (UU), which simplifies the tasks scientists must perform to meet their data management goals. However, it still requires some effort to integrate this data representation with workflows for the analysis of data. Higher-level programming languages such as R are highly popular among academics and can help construct such data analysis workflows. Currently, there is no straightforward solution to integrate iRODS data storage with R workflows. The development of the R package rirods (Chytracek et al. 2015), built on top of the iRODS C++ API, has stalled, and the package suffers from strict system requirements, which makes it unsuitable as an easily installable and distributable solution. This, furthermore, hinders the publication of rirods on the Comprehensive R Archive Network (CRAN; the official R package archive), as interoperability among operating systems is mandatory for an R package to be eligible for publication.

3 The proposal

3.1 Overview

Recently, a REpresentational State Transfer (REST) API has been built for iRODS. REST architectures allow services to be connected at global scale (Fielding 2000), as they enable clients (users) to interact with their data via a web-based architecture using the common HTTP verbs. HTTP is an interaction protocol for the Web, in which verbs like GET, PUT, POST and DELETE act on a resource representation identified by a Uniform Resource Identifier (URI). This allows for a loose coupling between the data management system (iRODS) and the user, which means that the consuming application does not have to be built on the same system, or even on the same operating system. As with data stored on iRODS, using the web also means that the representation of the resource does not need to conform to a stringent format, and can take any of several formats (e.g., text, XML, and/or JSON). This user-friendliness and loose coupling allow for a radically new approach to binding iRODS to R. We therefore propose to build a new R package for iRODS, rirods, which replaces the former package and allows for a truly distributed computing solution by using the iRODS RESTful API. We foresee that this R package will allow for an integrated R workflow and iRODS data management solution that will aid open practices in, e.g., science.
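To make the idea of HTTP verbs acting on a URI concrete, the following is a minimal sketch of how a request could be composed in R with httr2. The base URL, the endpoint name "list", the query parameter "path" and the collection path are assumptions chosen for illustration; they are not taken from the iRODS REST API specification. The request is only built and inspected, so no server is contacted.

```r
library(httr2)

# Hypothetical base URL of an iRODS REST service (an assumption).
base_url <- "https://irods.example.org/irods-rest"

# Compose a GET request for a collection listing without sending it.
# The endpoint "list" and the parameter "path" are illustrative assumptions.
req <- request(base_url) |>
  req_url_path_append("list") |>
  req_url_query(path = "/tempZone/home/alice") |>
  req_method("GET")

# Inspect the verb and the URI that would be used.
req$method
req$url
```

Because the client only needs a URI and an HTTP verb, the same request could be issued from any operating system on which R and libcurl are available.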

3.2 Detail

For the R package to communicate with the REST API of iRODS, we make use of the libcurl library, which facilitates the transfer of data over HTTP. Conveniently, R wrapper packages around libcurl already exist, notably curl (Ooms 2023) and httr2 (Wickham 2022). This makes it possible to construct an R solution with minimal dependencies that supports, e.g., secure connections through HTTP with TLS and parallel requests (async multi-download functionality). On the side of the iRODS server, the iRODS C++ REST Mid-Tier API will handle the requests and translate them into sensible representations similar to iRODS’ native iCommands. Authentication occurs through an end-point that queries the native authentication module of iRODS, which then returns a JSON Web Token that allows connecting with a remote iRODS server. Future versions will also implement support for the PAM stack, which will greatly increase the configurability of the authentication process.
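The token-based authentication flow described above could be sketched with httr2 roughly as follows. The function names irods_auth_request() and irods_with_token(), the endpoint name "auth", the use of HTTP basic authentication to obtain the token, and the Bearer prefix on later requests are all assumptions made for this illustration; the actual package and server may differ.

```r
library(httr2)

# Build the request that asks a hypothetical "auth" end-point for a token.
# Endpoint name and basic authentication are assumptions.
irods_auth_request <- function(base_url, user, password) {
  request(base_url) |>
    req_url_path_append("auth") |>
    req_method("POST") |>
    req_auth_basic(user, password)
}

# Attach a previously obtained token to any later request.
# Sending the token as a Bearer token is an assumption.
irods_with_token <- function(req, token) {
  req_auth_bearer_token(req, token)
}

# Not run here, because it needs a live server:
# token <- irods_auth_request("https://irods.example.org/irods-rest",
#                             "alice", "secret") |>
#   req_perform() |>
#   resp_body_string()
```

Keeping the token handling in one small helper means that every other function of the package can stay unaware of how the credentials were obtained.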

Development: A local iRODS server will be built for convenience during development and testing of the package. For this purpose the irods_demo Docker setup will be used and run on localhost. In addition, Utrecht University will provide a Virtual Machine with an iRODS server to more faithfully mimic the requirements of querying a remote server.

Unit testing: To enable publication of rirods on CRAN, the unit tests should be executable without an internet connection. To achieve this, a mocking technique that imitates HTTP responses will be used (httptest2; Richardson 2022), in which snapshots of the “real” API’s responses are saved for future comparison of test results. The generation of mock snapshots will be automated by integration in a GitHub Workflow, so that changes in the REST API’s responses can be accommodated whilst performing unit tests in future stages of rirods development.
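As a minimal sketch of testing without an internet connection, the example below uses without_internet() and expect_GET() from httptest2 to check that a request has the intended verb and URL without contacting a server. This is a simpler relative of the snapshot approach described above: recording and replaying real responses would instead rely on httptest2's with_mock_dir(). The base URL and the endpoint name "list" are assumptions for illustration.

```r
library(httr2)
library(httptest2)
library(testthat)

# Hypothetical base URL and endpoint name (assumptions for illustration).
req <- request("https://irods.example.org/irods-rest") |>
  req_url_path_append("list")

test_that("a collection listing is requested with GET", {
  # without_internet() blocks real traffic; an attempted request raises an
  # error naming the HTTP verb and URL, which expect_GET() then checks.
  without_internet({
    expect_GET(
      req_perform(req),
      "https://irods.example.org/irods-rest/list"
    )
  })
})
```

Tests of this kind run identically on a developer laptop, in a GitHub Workflow and on the CRAN check machines, because none of them depends on a reachable iRODS server.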

Documentation: Technical documentation will be streamlined with roxygen2 (Wickham, Danenberg, et al. 2022), which produces manual (man) pages for the newly developed functions that can be queried from the R console. In addition, long-format documentation is included in the form of so-called vignettes, which can also be consulted from the R console. The function manual pages and vignettes will furthermore be published as a pkgdown website (Wickham, Hesselberth, and Salmon 2022), hosted with GitHub Pages and automatically updated with a designated GitHub Workflow.
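For illustration, roxygen2 documentation is written as specially marked comments directly above a function, from which the manual page is generated. The helper home_collection() below is hypothetical and exists only to show the comment format; it is not a planned function of rirods.

```r
#' Build the logical path of a user's home collection
#'
#' Hypothetical helper, shown only to illustrate roxygen2 comments.
#'
#' @param zone Name of the iRODS zone, e.g. `"tempZone"`.
#' @param user Name of the iRODS user.
#' @return A character string with the logical path.
#' @examples
#' home_collection("tempZone", "alice")
#' @export
home_collection <- function(zone, user) {
  paste0("/", zone, "/home/", user)
}
```

Once the package is installed, the generated page could be consulted from the R console with ?home_collection.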

4 Project plan

4.1 Start-up phase

The development of rirods will lower the technical barriers to implementing iRODS data management capabilities as an integral part of R data analysis workflows. Thus rirods is meant to be conducive to innovation and to foster inclusiveness through open practices in, e.g., science. In turn, R package development encompasses soliciting domain-specific use-cases to optimize the design. A close collaboration with the department of Online Research Labs, Data Management at Utrecht University (UU), the Department of Geodesy and Geoinformation at the Technical University of Vienna, the National Institute for Public Health and the Environment of the Netherlands, and Wageningen University & Research is already foreseen. Nonetheless, the development will benefit from having a good overview of the types of data and data models produced by analytical equipment in a range of academic, governmental and industrial settings. We will opt for an MIT license and a code of conduct that follows the Contributor Covenant guidelines. Combined, this ensures that contributions to the package can be made in a safe, inclusive, welcoming, and harassment-free environment conducive to collaborative package development, and it ensures down-stream re-use of the developed software. Reporting the progress of the project to both users and developers will help ensure that we stay on track and thus develop a solution that can be broadly implemented in the future.

4.2 Technical delivery

The duration of the project will be 8 months. Each phase ends with a “deliverable”, which gives a convenient measure of the project’s progress.

Months 1–3

  • Start with basic package set-up with devtools (Wickham, Hester, et al. 2022).
  • Follow best practices from the start of package development, e.g., documenting progress, maintaining a functioning Git main branch and using development branches for experimental updates. The code will be published on GitHub from the start, and tags will be created when milestones are hit to aid progress tracking. In addition, unit tests will be developed continuously to ensure that the behavior of each function is, and remains, correct; the code coverage of these tests will be checked regularly. Lastly, we will test code, portability, and documentation with R CMD check and with continuous integration provided by GitHub Workflows.
  • Deliverable: A GitHub repo with the basis of the package.

Months 4–6

  • Implement minimal functionality to assess the integration into existing data processing pipelines.
  • Deliverable: A tag in the GitHub repo annotating the milestone of a functioning solution.

Months 7–9

  • Consult user responses to the earlier version (e.g., GitHub Issues) and adapt rirods accordingly.
  • Publish on CRAN.
  • Develop teaching/course materials.
  • Deliverable: Installable package on CRAN with documentation as vignettes and website with pkgdown (Wickham, Hesselberth, and Salmon 2022).

Months 10–12

  • Present package at conference(s) targeting users (natural science conference) and/or developers (R or open science conference).
  • Give workshop to inform users about rirods functionality and integration in data processing pipelines.
  • Deliverable: Video material of conference(s) and workshops.

4.3 Other aspects

We will draw attention to the impact of data management practices on open science, and to our proposed solutions to mitigate common problems, through several channels. We will actively promote the package by presenting our findings at conferences aimed at users (science conferences) and/or at the developer community (e.g., the iRODS general user meeting). In addition, we intend to develop course material and give (online) workshops to instruct users on the usage of the R package.

5 Requirements

The realisation of this package requires a collaborative environment that includes the potential users, with their specific requirements for processing analytical data, as well as developers and data scientists with expertise in a range of disciplines. With regard to development, we have brought together a multi-disciplinary team and consulted experts in data management and R software packaging.

5.1 People

The project team will try to form a comprehensive picture of the current state of data management practices in laboratories through direct interaction with representatives from academia, the public sector and industry. In addition, the team takes charge of all steps of development, documentation and outreach of the package. Dedicated consultants have been contacted, and their expertise is regarded as an essential aid for successful deployment of the plan.

The project lead (MS) is an Earth scientist with 10 years of experience in academic research, and he has worked in several analytical laboratory facilities (MfN Berlin, University of Leeds, and Utrecht University). He also has a solid basis in data analysis and programming with R, and has started developing packages for analysing isotope chemical data (see point). Teaching and helping others to code R solutions is another of his passions, as shown by the workshops he has developed and by his founding of an R help desk at the UU (uu-code-club).

TR oversees the iRODS development team and handles code review, package management, documentation, and high level architecture design. He’s interested in distributed systems, metadata, security, and open source software that accelerates science. TR holds a Ph.D. in Information Science from the University of North Carolina at Chapel Hill and has been working on iRODS since 2008. In his current role, he also provides management and oversight of the iRODS Consortium.

5.2 Processes

A key driver in the initiation of the project is the early publication of the R package on GitHub. Feedback on these early developments is sought actively from our dedicated list of consultants, and also from the community at large through Twitter and other channels. Throughout this process, we will make sure that the code of conduct, as outlined in Section 4.1, is adhered to.

5.3 Tools & Tech

For successful delivery of the package we need a suitable platform for development and testing. This will be provided by a Docker container as well as a Virtual Machine run on a UU server. GitHub is essential for the collaborative character of the work.

5.4 Funding

We request funding for personnel costs for Martin Schobben (0.2 fte) to develop, test and roll out the rirods package.

Table 5.1: Itemized budget.
Item                           Currency   Price
Personnel cost (M Schobben)               20,000
Total                                     20,000

5.5 Summary

Support is requested for the development, documentation and outreach of the rirods package.

6 Success

6.1 Definition of done

The deliverable “Installable package on CRAN” of Section 4.2 defines achievement of the minimum viable product (Solution 1; Section 3.2).

6.2 Measuring success

Success during the development phase is measured by the number of contributions and the number of research institutes, laboratories and universities that we can engage with. The success of the developed package is measured by its use, through download statistics, and, for development purposes, by tracking how many packages integrate this package.

6.3 Future work

The package will enable the consultation and implementation of better resources for data management in academic, governmental and industrial settings, and will support teaching efforts focusing on data management and reproducible practices. In addition, we consider developing course material, giving (online) workshops, and continuously promoting usage of the package by actively engaging with the target user-base at conferences and on social media.

6.4 Key risks

Problems and delays in terms of coordinating community feedback (especially desired use-cases) and contributions (mainly solutions as listed above) could stem from a lack of consensus on the specific solution to be adopted. Hence we aim to address this at the earliest stages of the project (Months 1–2).

References

Briney, Kristin. 2015. Data Management for Researchers: Organize, Maintain and Share Your Data for Research Success. Pelagic Publishing Ltd.
Chytracek, Radovan, Bernhard Sonderegger, Richard Cote, and Terrell Russell. 2015. “The Rirods Package Enables Access to File Objects in the iRODS Data Broker System from R.” https://github.com/irods/irods_client_library_r_cpp/blob/master/DESCRIPTION.
Fielding, Roy Thomas. 2000. “REST: Architectural Styles and the Design of Network-Based Software Architectures.” Doctoral Dissertation, University of California.
Ooms, Jeroen. 2023. Curl: A Modern and Flexible Web Client for R. https://CRAN.R-project.org/package=curl.
Rajasekar, Arcot, Reagan Moore, Chien-Yi Hou, Christopher A. Lee, Richard Marciano, Antoine de Torcy, Michael Wan, et al. 2010. “iRODS Primer: Integrated Rule-Oriented Data System.” Synthesis Lectures on Information Concepts, Retrieval, and Services 2 (1): 1–143. https://doi.org/10.2200/s00233ed1v01y200912icr012.
Rajasekar, Arcot, Terrell Russell, Jason Coposky, Antoine de Torcy, Hao Xu, Michael Wan, Reagan W. Moore, et al. 2015. The integrated Rule-Oriented Data System (iRODS 3.0) Micro-service Workbook.
Richardson, Neal. 2022. Httptest2: Test Helpers for Httr2. https://CRAN.R-project.org/package=httptest2.
Wickham, Hadley. 2022. Httr2: Perform HTTP Requests and Process the Responses. https://CRAN.R-project.org/package=httr2.
Wickham, Hadley, Peter Danenberg, Gábor Csárdi, and Manuel Eugster. 2022. Roxygen2: In-Line Documentation for R. https://CRAN.R-project.org/package=roxygen2.
Wickham, Hadley, Jay Hesselberth, and Maëlle Salmon. 2022. Pkgdown: Make Static HTML Documentation for a Package. https://CRAN.R-project.org/package=pkgdown.
Wickham, Hadley, Jim Hester, Winston Chang, and Jennifer Bryan. 2022. Devtools: Tools to Make Developing R Packages Easier. https://CRAN.R-project.org/package=devtools.
Wilkinson, Mark D, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “Comment: The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data 3: 1–9. https://doi.org/10.1038/sdata.2016.18.