Some notes on research-compendium

These are my notes while studying the research-compendium concept, which is essentially a bunch of guidelines to produce research that is ‘easily’ reproducible. I’m sure I will have better methods in hand when I comprehend v.py and make progress in my tasks.

The notes are mostly based on https://peerj.com/preprints/3192/, which is recommended as a canonical reading on the concept. Other references are mentioned throughout the text. These notes were prepared a few weeks ago during a foray into Docker. They are neither complete nor comprehensive, but will serve as a good refresher of the principal concepts.

  • Landing page : contains several references explaining research-compendium.
  • Principles
    • Stick with the prevailing conventions of your peers / scholarly community.
    • Keep data, methods and outputs separate, but make sure to unambiguously express the connections between them. The result files should be treated as disposable (they can be regenerated).
    • Specify the computational environment as clearly as possible. At minimum, provide a text file specifying the version numbers of the software and other critical tools being used.
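As a rough illustration of the last principle, the sketch below collects the interpreter version and the versions of selected packages into a plain-text snapshot. It is written in Python rather than R (in R, sessionInfo() reports comparable information). The function name environment_snapshot and the "name: version" output format are assumptions made for this example, not part of any compendium standard.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Return a plain-text record of the interpreter and package versions.

    `packages` is a list of distribution names chosen by the caller
    (hypothetical input; nothing here discovers dependencies automatically).
    """
    lines = [
        f"python: {platform.python_version()}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping it.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file kept next to the analysis (for example with open("environment.txt", "w")) gives the minimal text record described above; the file name is only a suggestion.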
  • R’s package structure lends itself to organising and sharing a compendium, for any project.
  • Dynamic documents : essentially like Org files or R Markdown files, i.e. literate programming. Sweave was originally introduced around 2002. By around 2015, however, knitr and rmarkdown had made substantial progress, and they are now generally preferred over Sweave.
  • Shipping data with the packages
    • CRAN : package data should generally be less than 5MB. A large percentage of the packages have some form of data. Data should be included if a methods package is being shipped with the analysis.
    • Use the piggyback package for attaching large data files to GitHub repos.
      • It is convenient to be able to upload a new dataset to be associated with the package, and this can be accessed with pb_download().
    • “Medium” sized data files can be attached using arkdb.
  • Adding a Dockerfile to the compendium
    • containerit : o2r/containerit
    • repo to docker : jupyter/repo2docker
    • Binder : https://mybinder.org
    • Use the holepunch package to make the setup easier.
  • Summarising the folder structure of an R-package-style compendium
    • Readme file : should be self-explanatory and as detailed as possible, and preferably include a diagram showing how the various components connect.
    • R/ : Script files with reusable functions go here. If roxygen is used to generate the documentation, then the man/ directory is automatically populated with it.
    • analysis/ : analysis scripts and reports. Consider using ascending numbers in the file names to aid clarity and order, e.g. 001-load.R, 002 -… and so on.
    • The layout above does not capture the dependencies. Therefore an .Rmd file or a Makefile (or Makefile.R) can be included to capture the full tree of dependencies. These files control the order of execution.
    • DESCRIPTION file in the project root : provides formally structured, machine- and human-readable information on authors, project license, software dependencies and other metadata.
      • When this file is included, the project becomes an installable R package.
    • NAMESPACE : autogenerated file that exports R functions for repeated use.
    • LICENSE : specifies the conditions for use / reuse.
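The layout listed above can be checked mechanically, for instance as an early step in a CI run. The following is a minimal Python sketch under stated assumptions: the helper missing_entries is hypothetical, the README is matched by any file whose name starts with "readme" (the exact file name is not given in these notes), and the autogenerated man/ directory is deliberately left out of the check.

```python
from pathlib import Path

# Entries taken from the folder summary above; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_FILES = ("DESCRIPTION", "NAMESPACE", "LICENSE")
REQUIRED_DIRS = ("R", "analysis")


def missing_entries(root):
    """Return the expected compendium entries that are absent under `root`."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    # Accept README, README.md, Readme.Rmd and similar spellings.
    has_readme = any(
        entry.is_file() and entry.name.lower().startswith("readme")
        for entry in root.iterdir()
    )
    if not has_readme:
        missing.append("README")
    return missing
```

An empty list means every expected entry was found; anything else names what still has to be added.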
  • [ ] Drone : CI service that operates on Docker containers. This can be used as a check.
  • Makefiles
    • Use the make language.
    • Specify the relationship between the data, the output and the code generating the output.
    • Define outputs (targets) in terms of inputs (dependencies) and the code necessary to produce them (recipes).
    • Allow rebuilding only the parts that are out of date.
    • The remake package lets you write Make-like instructions in R.
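To make the target / dependency / recipe idea concrete, here is a small Python sketch of the rebuild rule: a target is rebuilt only if it is missing or older than one of its dependencies. It mimics the timestamp comparison that make performs, but it is only an illustration; the names is_out_of_date and build and the shape of the rules dictionary are assumptions for this example, and this is not how remake is implemented.

```python
import os


def is_out_of_date(target, dependencies):
    """True if `target` is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(rules, target, log=None):
    """Bring `target` up to date and return the list of rebuilt targets.

    `rules` maps each target path to a (dependencies, recipe) pair, where
    `recipe` is a zero-argument callable that writes the target file.
    Paths without a rule are treated as source files that must exist.
    """
    log = [] if log is None else log
    if target not in rules:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target}")
        return log
    dependencies, recipe = rules[target]
    # Dependencies first, so the staleness check sees fresh timestamps.
    for dep in dependencies:
        build(rules, dep, log)
    if is_out_of_date(target, dependencies):
        recipe()
        log.append(target)
    return log
```

With rules for, say, a cleaned data file built from a raw file and a report built from the cleaned file, a second call to build right after the first rebuilds nothing, while touching the raw file causes both downstream targets to be regenerated.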
  • Principles to consider before sharing a research compendium
    • Licensing, Version control, persistence, metadata : main aspects to consider.
    • Archive a specific commit at a repository that issues persistent identifiers, e.g. DOIs, which are designed to be more persistent than other URLs. Refer to re3data.org for discipline-specific DOI-issuing repositories. Using a DOI simplifies citations by allowing the transfer of basic metadata to a central registry (e.g. CrossRef and DataCite). Doing this ensures that a publicly available snapshot of the code exists that matches the published results.
    • CRAN is generally not recommended for research-compendium packages, because it is strict about directory structures and contents of the R packages. It also has a 5MB limit for package data and documentation.
  • Tools and templates
    • devtools
    • rrtools : extends devtools

3 responses on “Some notes on research-compendium”

  1. “which is recommended as a canonical reading on the concept” – by *whom* ? Whose canon are you following and why?

    And providing a link is *not enough* as a reference because you force the reader to follow it just to find out who and what the hell that thing is! It’s not only terribly annoying but it’s also counter-productive for *you* because 1. I won’t follow links just because you failed to cite properly again 2. you have *no control* over what will be at that link in a year (or 10 or 20) from now and when it changes, your text becomes all of a sudden nonsense.

    1. I’m unable to find the exact reference which mentioned this paper as a canonical reading, though I clearly remember reading it, which I guess validates your comment by itself. I need to improve this. Actually, I had set up a standard citation approach via Scimax (Emacs) long ago, but disabled it at some point as I was not reading too many papers. I will look into this, and that should resolve the problem.

      I believe my starting point was the collaborative notes in this rOpenSci community call, and I started with Karthik Ram’s talk iirc.

      The correct citation for the referenced pre-print article is :
      Marwick B, Boettiger C, Mullen L. 2018. Packaging data analytical work reproducibly using R (and friends). PeerJ Preprints 6:e3192v2 https://doi.org/10.7287/peerj.preprints.3192v2

      However, the final article was also published, and should have been the correct reading. That is:
      Ben Marwick, Carl Boettiger & Lincoln Mullen (2018) Packaging Data Analytical Work Reproducibly Using R (and Friends), The American Statistician, 72:1, 80-88, DOI: 10.1080/00031305.2017.1375986
