How Should I Organize My R Research Projects?

My formal training in computer programming consists of two R programming labs required by my first statistics classes, and some JavaScript and database training. That’s about it. Most of my programming knowledge is self-taught.[1] For a researcher who does a lot of programming but doesn’t consider programming to be the job, that’s fine… up to a point.

While I understand the languages I need well enough, I don’t know much about programming best practices.[2] This ranges from function naming to code organization, along with all the tools others have created to manage projects (git, make, ctags, etc.). For short scripts and blog posts, this is fine. Even for a research paper where you’re using tools rather than making new ones, this is okay. But when projects start to get big and highly innovative, my lack of knowledge of programming practices starts to bite me in the butt.

I program with R most of the time, and I’m smart enough to program defensively, writing generalizable functions that have a reasonable amount of parameterization and accept other functions as inputs, which helps compartmentalize my code and makes parameters easy to change. But there’s a lot more I can learn, and I have read articles on the subject, such as those by Jon Zelner, Chris Csefalvay, and Robert Flight discussed below.

Not surprisingly, some of the advice seems contradictory. This blog post summarizes the advice and ends with a plea for help on what to do.

My Approach So Far (What NOT To Do)

I started the current project I’m working on in early 2016 (merciful heavens, has it been that long?). My advisor didn’t tell me where it was going; he seemed to pitch it as something to work on over the winter break. But it turned into my becoming one of his collaborators (with a former Ph.D. student of his, now a professor at the University of Waterloo), and my taking charge of all things programming (that is, simulations and applications) for a research paper introducing a new statistic for change point analysis (you can read my earlier post, where I introduced the topic for the first time).

To anyone out there wondering why academics write such terrible code, let me break it down for you:

  1. Academics often are not trained programmers. They learned programming on their own, enough to solve their research problems.
  2. Academic code often was produced during research. My understanding of professional programming is that often a plan with a project coordinator exists, along with documents coordinating it. While I’m new to research, I don’t think research works in that manner.
  3. It’s hard to plan for research. Research breaks into new territory without a fixed end goal, since we don’t necessarily know where it will end.[3]

As the project grew in scope the code I would write acquired features like a boat hull acquires barnacles. Here’s a rough description of how my project is structured (brace yourself):

  • The file ChangepointCommon.r contains nearly every important function and variable in my project, save for functions that are used for drawing the final PDF plots of the analysis. Not every file uses everything from ChangepointCommon.r, but it is called via source() frequently. This file has a sister file, ChangepointCommon.cpp, for holding the C++ code that underlies some R functions.
  • A file called powerSimulations2.r is a script that performs all (Monte Carlo) simulations. These simulations are extensive, and I perform them on my school’s supercomputer to take advantage of its 60+ cores and 1 TB of RAM. They simulate our test statistic against multiple similar statistics in a variety of contexts, for the sake of producing power curves at various sample sizes. This script is a derivative of powerSimulations.r, which did similar work while assuming that the long-run variance of the data was known.
  • Somewhere there is a file that simulated our test statistic and showed that the statistic would converge in distribution to some random variable under a variety of different contexts. I don’t know where this file went, but it’s somewhere. At least I saved the data.
  • But the file that plots the results of these simulations is dist_conv_plots.r. I guess it makes plots, but if I remember right this file is effectively deprecated by another file… or maybe I just haven’t needed to make those plots for a long time. I guess I don’t remember why this exists.
  • There’s a file lrv_est_analysis_parallel.r that effectively does simulations that show that long-run variance estimation is hard, examining the performance of these estimators, which are needed for our test statistics. Again, due to how much simulation I want to do, this is meant to be run on the department’s supercomputer. By the way, none of the files I mentioned above can be run directly from the command line; I’ve been using source() to run them.
  • powerSimulations2.r creates .rda files that contain simulation results; these files need to be combined together, and null-hypothesis rejection rates need to be computed. I used to do this with a file called misc1.R that effectively saved commands I was doing by hand when this job was simple, but then the job soon became very involved with all those files so it turned into an abomination that I hated with a passion. It was just two days ago that I wrote functions that did the work misc1.R did and added those functions to ChangepointCommon.r, then wrote an executable script, powerSimStatDataGenerator.R, that accepted CSV files containing metadata (what files to use, what the corresponding statistical methods are, how to work with those methods) and used those files to generate a data file that would be used later.
  • Data files for particular applications are scattered everywhere and I just have to search for them when I want to work with them. We’ve changed applications but the executable BankTestPvalComputeEW.R works with our most recent (and highlighted) application, taking in two CSV files containing data and doing statistical analyses with them, then spitting the results of those analyses into an .rda (or is it .Rda?) file.
  • Finally, the script texPlotCreation.R takes all these analyses and draws pictures from them. This file includes functions not included in ChangepointCommon.R that are used for generating PDF plots. The plots are saved in a folder called PaperPDFPlots, which I recently archived since we redid all our pictures using a different method but I want to keep the pictures made the old way, yet they keep the same file names.
  • There is no folder hierarchy. This is all stored in a directory called ChangepointResearch. There are 149 files in that directory, following different conventions; to be fair to myself, though, a lot of them were created by LaTeX. There is, of course, the powerSimulations_old directory where I saved the old version of the simulations I did, and the powerSimulations_paper directory where the most recent versions are kept. Also, there’s PaperPDFPlots where all the plots were stored; the directory has 280 files, all of them plots.
  • Finally there’s a directory called Notebook. This is a directory containing the attempt of writing a bookdown book that would serve as a research notebook. It’s filled with .Rmd files that contain code I was writing, along with commentary.

In short, I never organized my project well. I remember once needing to sift through someone else’s half-finished project at an old job, and hating the creator of those directories so much. If someone else attempted to recreate my project without me around I bet they would hate me too. I sometimes hate myself when I need to make revisions, as I finished doing a few days ago.

I need to learn how to not let this happen again in the future, and how to properly organize a project–even if it seems like it’s small. Sometimes small projects turn into big ones. In fact, that’s exactly what happened here.

I think part of the reason this project turned out messy is that I learned what literate programming and R Markdown are around the time I started, and I took the ideas too far. I tried to do everything in .Rnw (and then .Rmd) files. While R Markdown and Sweave are great tools, and writing code with the documentation is great, one may go too far with it. First, sharing code among documents without copy/paste (which is bad) is difficult to do. Second, in the wrong hands, one can think there’s not much need for organization since this is a one-off document.

Of course, the naming conventions are as inconsistent as possible. I documented functions, and if you read them you’d see that documentation and even commenting conventions varied wildly. Clearly I have no style guide, no linting, no checking that inputs are as expected, and on and on and on: a zoo of bad practice. These issues seem the most tractable to resolve:

  • Pick a naming convention and stick with it. Do not mix them. Perhaps consider writing your own style guide.
  • Use functions whenever you can, and keep them short.
  • Use packrat to manage dependencies to keep things consistent and reproducible.
  • roxygen2 is the standard for function documentation in R.
  • Versioning systems like git keep you from holding onto old versions out of fear of needing them in the future. Use versioning software.
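
A minimal sketch of what several of these points look like together: one short function, one naming convention (snake_case), roxygen2-style documentation, and explicit input checks. The function name cusum_stat and the CUSUM-type formula are illustrative assumptions only; this is not the statistic from the paper.

```r
#' Compute a simple CUSUM-type statistic (illustrative only)
#'
#' @param x Numeric vector of observations with at least two values,
#'   no missing values, and non-zero variance.
#' @return A single non-negative number:
#'   max_k |S_k - (k / n) * S_n| / (sd(x) * sqrt(n)).
#' @examples
#' cusum_stat(c(0, 0, 1, 1))
cusum_stat <- function(x) {
  # Fail early and loudly when the input is not what we expect.
  stopifnot(is.numeric(x), length(x) >= 2, !anyNA(x))
  scale <- sd(x)
  if (scale == 0) stop("x must not be constant")

  n <- length(x)
  partial_sums <- cumsum(x)
  # Partial sums centered by their expected value under "no change".
  centered <- partial_sums - (seq_len(n) / n) * partial_sums[n]
  max(abs(centered)) / (scale * sqrt(n))
}
```

With comments in this format, roxygen2 can generate the help page for the function once the code lives in a package.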

Aside from these, though, I’m torn between two general approaches to project management, which I describe below.

Projects As Executables

I think Jon Zelner’s posts were the first posts I read about how one may want to organize a project for both reproducibility and ease of management. Mr. Zelner suggests approaching a project like how software developers approach an application, but instead of the final product being an executable, the final product is a paper. Some of the tips Mr. Zelner provides include:

  • Treat R scripts as executables. That is, your scripts should be executable from the command line after placing a shebang at the beginning and running chmod +x script.R. They should accept command-line arguments. (I manage this using optparse, though he recommends other tools.)
  • Organize your scripts with a standard directory structure (keep all scripts in a directory like /R, all data in /data, all figures in /fig, etc.), and create a make file that describes their relationships. The make file keeps the project up-to-date, making sure that all dependencies between scripts and data files and figures are managed and repercussions from changes are fed forward appropriately.
  • Use knitr for report writing, but the bulk of the work is done elsewhere.
  • Manage reproducibility issues, like dependency issues, using packages that I don’t want to use. I may look to packrat for this.
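
As a rough sketch of the first tip, here is what an executable R script with command-line arguments might look like. It uses only base R (commandArgs()) rather than optparse so that it runs without extra packages; the script name simulate.R, the options --n, --seed, and --output, and the rnorm() stand-in for a real simulation are all hypothetical.

```r
#!/usr/bin/env Rscript
# simulate.R: hypothetical executable script (names and options are illustrative).
# Usage: ./simulate.R --output=FILE [--n=100] [--seed=1]

# Parse arguments of the form --key=value into a named list.
parse_cli_args <- function(args) {
  opts <- list(n = "100", seed = "1", output = NA_character_)
  for (arg in args) {
    m <- regmatches(arg, regexec("^--([a-z]+)=(.*)$", arg))[[1]]
    if (length(m) != 3 || !(m[2] %in% names(opts))) {
      stop("Unrecognized argument: ", arg)
    }
    opts[[m[2]]] <- m[3]
  }
  opts$n <- suppressWarnings(as.integer(opts$n))
  opts$seed <- suppressWarnings(as.integer(opts$seed))
  if (is.na(opts$n) || opts$n < 1) stop("--n must be a positive integer")
  if (is.na(opts$seed)) stop("--seed must be an integer")
  opts
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_cli_args(args)
  if (is.na(opts$output)) {
    message("Usage: ./simulate.R --output=FILE [--n=100] [--seed=1]")
    return(invisible(NULL))
  }
  set.seed(opts$seed)
  sims <- rnorm(opts$n)  # stand-in for a real simulation
  save(sims, file = opts$output)
  invisible(opts$output)
}

# Run only when executed as a script (e.g. via Rscript), not when source()d.
if (sys.nframe() == 0L) main()
```

After chmod +x simulate.R, a call such as ./simulate.R --n=500 --output=data/sims.rda would run from the shell, which is also the form a make rule would invoke.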

The projects-as-executables idea resonated with me and I’ve worked to write fewer .Rmd files and more executable .R files. You’ll notice that many files in my project are executable; this is an attempt at implementing Mr. Zelner’s ideas. The entire GNU/Linux development pipeline is available to you for your project, and there are many tools meant to make a coder’s life easier (like make).

Months ago, when the reviewers of our paper said they wanted more revisions and the responsibility for those revisions fell squarely on me, I thought of fully implementing Mr. Zelner’s ideas, with make files and all. But the make file intimidated me when I looked at the web of files and their dependencies, including the 200+ PDF files whose names I don’t even know, or all the data files containing power simulations. Sadly, only a couple days ago I realized how to work around that problem (spoiler: it’s not really a problem).

Of course, this would require breaking up a file like ChangepointCommon.r. A file shouldn’t include everything. There should be a separate file for statistical functions, a file for functions relevant for simulations, a file for functions that transform data, a file for functions that make plots, and so on. This is important to make sure that dependencies are appropriately handled by make; you don’t want the exhaustive simulations redone because a function that makes plots was changed. ChangepointCommon.r has been acting like a poor man’s package, and that’s not a good design.
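
To see why the split matters, here is a small R sketch of the timestamp rule that make essentially applies: a target is rebuilt when it is missing or older than any file it depends on. The function needs_rebuild and the file names in the comments are hypothetical, chosen only to illustrate the idea.

```r
# Return TRUE when `target` is missing or older than any of its dependencies,
# which is essentially the rule make uses to decide what to rebuild.
needs_rebuild <- function(target, deps) {
  missing_deps <- deps[!file.exists(deps)]
  if (length(missing_deps) > 0) {
    stop("Missing dependencies: ", paste(missing_deps, collapse = ", "))
  }
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Hypothetical layout after splitting ChangepointCommon.r:
# needs_rebuild("data/power_sims.rda",
#               c("R/stat_functions.R", "R/sim_functions.R"))
# needs_rebuild("fig/power_curves.pdf",
#               c("R/plot_functions.R", "data/power_sims.rda"))
```

Under this layout, editing R/plot_functions.R makes only the figure stale. With everything in one file, the same edit would mark the simulation results as out of date too.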

Projects as Packages

That last sentence serves as a good segue to the other idea for managing a research project: Chris Csefalvay and Robert Flight’s suggestion to write packages for research projects. Csefalvay and Flight suggest that projects should be handled as packages. Everything is a part of a package. Functions are functions in a package. Analyses are vignettes of a package. Data are included in the /data directory of the package. Again: everything is a package.

Packages are meant to be distributed, so viewing a project as a package means preparing your project to be distributed. This can help keep you honest when programming; you’re less likely to take shortcuts when you plan on releasing it to the world. Furthermore, others can start using the tools you made and that can earn you citations or even an additional paper in J. Stat. Soft. or the R Journal. And as Csefalvay and Flight point out, when you approach projects as package development, you get access to the universe of tools meant to aid package development. The documentation you write will be more accessible to you, right from the command line.
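
For a sense of how little is needed to start, base R’s utils::package.skeleton() can generate a package directory from existing objects. The sketch below is only an illustration under assumptions: the package name cpdemo, the function rejection_rate, and the data frame example_pvals are made up, and the package is written to a temporary directory.

```r
# Collect the objects that should go into the package in their own environment.
pkg_env <- new.env()
pkg_env$rejection_rate <- function(p_values, alpha = 0.05) {
  stopifnot(is.numeric(p_values), !anyNA(p_values))
  mean(p_values < alpha)
}
pkg_env$example_pvals <- data.frame(p = c(0.01, 0.20, 0.04, 0.50))  # made-up data

# Write the skeleton into a fresh temporary directory.
pkg_parent <- tempfile("pkgdemo")
dir.create(pkg_parent)
utils::package.skeleton(
  name = "cpdemo",
  list = c("rejection_rate", "example_pvals"),
  environment = pkg_env,
  path = pkg_parent
)

# Inspect what was generated (DESCRIPTION, NAMESPACE, R/, man/, ...).
list.files(file.path(pkg_parent, "cpdemo"))
```

The generated help files are only templates that still need to be filled in, and tools such as roxygen2 can take over the documentation from there.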

My concern is that not everything I want to do seems like it belongs in a package or vignette. Particularly, the intensive Monte Carlo simulations don’t belong in a package. Vignettes don’t play well with the paper that my collaborators actually write or the format the journal wants.

I could get the benefit of Csefalvay and Flight’s approach and Mr. Zelner’s approach by: putting the functions and resulting data sets in a package; putting simple analyses, demonstrations, and de facto unit tests in package vignettes; and putting the rest in project directories and files that follow Mr. Zelner’s approach. But if the functions in the highly-unstable package are modified, how will make know to redo the relevant analyses?

I don’t know anything about package development, so I could be overcomplicating the issue. Of course, if the work is highly innovative, like developing a new, never-before-seen statistical method (like I’m doing), then a package will eventually be needed. Perhaps it’s best to lean on that method anyway for that fact alone.

Conclusion

I would like to hear others’ thoughts on this. Like I said at the beginning, I don’t know much about good practices. If there are any tips, I’d like to hear them. If there’s a way to maximize the benefits of both Mr. Zelner’s and Csefalvay and Flight’s approaches, I’d especially like to hear about it. It may be too late for this project, but I’ll want to keep these ideas in mind for the future to keep myself from making the same mistakes.



  1. I did take classes from the School of Computing at the University of Utah, enough to earn a Certificate in Big Data, but they assumed programming knowledge; they didn’t teach how to program. That’s not the same thing. 
  2. Another problem is that I haven’t been taught to algorithmically analyse my code. Thus, sometimes I write super slow code.
  3. That said, while you can’t necessarily anticipate the end structure, you could take steps to minimize how much redesigning will need to be done when the unexpected eventually surfaces; this is important to programming defensively. 

14 thoughts on “How Should I Organize My R Research Projects?”

    • The R Suite approach is:
      1) Use version control – either SVN or Git
      2) Split solution into: master scripts and packages
      a) packages contain all the logic (encapsulation)
      b) master scripts contain flow – you use functions from your packages
      3) Use loggers to control the flow
      4) Use tests to control the quality
      5) Lock 3rd party package versions for reproducibility
      6) Lock R version for reproducibility
      7) Build your own CRAN with packages you use – either source or binary versions
      8) Run your solution using Rscript (no “hanging” session issues)
      9) Use Docker if you want to control sysrequirements reproducibility
      10) Use CI/CD (Jenkins, Travis, AppVeyor) to automate building and deployment

      You can walk through the basic workflow by reading the docs (http://rsuite.io/RSuite_Tutorial.php#basic-r-suite-usage). It is very easy to use packages with R Suite – no excuse for not using packages 🙂 :
      – rsuite proj pkgadd -n mypackage (this adds a package to the project)
      – rsuite proj build (this builds the package)

      I hope this is helpful for you.

  1. 1. Use git
    2. R code resides in R folder, use folders data, results, output for other stuff.
    3. Subprojects go into subfolders of R
    4. Use Rstudio project in the parent folder so that you can use relative paths.
    5. Code that has matured goes into a package. Rule of thumb: if you source one R file in all new code but rarely modify it, it needs to go into a package.
    6. Use unit tests for packages, library(testthat)
    7. If you absolutely want a makefile, use one R script which sources everything else.
    8. If you have other language code, use separate folders for that.

    This system works for me, both for individual projects and for collaborative ones.

  2. This sounds like a really hard problem for this stuff.

    One suggestion to combine the approaches (assuming you are using make, and a package), I’m guessing you want make to know when the underlying functions have changed. You could do a SHA on the package source, and when that changes, then you know you need to re-run.

    Alternatively, you could use Rscript -e to get a digest of a particular function in the package, and make that a dependency. For example, from the terminal, I can do this using the digest package: Rscript -e 'digest::digest(lm)'

    This gets me a hash (MD5 by default) of the contents of the “lm” function in R. You can do the same for any installed package with digest::digest(packageName::function)
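
The first suggestion in this comment (hash the package source and re-run when the hash changes) could be sketched roughly as follows. This version uses base R’s tools::md5sum() instead of the digest package, so it computes MD5 checksums rather than a SHA. The function name update_source_stamp and the stamp-file idea are assumptions made for illustration.

```r
# Rewrite `stamp_file` only when the contents of the .R files in `src_dir`
# have changed. make can then list the stamp file as a dependency: its
# timestamp moves only when the source really changed.
update_source_stamp <- function(src_dir, stamp_file) {
  files <- sort(list.files(src_dir, pattern = "\\.[Rr]$", full.names = TRUE))
  stopifnot(length(files) > 0)

  # One line per source file: "<md5 checksum> <file name>".
  new_lines <- paste(unname(tools::md5sum(files)), basename(files))
  old_lines <- if (file.exists(stamp_file)) readLines(stamp_file) else character(0)

  changed <- !identical(new_lines, old_lines)
  if (changed) writeLines(new_lines, stamp_file)
  changed
}

# Hypothetical usage before calling make:
# update_source_stamp("mypackage/R", "mypackage.stamp")
```

Analyses that depend on the package would then name the stamp file as a prerequisite in the make file, so that re-saving a source file without changing it does not trigger the simulations again.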

  3. I feel your pain. Although I’ve been programming since the Johnson Administration, I’m not a programmer, and I’ve struggled as you’ve described.

    Here’s what I do to maintain a degree of sanity:

    1. More than a few hours work gets pushed to github
    2. For file hierarchy, I follow Yihui Xie’s conventions in bookdown
    3. I overcomment
    4. In development, I use standard names, such as df (for data frame), which means I can cut and paste canned snippets without having to edit them. I then go back and assign appropriate names.
    5. When a project gets enough functions, I package
    6. I use Rmarkdown, to keep track of why I do things, even if I don’t plan on making a paper out of it
    7. I keep a snippets library
    8. I’m not pure when it comes to R; I’ll bring in system commands
    9. I live in the tidyverse, especially with magrittr
    10. Whenever an object is in usable form I save it to an Rda file.

  4. Hi Curtis,
    I empathize with your post. This is a difficult and deep problem that we also faced in trying to create some unity, quality and transparency for our relatively small group of biostatisticians and biomedical data scientists. We then spent some time trying to work on a possible solution, but I’m sure there are other great ones as well, especially depending on preferences (e.g., containerization) and needs (e.g., polyglot).
    We actually wrote an R package to help with this called adapr (Accountable Data Analysis Process in R). The package’s objective is to allow users to link and organize R scripts in a way that their results and structure are transparent, not just reproducible. If you use git, then there are some added benefits of tracking the provenance of all files. The package is complex enough that typical R package manual documentation is insufficient for users to get started, so we wrote a paper. I could go on for hours, but… Here is the link to the article:
    https://journal.r-project.org/archive/2018/RJ-2018-001/RJ-2018-001.pdf
    here is the citation:
    Gelfond, J., Goros, M., Hernandez, B. and Bokov, A., 2018. A System for an Accountable Data Analysis Process in R. The R Journal.
    The package has a GitHub site, https://github.com/gelfondjal/adapr, for reporting issues.
    Best of luck,
    Jon
