Reproducible Quantitative Methods

Lesson 5

Intro to scripting / Version control in R with GitHub

Topics and Resources

  1. Introduction to Scripting

    Most of you will have started making your way around GitHub and RStudio, as we've talked about both. This week we're going to deal with them more formally and make sure everyone is on the same page. To prepare, I'd like you to read these resources on scientific computing:

    A recent feature in Nature: Scientific computing: Code alert

    Best Practices for Scientific Computing: a guide to scientific computing for the self-taught.

    Good Enough Practices in Scientific Computing: I love this paper because of its honesty: 'by definition, the "best" are a small minority. What practices are comfortably within reach for the "rest"?' You may find that the material in the first half of the paper looks familiar, as it covers much of what we've talked about regarding data quality control in this course up until now.

  2. Data Cleaning in R

    Once you have your data in R, the first thing you're going to want to do is make sure it's clean and behaving as you'd expect. For the issues with your data that you identified last week, script ways to correct them using R. What you do here is highly dependent on the issues with your data, but the idea is to simplify our workflows: errors in data that will be used again and again get corrected in a scripted, reproducible way, working directly from the original data source. There are lots of opportunities to use gsub to correct typos, etc. Here’s a resource on the variety of things you can do this way. Here’s some inspiration. Scripted data cleaning: it CAN be done!!
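As a minimal sketch of what scripted cleaning with gsub can look like: the data frame below is made up for illustration (the column names, site labels, and the "n/a" missing-value marker are all assumptions), and the fixes your own data needs will differ.

```r
# Hypothetical raw data containing typical entry errors (invented for this sketch)
raw <- data.frame(
  site    = c("Plot A", "plot a ", "Plot B", "Plot  B"),
  species = c("Acer rubrum", "Acer rubrun", "Quercus alba", "quercus alba"),
  count   = c("12", "7", "n/a", "3"),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw data stays untouched
clean <- raw

# Trim leading/trailing whitespace and collapse repeated spaces
clean$site <- gsub("\\s+", " ", trimws(clean$site))

# Standardise the capitalisation of one site label
clean$site <- gsub("^plot a$", "Plot A", clean$site, ignore.case = TRUE)

# Correct a known typo in a species name
clean$species <- gsub("rubrun", "rubrum", clean$species, fixed = TRUE)

# Capitalise the first letter of each species entry
clean$species <- sub("^(\\w)", "\\U\\1", clean$species, perl = TRUE)

# Turn the text marker for missing values into NA, then convert to integer
clean$count[clean$count == "n/a"] <- NA
clean$count <- as.integer(clean$count)

print(clean)
```

Because every correction lives in the script, the raw file never has to be edited by hand, and the same fixes can be re-run whenever the data is read in again.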


    ProTip


    A helpful hint from those that came before

    Motivation in the script

    You may be hesitant to take this approach with ‘one-off’ data, like a spreadsheet that you only plan to use for one purpose. Why not correct it directly in the spreadsheet file? That’s understandable, but there are key places where scripted data cleaning is important. For example, if you’re downloading weather station data directly for an analysis, these files are typically updated continuously, and the same things will need to be corrected every time the data is downloaded.
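One way to handle data that gets re-downloaded is to wrap the corrections in a function and apply it to each fresh copy. The sketch below is only an illustration: the function name clean_weather, the column names, and the -9999 missing-value code are assumptions, not the format of any particular weather station.

```r
# Hypothetical cleaning function for a repeatedly downloaded file.
# Column names and the -9999 sentinel are assumptions for illustration.
clean_weather <- function(df) {
  # Replace the sentinel missing-value code with NA
  df$temp_c[which(df$temp_c == -9999)] <- NA
  # Parse dates that arrive as text
  df$date <- as.Date(df$date, format = "%Y-%m-%d")
  # Drop exact duplicate rows
  unique(df)
}

# Made-up data frame standing in for one fresh download
download_1 <- data.frame(
  date   = c("2020-01-01", "2020-01-02", "2020-01-02"),
  temp_c = c(1.5, -9999, -9999),
  stringsAsFactors = FALSE
)

cleaned <- clean_weather(download_1)
print(cleaned)
```

With a real file, the same function would be called on the result of read.csv() each time the data is downloaded, so the corrections never have to be redone by hand.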

  3. Version Control

    We've used GitHub's web interface as a workspace for our projects so far. Now it's time to get you set up with local version control for your projects. Here's how to set up version control with GitHub through the RStudio interface: Github in RStudio. You can also use GitHub Desktop, or, if you are hardcore, just plain Git Bash. I use RStudio's Git interface for R-based projects, GitHub Desktop for all other projects and file management, and Git Bash under duress. (My non-R code, like this website, is composed in HTML using Atom as my coding environment, in case you were wondering. It works well with Git, highlighting files that have changed compared to the previous commit.)

Exercises

  1. Github and R set-up

    Most of you will be set up in class, BUT! In case you're not, follow the tutorial linked above to set up GitHub in RStudio on your computer. Please attempt to create a project using version control through R. You *will* hit bugs. If the bugs are unresolvable, don't worry: you can use GitHub Desktop to do the very same things.


    ProTip


    A helpful hint from those that came before

    It's not a bug, it's a feature. It's an opportunity! Believe it or not, bugs help people learn about problem solving in computational environments. I'll support you through the debugging, but I want you to lead as much as possible. Common problems include firewall issues, OS differences, Git config problems, and Git files being installed in surprising places. It is VERY UNLIKELY you will break anything, so keep fiddling till it works!

Discussion

Reproducibility and replication

What makes a scientific experiment repeatable? Reproducible? Replicable? Explore these resources:

Replication frustration: what stops experiments being reliably repeated? NB: Although this article presents some really interesting results of a cool project, the article itself uses the terms reproducibility and replication interchangeably. I DO NOT AGREE WITH THIS! See the article below.

Reproducibility vs Replication. Note: The definitions here talk about people going out and replicating and reproducing, whereas we are more interested in the ability to replicate and reproduce.

Video

Scripts for Reproducible Research in R (R Tutorial 1.9) (6:21)

Questions

Why do we care so much about reproducibility?

How does scripting an analysis improve reproducibility?
