Reproducible Quantitative Methods

Lesson 4

Cleaning up messy data / Identifying 'grey' data sources


Topics and Resources

  1. Introduction to data cleaning

    OpenRefine is a graphical tool for data cleaning. Data Carpentry has a tutorial that walks you through using OpenRefine on a messy dataset. We're going to use this tool to run some quality control operations on our project datasets.

    There are also some important resources I'd like to explore as we move forward. Hadley Wickham's "Tidy Data" sets out the principles of "tidy" datasets (each variable in its own column, each observation in its own row) and offers instruction on how to clean them in R.

    The Quartz guide to bad data is a great resource for understanding the many ways data can go wrong, and it offers readers suggestions for how to fix each problem.


    ProTip


    A helpful hint from those that came before

    Yes, you may. The Quartz Guide is great because it clearly delineates where students will need to go back to the data creator or consult an expert. Sometimes, people need ‘permission’ to ask for help and this guide gives clear scenarios where you should.

  2. Grey data liberation

    Data is all around us, and we, as humans, are generating it constantly as we go about our daily business. Sometimes you need to think a little harder before you realize something is actually data: for example, a lot of information is produced simply by people uploading photos to the internet. However, there's also a lot of classic research data (and literature) that never sees the light of day because its producers, for whatever reason, have not published on it through traditional academic channels. But you can find out a lot of cool things if you're willing to dig a bit for your data. One of the goals of this course, in addition to teaching all the skills we need to make our own work reproducible, is to provide a mechanism for liberating grey data.
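To make the "quality control operations" from topic 1 a little more concrete, here is a minimal Python sketch of one common cleaning step: grouping inconsistent spellings of the same value. It is loosely modeled on the key-collision idea behind OpenRefine's clustering feature, but it is a simplified illustration and not OpenRefine's actual implementation. The function names (`fingerprint`, `cluster`) and the site names are hypothetical, invented for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Return a normalized key so that near-duplicate spellings collide."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then turn punctuation into spaces.
    cleaned = re.sub(r"[^\w\s]", " ", unaccented.lower())
    # Split on whitespace, drop duplicate tokens, and sort them.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group raw values by fingerprint; keep only keys with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical messy column of site names (assumed data, for illustration only).
sites = [
    "North Pond",
    "north pond ",
    "Pond, North",
    "East Marsh",
    "East  Marsh",
    "West Field",
]

for key, variants in cluster(sites).items():
    print(key, "->", variants)
```

Each group that comes back is a candidate for merging, not an automatic fix. As the Quartz guide suggests, some cases need a decision from the data creator or an expert before the values are changed.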

Exercises

  1. Personas

    Do a “Persona” exercise designed to help you understand the goals and motivations of potential data donors. The persona is an imaginary data producer, and our goal is to get that person to share their data. Based on real-world observations and understanding of actual potential or current donors, sketch out this persona and identify their motivations. Personas are used in business and software development to help designers understand and empathize with their users.

Discussion

Grey data liberation

Refer back to Simon Leather’s post on unused data that needs love.

Reading

Grey literature: ask your students how this applies to data.

Questions

Why is it desirable to see grey data published?

Why might a data producer not publish on the data they've produced?

What sort of questions can we ask of grey data?

How do we convince producers of grey data to work with us?
