Reproducible Quantitative Methods
Lesson 4
Topics and Resources
-
Special Guest Dr. Heather Soyka
On February 6th, we will be joined by Dr. Heather Soyka from the School of Information right here on campus. Before joining Kent State last fall, Dr. Soyka was a fellow at DataONE, the Data Observation Network for Earth, a project leading the way in data management and preservation for researchers in the environmental sciences. Dr. Soyka will extend last week's discussion of metadata to the infrastructures that support data preservation and organization. In preparation for Dr. Soyka's visit, please read:
Required Data Management Training for Graduate Students in an Earth and Environmental Sciences Department
It Takes a Library to Preserve a Scientific Database: A Collaborative Exploration of Database Preservation
And give this one a skim:
Developing an Approach for Data Management Education: A Report from the Data Information Literacy Project
-
Introduction to data cleaning
OpenRefine is a great graphical interface for data cleaning. Data Carpentry has a tutorial that walks you through using OpenRefine on a messy dataset. We're going to use this tool to do some quality-control operations on our project datasets.
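OpenRefine itself is point-and-click, but the quality-control operations it performs are easy to reason about in code. Here is a minimal sketch, in Python rather than OpenRefine, of two of its bread-and-butter operations: trimming stray whitespace and clustering near-duplicate entries. The species names are hypothetical example data, not from our project datasets.

```python
# A minimal sketch of the kind of cleanup OpenRefine performs:
# trimming whitespace and clustering near-duplicate entries.
# The species names here are hypothetical example data.

raw = ["  Oak ", "oak", "OAK", "Maple", "maple ", "Birch"]

def normalize(value):
    """Trim surrounding whitespace and collapse case variants."""
    return value.strip().lower()

# Group raw values by their normalized form (similar in spirit to
# OpenRefine's "key collision" clustering), then pick one
# canonical spelling per cluster.
clusters = {}
for value in raw:
    clusters.setdefault(normalize(value), []).append(value)

canonical = {key: key.capitalize() for key in clusters}
cleaned = [canonical[normalize(v)] for v in raw]
print(cleaned)  # ['Oak', 'Oak', 'Oak', 'Maple', 'Maple', 'Birch']
```

In OpenRefine you would do the same thing with a trim transform plus the cluster-and-edit dialog; the point is that each step is a small, repeatable operation.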
There are also some important resources I'd like us to explore as we move forward. Hadley Wickham's Tidy Data paper sets out the principles of "tidy" datasets and offers instruction on how to clean them in R.
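Wickham's core principle is that each variable gets its own column and each observation its own row. A common violation is a "wide" table with one column per year. The sketch below illustrates the wide-to-long reshape in plain Python (in R you would use Wickham's tidyr package instead); the site names and counts are hypothetical.

```python
# A minimal illustration of the "tidy data" principle:
# each variable is a column, each observation is a row.
# The field counts below are hypothetical.

# "Messy" wide format: one column per year.
wide = [
    {"site": "A", "2016": 10, "2017": 12},
    {"site": "B", "2016": 7,  "2017": 9},
]

# Tidy long format: year becomes its own variable.
tidy = [
    {"site": row["site"], "year": year, "count": row[year]}
    for row in wide
    for year in ("2016", "2017")
]
print(tidy[0])  # {'site': 'A', 'year': '2016', 'count': 10}
```

The tidy version has one row per site-year observation, which is the shape most analysis and plotting tools expect.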
The Quartz guide to bad data is a great resource for understanding the many, many ways data can go wrong, and it offers readers suggestions for how to fix them.
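Many of the problems the Quartz guide catalogs can be caught with a few automated sanity checks before analysis begins. Here is a small sketch of that idea; the temperature readings, the sentinel value, and the plausible range are all hypothetical assumptions for illustration.

```python
# A few sanity checks in the spirit of the Quartz guide:
# flag missing entries, "no data" sentinel values, and
# physically implausible readings.
# The temperature readings here are hypothetical.

readings = [12.5, -9999, 18.2, None, 150.0, 16.1]

def flag(value):
    if value is None:
        return "missing"
    if value == -9999:            # a common "no data" sentinel
        return "sentinel"
    if not (-60 <= value <= 60):  # implausible air temperature (deg C)
        return "out of range"
    return "ok"

flags = [flag(v) for v in readings]
print(flags)  # ['ok', 'sentinel', 'ok', 'missing', 'out of range', 'ok']
```

Checks like these tell you which values to investigate; as the guide stresses, fixing them often means going back to the data creator rather than guessing.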
ProTip
A helpful hint from those who came before
Yes, you may. The Quartz Guide is great because it clearly delineates where students will need to go back to the data creator or consult an expert. Sometimes, people need ‘permission’ to ask for help and this guide gives clear scenarios where you should.
-
Grey data liberation
Data is all around us, and we, as humans, are generating it constantly as we go about our daily business. Sometimes you need to think about it a little harder before you realize it's actually data: for example, a lot of information is produced simply by people uploading photos to the internet. However, there's also a lot of classic research data (and literature) that never sees the light of day; the producers of the data, for whatever reason, have not published on it through traditional academic channels. But you can find out a lot of cool things if you're willing to dig a bit for your data. One of the goals of this course, in addition to teaching all the skills we need to make our own work reproducible, is to provide a mechanism for liberating grey data.
Exercises
- Personas
Do a “Persona” exercise designed to help understand the goals and motivations of potential data donors. A persona is an imaginary data producer, sketched from real-world observations and understandings of actual or potential donors; our goal is to get that person to share their data. Personas are used in business and software development to help designers understand and empathize with their users. Sketch out your persona and identify his or her motivations.
Discussion
Grey data liberation
Refer back to Simon Leather’s post on unused data that needs love.
Reading
Grey literature: ask your students how this applies to data.
Questions
Why is it desirable to see grey data published?
Why might a data producer not publish on the data they've produced?
What sort of questions can we ask of grey data?
How do we convince producers of grey data to work with us?