Primary Sources as Data

Historians can organize primary sources in many ways, including as tidy datasets, which make the data clearer and more accessible. A primary source such as the “Curator’s Record of Donations to the Cabinet, 1769-1818 Volume 1: 3 February 1769-20 February 1818” from the American Philosophical Society can be considered “messy” data. Although its variables form columns, as in a tidy dataset, some columns hold more than one variable, which makes the record harder to read. The old-fashioned (albeit beautiful) handwriting can also be difficult to decipher, especially for readers with vision problems or for ESL speakers who are unaccustomed to cursive script. And because the record is a scanned page rather than typed text, it is inaccessible to those who rely on text-to-speech technology. For these reasons, it can be useful to transcribe the information into a tidy dataset.

For my tidy dataset, I kept the variables (columns) “date of meeting”, “donor”, “donation”, and “description”, which were already columns in the original record. To tidy it, I added two more variables, “origin of donor” and “orders”, splitting up the original “donor” and “description” columns so that each variable has its own column. I also removed the repeated information about the donor from the description of each donation. I believe this follows Hadley Wickham’s principles of tidy data: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
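
As a rough illustration of that splitting step, here is a minimal Python sketch. The rows, the “Name, of Place” layout of the donor cell, and the “Presented by …” wording are all invented placeholders for the example, not transcriptions from the Curator’s Record, and the “orders” column is left out because its contents depend on the actual entries.

```python
# Hypothetical rows: names, places, items, and dates are invented placeholders,
# not transcriptions from the Curator's Record.
messy_rows = [
    {
        "date of meeting": "1769-02-03",
        "donor": "Donor A, of Philadelphia",
        "donation": "mineral specimen",
        "description": "Presented by Donor A. Found near a river.",
    },
    {
        "date of meeting": "1771-05-17",
        "donor": "Donor B, of London",
        "donation": "engraving",
        "description": "Presented by Donor B. Engraved on copper.",
    },
]


def tidy_row(row):
    """Split the combined donor cell and drop the repeated donor name."""
    # Assumed layout "Name, of Place": split once on the separator.
    name, _, origin = row["donor"].partition(", of ")

    # Remove the repeated donor information from the start of the description.
    description = row["description"]
    prefix = f"Presented by {name}. "
    if description.startswith(prefix):
        description = description[len(prefix):]

    return {
        "date of meeting": row["date of meeting"],
        "donor": name,
        "origin of donor": origin,
        "donation": row["donation"],
        "description": description,
    }


tidy_rows = [tidy_row(row) for row in messy_rows]
```

After this step each row describes one donation, and the donor’s name and origin sit in separate columns, so either one can be searched or counted on its own.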

Now that we have a tidy dataset, we can use it to make analysis easier. As Wickham describes, tidy tools let us manipulate, visualize, or model a tidy dataset. Manipulation covers four basic operations: filtering, transforming, aggregating, and sorting. Filtering is subsetting or removing observations based on some condition. Transforming is adding or modifying variables. Aggregating is collapsing multiple values into a single value. Sorting is changing the order of observations. Visualization takes tidy data as input and produces a visual output, like a map or graph. Finally, modeling means passing tidy data to modeling functions, which return a fitted model as output.
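
The four manipulation operations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows below are invented placeholders rather than entries from the actual record, and the derived “year” column is a hypothetical addition.

```python
from collections import Counter

# Hypothetical tidy rows: donors, places, items, and dates are invented placeholders.
tidy_rows = [
    {"date of meeting": "1769-02-03", "donor": "Donor A",
     "origin of donor": "Philadelphia", "donation": "mineral specimen"},
    {"date of meeting": "1771-05-17", "donor": "Donor B",
     "origin of donor": "London", "donation": "engraving"},
    {"date of meeting": "1770-11-02", "donor": "Donor C",
     "origin of donor": "Philadelphia", "donation": "fossil shell"},
]

# Filter: keep only the observations that meet a condition.
from_philadelphia = [r for r in tidy_rows if r["origin of donor"] == "Philadelphia"]

# Transform: add a new variable ("year") derived from an existing one.
with_year = [{**r, "year": int(r["date of meeting"][:4])} for r in tidy_rows]

# Aggregate: collapse many rows into one count per donor origin.
donations_per_origin = Counter(r["origin of donor"] for r in tidy_rows)

# Sort: reorder observations by meeting date (ISO dates sort correctly as text).
by_date = sorted(tidy_rows, key=lambda r: r["date of meeting"])
```

Each operation works on whole columns or whole rows, which is only possible because every variable has its own column and every donation has its own row.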
