Data Compilation & Cleaning

With the exception of the blog posts themselves, I compiled the data so it could be analyzed, filtered, and manipulated as a whole. I gathered all comments on the blog posts into a single Google Sheet and did the same for the text and metadata of all tweets. The comments did not require cleaning before they could be used in text and data analysis programs (particularly R). The tweets did require cleaning because of the prevalence of emojis, hashtags, and user screen names, all of which create noise in text analysis packages. Hashtags, for instance, consistently surface as the most frequent "words" in the dataset unless they are removed.
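To illustrate the hashtag problem described above, here is a small sketch (the sample tweets are invented for demonstration): a naive word-frequency count over raw tweet text puts the event hashtag at the top of the list, crowding out substantive vocabulary.

```python
from collections import Counter

# Hypothetical raw tweets; any real conference dataset shows the same pattern.
tweets = [
    "#DH2019 great keynote on archives",
    "enjoying #DH2019 so much",
    "#DH2019 panel on text mining",
]

# Naive frequency count with no cleaning applied.
counts = Counter(word for tweet in tweets for word in tweet.lower().split())
print(counts.most_common(1))  # the hashtag dominates: [('#dh2019', 3)]
```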

I kept one copy of the original tweet downloads and created a second copy for cleaning. Initially, I cleaned the tweets manually using find and replace in Google Sheets and Excel. I later returned to the full dataset and cleaned it a second time in R. Following the practices outlined by Julia Silge and David Robinson in Text Mining with R and elaborated by Kris Shaffer in “Mining Twitter Data with R, TidyText, and TAGS,” I used the tidytext and tidyverse packages to remove stopwords, hashtags, usernames, and URLs. Removing these words and characters produced a clean body of text that could be easily filtered, transformed, and analyzed in R.
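The cleaning steps described above can be sketched in miniature. This is not the R pipeline itself but a Python analogue for illustration: regular expressions strip URLs, screen names, and hashtags, a crude ASCII filter drops emojis, and a small stopword set stands in for the much fuller stop_words lexicon that tidytext provides. The sample tweet and stopword list are assumptions, not the study's data.

```python
import re

# Minimal stopword set for demonstration; tidytext's stop_words lexicon
# contains over a thousand entries.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for", "on"}

def clean_tweet(text):
    """Strip URLs, @usernames, #hashtags, and emojis, then drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"@\w+", " ", text)               # user screen names
    text = re.sub(r"#\w+", " ", text)               # hashtags
    text = text.encode("ascii", "ignore").decode()  # crude emoji removal
    tokens = re.findall(r"[a-z']+", text.lower())   # simple word tokenizer
    return [t for t in tokens if t not in STOPWORDS]

tweet = "Loved the panel! Thanks @host #DH2019 \U0001F389 https://example.com"
print(clean_tweet(tweet))  # -> ['loved', 'panel', 'thanks']
```

In the R workflow, the equivalent steps would use stringr pattern replacement on the raw text, followed by tidytext's unnest_tokens() and an anti_join() against stop_words.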