Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)


Complex Systems and Data Science

First Advisor

Christopher M. Danforth

Second Advisor

Peter S. Dodds


Unprecedented growth in digital information has transformed data in the social sciences from scarce to abundant. Social media in particular has created a new algorithmically mediated sociotechnical system, one where billions of daily communications influence our perspective on reality in poorly understood ways. Boosted by advances in high-performance computing, natural language processing, and machine learning, the digital traces left behind by these communications hold immense promise for measuring collective attention and sentiment at the societal scale.

In one study, we use hurricane name mentions as a proxy for awareness. We find that the exogenous temporal dynamics are remarkably similar across storms, but that overall collective attention varies widely even among storms causing comparable deaths and damage. We construct 'hurricane attention maps' and observe that hurricanes causing deaths on (or economic damage to) the continental United States generate substantially more attention in English-language tweets than those that do not. We find that a hurricane's Saffir-Simpson wind scale category assignment is strongly associated with the amount of attention it receives. Higher category storms receive larger proportional increases in attention per proportional increase in number of deaths or dollars of damage than lower category storms. The most damaging and deadly storms of the 2010s, Hurricanes Harvey and Maria, generated the most attention and were remembered the longest, respectively. On average, a category 5 storm receives 4.6 times more attention than a category 1 storm causing the same number of deaths and economic damage.

In a second study, we explore using well-curated, large-scale corpora of social media posts containing broad public opinion as an alternative data source to complement traditional surveys. While surveys are effective at collecting representative samples and are capable of achieving high accuracy, they can be both expensive to run and lag public opinion by up to a month. Both of these drawbacks could be overcome with a real-time, high-volume data stream and a fast analysis pipeline. A central challenge in orchestrating such a data pipeline is devising an effective method for rapidly selecting the best corpus of relevant documents for analysis. Querying with keywords alone often includes irrelevant documents that are not easily disambiguated with bag-of-words natural language processing methods. Here, we explore methods of corpus curation to filter irrelevant tweets using pre-trained transformer-based models, fine-tuned for our binary classification task on hand-labeled tweets. We are able to achieve F1 scores of up to 0.95. The low cost and high performance of fine-tuning such a model suggest that our approach could be of broad benefit as a pre-processing step for social media datasets with uncertain corpus boundaries.
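
As a rough illustration of the F1 metric reported for the relevance classifier, the following minimal sketch computes F1 for a binary relevant/irrelevant labeling. The labels and predictions are hypothetical placeholders, not data from the dissertation, and the function name `f1_score` is chosen here for illustration only; the study's actual evaluation code is not described in the abstract.

```python
def f1_score(y_true, y_pred):
    """Return F1 for the positive class (1 = relevant tweet)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, dropped
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical hand labels and classifier outputs (assumed for illustration).
labels = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1, 1, 0, 1]
score = f1_score(labels, predictions)
```

Because F1 balances precision against recall, a high score indicates that a filter both keeps most relevant tweets and admits few irrelevant ones, which is the property that matters for corpus curation.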

In a third chapter, I describe my contributions to nearly two dozen studies leveraging Twitter data to explore collective attention, sentiment, and language. These cover a range of topics including the COVID-19 pandemic, politicians and K-pop stars, public health, and social movements, demonstrating the broad value of social media data in interdisciplinary research.



Number of Pages

151 p.

Available for download on Thursday, October 17, 2024