Date of Award

2022

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Complex Systems

First Advisor

Peter S. Dodds

Second Advisor

Christopher M. Danforth

Abstract

The proliferation of digital data across all areas of society has transformed our ability to hypothesize, study, and understand social systems. From this richness of data we have seen the development of innovative instruments to study, and make decisions with, the digital artifacts of the modern day. These developments build on advancements in computation, connectivity, analytical methodologies, and sociological theories. The sociotechnical instruments we have developed have been revolutionary to how we understand society and how we conduct business, but with these broad leaps comes ample room (and need) for more nuanced advancements. As with the development of any field, as the digital humanities evolve there is opportunity for targeted progress and the need for more tectonic shifts in practices. Iterative improvements include building more full-featured instruments that include a broader set of variables when analyzing and presenting results. More profound topics such as fairness, accountability, transparency, and ethics need increased attention as well, especially to create equitable, pro-social tools. Both in academia and in industry, there is room to improve how we curate, study, and operationalize data sets and the AI pipelines that sit atop them. Here we use natural language processing, machine learning, tools from data ethics, and other methods to explore how we can contextualize results and improve representations within instruments used to understand sociotechnical systems.

In the first study we examine the dynamics of responses to posts by US presidents on Twitter. These results offer a piece of culturally significant data in themselves: the ratio of response types is an unofficial measurement on the platform. Moreover, the results improve our understanding of the temporal dynamics that lead to the final counts that users may ultimately see. Deeply analyzing response activity dynamics provides insights on how the public responds to posts, the tenacity of supporters, and abnormalities that may be indicative of inauthentic behavior.

The second study examines the interaction between gender biases in health records and language models and how to mitigate these biases. We present specific language that is more commonly associated with female and male patients. We go on to demonstrate how the deliberate augmentation of text can minimize the gender signal present in data while retaining performance on medically relevant tasks. We conclude by showing how much of this bias is domain specific and how it interacts non-trivially with general-purpose language models.

Our final study investigates gender bias in resume text and relates this bias to the gender wage gap. We show that language differences within occupations are associated with the gender pay gap. Our results highlight the value of utilizing high-dimensional representations of individuals and the potential for previously undocumented biases to influence hiring pipelines.

Language

en

Number of Pages

213 p.
