Threat Detection on Twitter Using Corpus Linguistics

Conference Year

January 2019

Abstract

As social media grows in popularity, so does its potential to support culturally meaningful tools. One of the most promising is categorization software, which analyzes the linguistic data in social media posts to make predictions. It does so with the help of corpus linguistics, a form of analysis designed to pick out the most frequent and/or significant words in a dataset. This study focuses on software intended to detect threats. While this technology has the potential to flag abusive language used by groups or individuals, the text search strategies it currently uses often produce a high number of false positives, making it too unreliable for effective use. The software is most effective at marking whether a specific word is present in a tweet, not at determining whether that word is actually being used in a threatening way (e.g., "I'm planning on killing him" vs. "this silence is killing me"). Discourse analysis, which looks at the role context plays in language, could minimize these errors by helping researchers refine the software so that it more closely matches how people actually use language. The goal of this project, then, is to investigate ways of combining corpus linguistics and discourse analysis with a Twitter database to improve predictive analysis.
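
The false-positive problem described above can be illustrated with a minimal Python sketch. It is not the software examined in this study: the keyword list, the context cues, and the function names keyword_flag and context_flag are all hypothetical, chosen only to show why word presence alone cannot separate the two example sentences, while even a crude look at the surrounding words can.

```python
import re

# Hypothetical keyword list; the abstract does not name the terms actually searched for.
THREAT_KEYWORDS = {"kill", "killing"}

# Hypothetical context cues: a speaker-referring word before the keyword
# and a person-referring object directly after it.
SPEAKER_WORDS = {"i", "i'm", "we"}
PERSON_OBJECTS = {"him", "her", "them", "you"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_flag(text):
    """Flag a text if any threat keyword is present, regardless of context."""
    return any(token in THREAT_KEYWORDS for token in tokenize(text))


def context_flag(text):
    """Flag a text only if a keyword has a speaker before it and a person object after it."""
    tokens = tokenize(text)
    for i, token in enumerate(tokens):
        if token not in THREAT_KEYWORDS:
            continue
        speaker_before = any(t in SPEAKER_WORDS for t in tokens[:i])
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if speaker_before and next_token in PERSON_OBJECTS:
            return True
    return False


if __name__ == "__main__":
    for tweet in ("I'm planning on killing him", "this silence is killing me"):
        print(tweet, "| keyword:", keyword_flag(tweet), "| context:", context_flag(tweet))
```

Under these assumptions, the keyword check flags both sentences, while the context check flags only the first. A rule this simple would miss many real threats and is meant only to show the kind of contextual information that discourse analysis brings to the task.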

Primary Faculty Mentor Name

Julie Roberts

Status

Undergraduate

Student College

College of Arts and Sciences

Program/Major

Linguistics

Primary Research Category

Social Sciences

Abstract only.
