Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Mathematical Sciences

First Advisor

Peter S. Dodds

Second Advisor

Christopher M. Danforth

Abstract

The Google Books corpus contains millions of books in a variety of languages. Due to this incredible volume and its free availability, it is a treasure trove that has inspired a plethora of linguistic research.

It is tempting to treat frequency trends from Google Books data sets as indicators of the true popularity of various words and phrases. Doing so allows us to draw novel conclusions about the evolution of public perception of a given topic. However, sampling published works by availability and ease of digitization leads to several important effects, which have typically been overlooked in previous studies. One of these is the ability of a single prolific author to noticeably insert new phrases into a language. A greater effect arises from scientific texts, which have become increasingly numerous in the last several decades and are heavily sampled in the corpus. The result is a surge of phrases typical of academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets.

We critique a method used by authors of an earlier work to determine the birth and death rates of words in a given linguistic data set. While intriguing, the method in question appears to produce an artificial surge in the death rate at the end of the observed period of time. In order to avoid boundary effects in our own analysis of asymmetries in language dynamics, we observe the volume of word flux across various relative frequency thresholds (in both directions) for the second English Fiction data set. We then use the contributions of the words crossing these thresholds to the Jensen-Shannon divergence between consecutive decades to resolve major factors driving the flux.
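The two measurements described above, word flux across a relative frequency threshold and per-word contributions to the Jensen-Shannon divergence, can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the function names, the base-2 logarithm, the threshold value, and the toy frequencies are all assumptions chosen for illustration.

```python
import math


def jsd_contributions(p, q):
    """Per-word contributions (in bits) to the Jensen-Shannon divergence
    between two relative-frequency distributions given as dicts.
    Base-2 logarithms are an assumption made for this sketch."""
    contributions = {}
    for word in set(p) | set(q):
        pi = p.get(word, 0.0)
        qi = q.get(word, 0.0)
        mi = 0.5 * (pi + qi)  # mixture of the two distributions
        term = 0.0
        if pi > 0:
            term += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            term += 0.5 * qi * math.log2(qi / mi)
        contributions[word] = term
    return contributions


def threshold_flux(p, q, threshold):
    """Words crossing a relative-frequency threshold between an earlier
    distribution p and a later distribution q, in both directions."""
    up = {w for w in q if q[w] >= threshold and p.get(w, 0.0) < threshold}
    down = {w for w in p if p[w] >= threshold and q.get(w, 0.0) < threshold}
    return up, down


# Hypothetical relative frequencies for two consecutive decades.
earlier = {"the": 0.5, "war": 0.1, "peace": 0.4}
later = {"the": 0.5, "war": 0.4, "peace": 0.1}

contrib = jsd_contributions(earlier, later)
total_jsd = sum(contrib.values())
up, down = threshold_flux(earlier, later, threshold=0.2)

# Rank the crossing words by how much each contributes to the divergence.
ranked = sorted(up | down, key=lambda w: contrib[w], reverse=True)
```

In this toy example the unchanged word contributes nothing to the divergence, while the two words that cross the threshold account for all of it. Ranking crossing words by contribution is one way to pick out the terms that drive the flux between decades.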

Having established careful information-theoretic techniques to resolve important features in the evolution of the data set, we validate and refine our methods by analyzing the effects of major exogenous factors, specifically wars. This approach leads to a uniquely comprehensive set of methods for harnessing the Google Books corpus and exploring socio-cultural and linguistic evolution.

Number of Pages

109 p.

Recommended Citation

Pechenick, Eitan, "Exploring the Google Books Corpus: An Information-Theoretic Approach to Linguistic Evolution" (2015). Graduate College Dissertations and Theses. 525.
https://scholarworks.uvm.edu/graddis/525

Included in

Anthropological Linguistics and Sociolinguistics Commons, Applied Mathematics Commons, Computer Sciences Commons

Graduate College Dissertations and Theses

Exploring the Google Books Corpus: An Information-Theoretic Approach to Linguistic Evolution
