Date of Award
Doctor of Philosophy (PhD)
Peter S. Dodds
Christopher M. Danforth
The Google Books corpus contains millions of books in a variety of languages. Because of its sheer volume and free availability, the corpus is a treasure trove that has inspired a plethora of linguistic research.
It is tempting to treat frequency trends from Google Books data sets as indicators of the true popularity of various words and phrases. Doing so allows us to draw novel conclusions about the evolution of public perception of a given topic. However, sampling published works by availability and ease of digitization leads to several important effects, which have typically been overlooked in previous studies. One of these is the ability of a single prolific author to noticeably insert new phrases into a language. A greater effect arises from scientific texts, which have become increasingly prolific in the last several decades and are heavily sampled in the corpus. The result is a surge of phrases typical of academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets.
We critique a method used by authors of an earlier work to determine the birth and death rates of words in a given linguistic data set. While intriguing, the method in question appears to produce an artificial surge in the death rate at the end of the observed period of time. In order to avoid boundary effects in our own analysis of asymmetries in language dynamics, we observe the volume of word flux across various relative frequency thresholds (in both directions) for the second English Fiction data set. We then use the contributions of the words crossing these thresholds to the Jensen-Shannon divergence between consecutive decades to resolve major factors driving the flux.
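The two measurements described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the toy word counts, and the threshold value are all assumptions made for illustration. The sketch computes each word's contribution to the Jensen-Shannon divergence (in bits) between two relative frequency distributions, and lists the words that cross a relative frequency threshold in either direction.

```python
import math


def relative_frequencies(counts):
    """Normalize raw word counts to relative frequencies."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd_contributions(p, q):
    """Per-word contributions to the Jensen-Shannon divergence (bits).

    p and q map words to relative frequencies; a word missing from
    one distribution is treated as having frequency zero there.
    The contributions sum to the total divergence between p and q.
    """
    contributions = {}
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # frequency in the mixed distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contributions[word] = term
    return contributions


def threshold_flux(p, q, threshold):
    """Words crossing a relative frequency threshold between p and q.

    Returns (up, down): words that rise to or above the threshold,
    and words that fall below it.
    """
    up = {w for w in q if q[w] >= threshold and p.get(w, 0.0) < threshold}
    down = {w for w in p if p[w] >= threshold and q.get(w, 0.0) < threshold}
    return up, down


# Toy counts standing in for two consecutive decades (hypothetical data).
earlier = relative_frequencies({"the": 50, "war": 5, "peace": 45})
later = relative_frequencies({"the": 50, "war": 40, "peace": 10})

contributions = jsd_contributions(earlier, later)
ranked = sorted(contributions, key=contributions.get, reverse=True)
rising, falling = threshold_flux(earlier, later, threshold=0.2)
```

Ranking words by contribution, as in `ranked`, is one way to surface the terms that drive the divergence between two periods. Comparing the sizes of `rising` and `falling` across several thresholds gives a simple view of asymmetry in the flux.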
Having established careful information-theoretic techniques to resolve important features in the evolution of the data set, we validate and refine our methods by analyzing the effects of major exogenous factors, specifically wars. This approach leads to a uniquely comprehensive set of methods for harnessing the Google Books corpus and exploring socio-cultural and linguistic evolution.
Pechenick, Eitan, "Exploring the Google Books Corpus: An Information-Theoretic Approach to Linguistic Evolution" (2015). Graduate College Dissertations and Theses. 525.