Wednesday, December 22, 2010

Ngram anomalies

Now that I've played with the Google Ngrams tool a little, I continue to think it's a powerful window into a lot of interesting questions. But I also see that some of the patterns that emerge are plainly spurious and surely do not correspond to real changes in language, culture, or collective interest over time. It is easy to find search terms that plainly exhibit a kind of "instrument error": an observation that arises from an artifact of the method rather than from a real pattern in the underlying behavior.

Fortunately it is possible to probe these areas of anomaly with the goal of figuring out what they mean. So let's see what happens when we pick out a set of common words that are not freighted with a lot of culturally specific significance. This will let us see more clearly how the instrument itself works.

Consider the color words red, green, yellow, blue, black. Let's graph the frequency of these terms in American English from 1800 to 2000. Before looking at the Ngram graph, let's consider what we would expect ex ante. Color words occur in books to designate -- color. They are common words, so we might expect their frequency to remain fairly constant over time. So here is the null hypothesis about the frequency of common color terms: absent a change in cultural attitudes toward color, the color words should remain roughly constant in frequency (flat curve). And the usage pattern for each term should be independent of the others. So we should expect independent random fluctuations in the frequency of each color word, so that in a given year "blue" might bump up in frequency while "red" bumps down.

(Why should we expect a degree of independence in the random fluctuations between "red" and "blue"? Because, fundamentally, there is no common mechanism that would link their behavior.)

Here are some ways in which the actual behavior of color terms might deviate from the null hypothesis. Some colors may be more in style than others at a given time -- there may be a cultural preference for red over blue, so the frequency of "red" may be greater than the frequency of "blue". And the frequencies may change as cultural preferences change; so "blue" may become more frequent than "red" in a later generation. More generally, literary taste may change by becoming more descriptive over time -- with more frequent use of color terms -- or more formal, with less use of color terms. So it would be possible to explain persistent differences in the frequency of color terms; shifting frequencies across different color words; and even a long-term rise or decline in the whole family of color words.

So ex ante, for this group of common color words we would expect a graph of flat lines for the five terms, with uncorrelated fluctuations in each line.
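To make the null hypothesis concrete, here is a minimal Python sketch of what "flat lines with uncorrelated fluctuations" would look like. Everything in it is an assumption made for illustration: the baseline frequencies are invented, the noise level is arbitrary, and the function names are hypothetical. It uses no Ngram data at all. The point is only that under the null hypothesis the pairwise correlations between color words should sit near zero.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_null(years, seed=0):
    """Flat baseline per word plus independent yearly noise (synthetic data)."""
    rng = random.Random(seed)
    # Hypothetical baseline frequencies; these are NOT real Ngram values.
    baselines = {"red": 0.010, "green": 0.007, "yellow": 0.004,
                 "blue": 0.005, "black": 0.012}
    return {
        word: [base * (1 + rng.gauss(0, 0.05)) for _ in years]
        for word, base in baselines.items()
    }


def pairwise_correlations(series):
    """Correlation between the yearly frequencies of every pair of words."""
    return {
        (a, b): pearson(series[a], series[b])
        for a, b in itertools.combinations(series, 2)
    }


years = range(1800, 2001)
null_series = simulate_null(years)
for pair, r in pairwise_correlations(null_series).items():
    print(pair, round(r, 3))
```

With two centuries of independent noise, each of the ten printed correlations should be small. Correlations close to 1 across many pairs would be the signature of something the null hypothesis does not contain.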

Now let's look at the actual graph of these word frequencies (link).


Here we can see behavior that flatly contradicts these reasonable ex ante expectations. First, there are stretches of time in which the color words covary extremely closely, to the extent that the graphs look identical in shape. This is true, for example, in the neighborhood of 1820. This is impossible to explain as anything other than an artifact of some sort. It is not credible that the real frequencies of several color words would fluctuate up and down with this degree of synchrony.
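One way to locate such stretches of synchrony, rather than judging them by eye, is to correlate the year-over-year changes of two words inside a sliding window. The sketch below is only an illustration under stated assumptions: the data are synthetic, the "shared factor" stands in for whatever the unknown artifact is, and the function names and the ten-year window are my own choices, not anything the Ngram tool provides.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def yearly_changes(values):
    """Differences between consecutive yearly values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def rolling_change_correlation(years, a, b, window=10):
    """Map the first year of each window to the correlation of yearly changes.

    The window starting at year Y covers the `window` changes between
    Y and Y + window.
    """
    da, db = yearly_changes(a), yearly_changes(b)
    return {
        years[start]: pearson(da[start:start + window], db[start:start + window])
        for start in range(len(da) - window + 1)
    }


def synthetic_pair(years, artifact_years, seed=1):
    """Two synthetic words that share a year-specific factor only in artifact_years."""
    rng = random.Random(seed)
    red, blue = [], []
    for year in years:
        # The shared factor is an assumed stand-in for a corpus-level artifact.
        shared = rng.uniform(0.6, 1.4) if year in artifact_years else 1.0
        red.append(0.010 * shared * (1 + rng.gauss(0, 0.01)))
        blue.append(0.005 * shared * (1 + rng.gauss(0, 0.01)))
    return red, blue


years = list(range(1800, 1901))
red, blue = synthetic_pair(years, artifact_years=range(1810, 1831))
sync = rolling_change_correlation(years, red, blue, window=10)
print(round(sync[1815], 3))  # window 1815-1825: inside the artifact stretch
print(round(sync[1860], 3))  # window 1860-1870: independent noise only
```

In this toy setup the window inside the artifact stretch shows a correlation very close to 1, while windows outside it wander with no particular pattern. Applied to real Ngram series, a run of windows near 1 would flag years like the neighborhood of 1820 for closer inspection.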

Here is another aspect of the graph that is also suggestive of artifact: the long wave of rise and fall in the frequency of all the color words between 1810 and 1920. It is not impossible that "color" became more important in literary language and then declined; but that seems improbable. So this coordinated long-wave behavior of the color words seems more likely to be the effect of a database anomaly than a manifestation of a real trend.

Is there any reliable information in this graph? Yes. There is one feature of this graph that appears to have real significance, and that is the change in the behavior of "black" after 1960. Prior to that year the term behaves pretty much like all the other color words. After that year it takes off on a very different trajectory. And this abrupt and accelerating increase in the frequency of "black" seems to have everything to do with a real social and cultural change in the 1960s and forward -- the abrupt increase in those decades in the salience of race. There is a similar divergence between the behavior of "black" and all the other color words in the 1860s; the frequency of the word increases for a few years following the American Civil War.

More tantalizingly, it may be significant that over time "blue" rises from roughly the frequency of "yellow" to roughly the frequency of "green". This is one element of the graph where the terms are not correlated with each other; instead, "blue" changes its position relative to the other color frequencies.

This example shows that we need to be careful about the inferences we draw from the patterns that appear from Ngram searches. We need to always ask: "Does this pattern really correspond to a fact about underlying collective linguistic behavior, or is it the result of an artifact?" More fundamentally, we need to understand the sources of the artifacts we are able to detect -- spurious correlations, inexplicable long-wave changes in frequency, and others still to be discovered. And, finally, we should seek out techniques that can be applied to the results that serve to filter out the artifacts and focus on the real variations the data contain. We need some signal processing here to separate signal from noise. The Ngram tool is powerful, but we need to use it critically and intelligently.
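As one very simple candidate for such a filter, we could express each color word as a share of the whole family of color words in each year. If the artifact multiplies all the words by the same factor in a given year, that factor cancels out of the shares, and only relative movements remain. The sketch below illustrates this on synthetic data. The shared "wave", the post-1960 rise in "black", the baseline values, and the function names are all assumptions made for the example. I am not claiming that the real artifact is multiplicative or that this is how the Ngram data should be corrected.

```python
import math


def family_shares(series):
    """Convert each word's frequency into its share of the group's yearly total."""
    words = list(series)
    n_years = len(series[words[0]])
    totals = [sum(series[w][i] for w in words) for i in range(n_years)]
    return {w: [series[w][i] / totals[i] for i in range(n_years)] for w in words}


def synthetic_series(years):
    """Synthetic data: a shared long wave on every word, plus a real rise in
    "black" after 1960. None of these numbers come from the Ngram corpus."""
    baselines = {"red": 0.010, "green": 0.007, "yellow": 0.004,
                 "blue": 0.005, "black": 0.012}
    series = {word: [] for word in baselines}
    for year in years:
        # Shared multiplicative wave peaking near 1865 (assumed artifact).
        wave = 1.0 + 0.5 * math.exp(-((year - 1865) / 30.0) ** 2)
        for word, base in baselines.items():
            # Assumed real change: "black" grows 2% per year after 1960.
            real = 1.0 + 0.02 * (year - 1960) if word == "black" and year > 1960 else 1.0
            series[word].append(base * wave * real)
    return series


years = list(range(1800, 2001))
raw = synthetic_series(years)
shares = family_shares(raw)

for year in (1820, 1865, 1950, 2000):
    i = year - years[0]
    print(year, "raw red:", round(raw["red"][i], 5),
          "share red:", round(shares["red"][i], 3),
          "share black:", round(shares["black"][i], 3))
```

In this toy example the raw frequency of "red" rises and falls with the wave, but its share is flat until 1960, and the share of "black" climbs afterward. One caveat is built into the method: shares are relative, so a real rise in one word mechanically lowers the shares of the others, and a change that affected the whole family of color words equally would be invisible.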