Posted at 1:14 PM ET, 12/17/2010

Google parses the history of 500 billion words

By Melissa Bell


Word lovers, rejoice. Google, with the help of Harvard researchers, has created the "Books Ngram Viewer," a dream database for understanding the historical and cultural changes in 500 billion words. Scholars can download the data set culled from these words, but for the amateur student of lexicography, a simple online tool built by Google Labs lets people map changes in our vocabulary since 1500 A.D. For instance, "women" and "men" did not meet in use until 1982. Now "women" appears more often than "men." It's an endlessly entertaining game. And the study, which appeared in the journal Science, has already come up with some fascinating tidbits:

1. The digitized texts make up only 4 percent of all books ever printed. They digitized 5,195,769 books. And that's only 4 percent!? That means 129 million books have been printed? Holy cannoli.

2. Even though this is only 4 percent of all the books ever printed, "If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years."

3. I just read "The Professor and the Madman," about the monumental task of putting together the Oxford English Dictionary. Despite that Herculean effort (and 20 volumes of text), the computer scanning shows that the dictionary covers only about 25 percent of existing words.

4. Actors become famous at 30 years of age. Politicians at 50. Scientists at 60, and mathematicians never.
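For the curious, the figures in items 1 and 2 can be sanity-checked with a little arithmetic. The sketch below uses only numbers quoted above; the function names are made up for illustration, and the 365-day year of nonstop reading is an assumption, not something the study spells out.

```python
# Back-of-the-envelope check of items 1 and 2, using only figures quoted in the post.
# Function names are hypothetical; the 365-day year is an assumption.

DIGITIZED_BOOKS = 5_195_769        # books digitized, per item 1
PERCENT_OF_ALL_BOOKS = 4           # "only 4 percent of all books ever printed"

WORDS_PER_MINUTE = 200             # reading pace quoted in item 2
YEARS_OF_READING = 80              # "it would take eighty years"
MINUTES_PER_YEAR = 60 * 24 * 365   # nonstop reading, no breaks for food or sleep


def implied_total_books(digitized: int, percent: int) -> int:
    """Total books ever printed if `digitized` is `percent` percent of them."""
    return digitized * 100 // percent


def implied_words_read(words_per_minute: int, years: int) -> int:
    """Words read at a constant pace, around the clock, for `years` years."""
    return words_per_minute * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    total = implied_total_books(DIGITIZED_BOOKS, PERCENT_OF_ALL_BOOKS)
    words = implied_words_read(WORDS_PER_MINUTE, YEARS_OF_READING)
    print(f"Implied books ever printed: {total:,}")
    print(f"Implied words in the year-2000 entries: {words:,}")
```

Under these assumptions, the 4 percent figure implies just under 130 million books in total, and the eighty-year quote implies roughly 8.4 billion words for the year 2000 alone.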

Play around with the tool here.

Update: Love this! One of our readers, Peter Pappas, points out that everyone needs to go to the Ngram site and type in the words "never gonna give you up." That is all. Thanks, Peter!

Categories:  The Daily Catch  

Comments

Neat.

Posted by: ozpunk | December 17, 2010 3:07 PM

I've been trying to find a word or phrase that generates a graph that looks like a graph of the Dow since it started. So far the best I've found is "design" (1930-2008, English). I would be interested to know if other people find better matches.

Posted by: paulm9 | December 17, 2010 3:53 PM

Google's "Books Ngram Viewer" is another free online tool that allows us to visualize information in new ways. I explore how to use the tool in the classroom to help students better understand the research method in my blog post - "How To Quantify Culture? Explore 500 Billion Published Words With Google's Ngram Viewer" http://bit.ly/gcKJdp

PS - It includes a Rickrolling Easter Egg - Search for "never gonna give you up" and see what pops up!

Posted by: peterpappas | December 17, 2010 4:28 PM

Looking at the pretty charts in Culturomics and the new Google Books interface is nice. But of course there is much more to looking at cultural / language change than just using simple frequency charts of exact words and phrases.

The NEH-funded, 400 million word Corpus of Historical American English (freely available at http://corpus.byu.edu/coha) allows for a much wider range of searches. Besides frequency lists like Google Books (with essentially the same results), simple 2-3 second searches can find changes in word meaning and usage (e.g. gay, care, web; or what we're saying about any topic over time), grammatical changes, and it can find *all words* that are more frequent in one period than another (rather than one by one, as with Google Books), as well as much more.

More information at:
http://corpus.byu.edu/coha/compare-culturomics.asp

Posted by: CorpusProf | December 20, 2010 10:56 AM


© 2010 The Washington Post Company