Book-scraping

Via several of my Twitter contacts: The Times Online has developed Book Scraper, a literary text analysis tool with 126 books in its database so far, from the 16th through the early 20th centuries. You can look at word clouds and lists of unique and particularly long words for each text (check out the long word list for Ulysses); you can compare two texts and see how much vocabulary they share, accompanied by a Venn diagram; and you can search for individual words and see graphs of how often they appear and in which texts.
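For the curious, the two-text comparison and the long-word list can be roughed out in a few lines of Python. This is only a sketch of the general idea, not Book Scraper's actual method: the function names (tokenize, vocabulary_overlap, longest_words) are made up for illustration, and the tokenizer is a deliberately naive assumption that lowercases everything and keeps only plain a-z words.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out a-z words, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def vocabulary_overlap(text_a, text_b):
    """Split the two vocabularies into the three regions of a Venn diagram."""
    vocab_a = set(tokenize(text_a))
    vocab_b = set(tokenize(text_b))
    shared = vocab_a & vocab_b
    return {
        "only_a": vocab_a - shared,
        "shared": shared,
        "only_b": vocab_b - shared,
    }


def longest_words(text, n=5):
    """Return the n longest distinct words, longest first, ties broken alphabetically."""
    unique = set(tokenize(text))
    return sorted(unique, key=lambda w: (-len(w), w))[:n]
```

Calling vocabulary_overlap on the full text of two novels gives the sizes of the three Venn regions via len(); a real tool would presumably also have to decide what to do about hyphens, accented letters, and variant spellings, which this sketch ignores.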

It has its flaws. As one friend commented on Twitter: "More books! And where's the API?" And the text analysis isn't perfect: it's clear from the Shakespeare page that stage directions and older spellings affect the statistics somewhat. But I like the word graphs, even though they're a bit skewed by the relatively small sample of texts in the database. Look at the graph for "amiable": a smattering of early 17th-century uses, mostly Shakespeare; then a set of giant bubbles from early- to mid-19th-century novels, with the heaviest concentrations in Jane Austen. I once wrote a short paper for an undergraduate Austen class on the word "amiable" in Emma. The Book Scraper graph is an interesting confirmation of how often that word shows up, not only in Austen, but in her near contemporaries as well.
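A word graph like the one for "amiable" boils down to counting a word in each text and scaling by the text's length, so that a long novel doesn't dominate just by being long. Here is a minimal Python sketch of that calculation, under some loud assumptions: the corpus layout (a dict mapping title to a year and the full text), the function names, and the per-10,000-words scaling are all hypothetical choices for illustration, not anything Book Scraper documents.

```python
import re
from collections import Counter


def relative_frequency(word, text, per=10_000):
    """Occurrences of `word` per `per` words of `text` (0.0 for an empty text)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * per / len(tokens)


def frequency_by_text(word, corpus):
    """Build (year, title, rate) rows, sorted by year, ready to plot as bubbles.

    `corpus` is assumed to map each title to a (year, full_text) pair.
    """
    rows = [
        (year, title, relative_frequency(word, text))
        for title, (year, text) in corpus.items()
    ]
    return sorted(rows)
```

Because the counting is purely mechanical, anything left in the text file (stage directions, speaker labels, older spellings of the same word) feeds straight into the numbers, which is the kind of skew visible on the Shakespeare page.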

It would be nice if the text set were big enough to do some higher-end data mining. But it's intriguing to see this kind of thing being done for an audience outside of academia. I wonder if this is a sign of literary data mining going mainstream.
