When studying texts not only as containers of linguistic data but also as recorded forms (books, scrolls, manuscripts, webpages, etc.), the analytical importance of the material, paratextual, and contextual evidence embedded in and around the texts is highlighted. Metadata concerning publication, distribution, and reception (e.g. the author’s gender, publisher, book price, print run size, sales and review records) qualifies our historical understanding of literary texts and cultures.

There are two principal ways of addressing these issues with digital methods. First, you can work on the contextual metadata alone in order to capture large-scale trends of literary production as evidenced by external factors. This strictly quantitative approach, with roots in the enumerative aspirations of the sociology of literature and in the French book history inspired by the Annales School, is greatly facilitated by computers for archiving, counting, and visualizing data and trends. Second, you can incorporate the metadata into the data-mining analysis as features of the texts, adding social layers to the modelling of a literary phenomenon, space, or time span.
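The second strategy can be pictured as attaching metadata columns to features derived from the texts themselves. Here is a minimal sketch in Python; the book records, field names, and values are invented purely for illustration:

```python
# Sketch: combining a simple textual feature (title length in words)
# with contextual metadata (publisher, price) into one feature row per
# book, ready for further modelling. All sample data is invented.
books = [
    {"title": "The Waves", "publisher": "Hogarth Press", "price": 7.5},
    {"title": "To the Lighthouse", "publisher": "Hogarth Press", "price": 6.0},
]

def to_features(book):
    """Mix a feature of the text with contextual metadata features."""
    return {
        "title_length": len(book["title"].split()),  # textual feature
        "publisher": book["publisher"],              # contextual metadata
        "price": book["price"],                      # contextual metadata
    }

features = [to_features(b) for b in books]
```

From here, the mixed feature rows could feed into whatever modelling step the analysis calls for (clustering, classification, visualization, and so on).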



Extracting metadata in an automated way is not easy. However, the analysis of metadata with digital methods can also be approached in a more hands-on, manual fashion to begin with. One simple option is to create your own database by making a list of relevant features and manually compiling data from different sources into a spreadsheet, using e.g. Microsoft Excel or Google Sheets. For instance, WorldCat, the world’s most comprehensive online database of library content, can be used to collect a large set of metadata by hand. Another option is to search for nominations for the Nobel Prize in Literature in the Nomination Archive, a database of nominations to the Nobel Prizes created by the Nobel Foundation.
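If you prefer scripting over a spreadsheet application, the same hand-compiled list can be written to a CSV file that Excel, Google Sheets, or visualization tools will open directly. A minimal sketch; the records and field names are invented stand-ins for data you would transcribe from sources such as WorldCat:

```python
import csv

# Hypothetical records transcribed by hand; fields and values are
# invented for illustration.
records = [
    {"title": "Beloved", "author": "Toni Morrison", "year": 1987, "publisher": "Knopf"},
    {"title": "Blindness", "author": "José Saramago", "year": 1995, "publisher": "Caminho"},
]

# Write the records to a spreadsheet-friendly CSV file.
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "year", "publisher"])
    writer.writeheader()
    writer.writerows(records)
```

The resulting `metadata.csv` can then be extended by hand as you find new sources, or loaded back into Python for analysis.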

One advantage of collecting data by hand and adding it to a spreadsheet is that it allows you to tailor your database by combining information from a wide range of sources. A great way to explore your database and discover trends is to use one of the freely available web applications for visualizing spreadsheet data, such as RawGraphs by Density Design or Tableau Public. Explore the Tableau Public Gallery to get inspired by others’ work.


Even without analyzing the actual contents of novels, metadata can reveal cultural trends in literature. For instance, some findings suggest that book titles are getting longer. Build on this observation and dig deeper by adding other metadata features to your analysis. A useful way to computationally retrieve information and collect datasets is through an API (Application Programming Interface), an interface that enables interaction and data transmission with external software components and resources. Several APIs exist for book-related websites that might help you compose your own metadata dataset. Another way is to use an already compiled dataset, such as one from HathiTrust, which is updated regularly.
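The title-length observation is easy to test once you have metadata records with a year and a title. A minimal sketch, using invented sample records in place of a real dataset:

```python
from collections import defaultdict
from statistics import mean

# Sketch: are titles getting longer? Group title lengths (in words)
# by decade of publication. The sample records below are invented;
# a real analysis would draw on a compiled metadata dataset.
records = [
    (1952, "The Old Man and the Sea"),
    (1961, "Franny and Zooey"),
    (2007, "The Brief Wondrous Life of Oscar Wao"),
    (2013, "The Goldfinch"),
]

def mean_title_length_by_decade(records):
    """Return {decade: mean title length in words}."""
    by_decade = defaultdict(list)
    for year, title in records:
        by_decade[year // 10 * 10].append(len(title.split()))
    return {decade: mean(lengths) for decade, lengths in sorted(by_decade.items())}

lengths = mean_title_length_by_decade(records)
```

With a large dataset behind it, the same grouping would let you plot the trend per decade, or slice it further by genre, publisher, or other metadata features.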

Visualize and analyze your dataset in Python (see e.g. a GoodReads visualization) to explore patterns. With the New York Times Book API, find out how long, on average, a book stays on a bestseller list, as Scott Jones explored in this blog post. If you compare the bestseller listings with lists of liked books on GoodReads, how does their metadata differ? Compare the publication dates of books on different lists to explore how temporary success is. Yet another interesting approach is to explore the relation between ratings and book prices. Building a GoodReads scraper also allows you to retrieve prices from Amazon, which you can then use to test your own hypotheses about how prices relate to e.g. ratings, genre, or page count.
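The "how long does a book stay on the list" question reduces to counting a title's appearances across weekly snapshots of the list. A minimal sketch with invented snapshots; real data could come from an API such as the one mentioned above:

```python
from collections import Counter

# Sketch: approximate how long books stay on a bestseller list by
# counting appearances across weekly snapshots. The snapshots below
# are invented for illustration.
weekly_lists = [
    ["Book A", "Book B", "Book C"],  # week 1
    ["Book A", "Book C", "Book D"],  # week 2
    ["Book A", "Book D", "Book E"],  # week 3
]

# Number of weeks each title appears on the list.
weeks_on_list = Counter(title for week in weekly_lists for title in week)

# Mean stay across all titles that ever appeared.
average_weeks = sum(weeks_on_list.values()) / len(weeks_on_list)
```

Note that this counts total appearances, not consecutive weeks; distinguishing an unbroken run from a return to the list would need the week indices as well.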


Scripts and sites

