Books

Introduction

When texts are studied not only as containers of linguistic data but also as recorded forms (books, scrolls, manuscripts, webpages, etc.), the analytical importance of the material, paratextual, and contextual evidence embedded in and around them comes to the fore. Metadata on publication, distribution, and reception (e.g. author’s gender, publisher, book price, print run size, sales, and review records) qualify our historical understanding of literary texts and cultures.

There are two principal ways of addressing these issues with digital methods. First, you can work on the contextual metadata alone in order to capture large-scale trends of literary production as evidenced by external factors. This strictly quantitative approach, with roots in the French book history inspired by the Annales School, is greatly facilitated by computers for archiving, counting, and visualising data and trends. Secondly, you can incorporate the metadata into the data mining analysis as features of the texts, adding social layers to the modelling of a literary phenomenon, space, or time span.

Applications

Elementary

Extracting metadata automatically is not easy. However, you can just as well begin analysing metadata with a more hands-on, manual approach. One simple option is to create your own database: make a list of relevant features and manually compile data from different sources into a spreadsheet, using e.g. Microsoft Excel or Google Sheets. For instance, WorldCat, the world’s most comprehensive online database of library content, can be used to manually collect a large set of metadata. Another option is to search for nominations in Literature in the Nomination Archive, a database of nominations to the Nobel Prizes created by the Nobel Foundation.

One advantage of collecting data manually and adding it to a spreadsheet by hand is that you can tailor your database by combining information from a wide range of sources. A good way to explore your database and discover trends is to use one of the freely available web applications for visualising spreadsheet data, such as RAWGraphs by Density Design or Tableau Public. Explore the Tableau Public Gallery to get inspired by others’ work.
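As a complement to the visualisation tools above, a spreadsheet exported as CSV can also be summarised with a few lines of Python. The sketch below is only an illustration: the column names (`title`, `author_gender`, `publisher`, `year`) and the placeholder rows are assumptions, not part of any real dataset, and should be replaced by the features you have chosen for your own database.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for a spreadsheet exported as CSV.
# Column names and values are invented for illustration only.
SAMPLE_CSV = """title,author_gender,publisher,year
Title A,female,Publisher X,1891
Title B,male,Publisher X,1894
Title C,female,Publisher Y,1902
Title D,unknown,Publisher Y,1907
"""


def count_by_decade(csv_file, column):
    """Count the values of `column` within each decade of publication."""
    counts = {}
    for row in csv.DictReader(csv_file):
        decade = int(row["year"]) // 10 * 10
        counts.setdefault(decade, Counter())[row[column]] += 1
    return counts


by_decade = count_by_decade(io.StringIO(SAMPLE_CSV), "author_gender")
for decade in sorted(by_decade):
    print(decade, dict(by_decade[decade]))
```

To run the same count on your own export, pass an open file instead of the sample string, for example `open("books.csv", newline="", encoding="utf-8")`, where `books.csv` is a hypothetical file name.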

Advanced

Even without analysing the actual contents of novels, metadata can reveal cultural trends in literature. For instance, some findings suggest that titles are getting longer. Build on this observation and dig deeper by adding other metadata features to your analysis. A useful way to retrieve information and collect datasets computationally is through an API (Application Programming Interface), an interface that enables interaction and data transmission with external software components and resources. Several APIs exist for book-related websites and might help you compose your own metadata dataset. Another way is to use an already compiled dataset, such as the regularly updated one from HathiTrust.
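One way to test the title-length observation on your own data is to compute the mean number of words per title for each decade. The following is a minimal sketch using only the Python standard library; the records are invented placeholders that prove nothing by themselves, and the function name `mean_title_length_by_decade` is a hypothetical helper introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (title, publication year). Replace with your own data.
RECORDS = [
    ("Short Title", 1990),
    ("Another Short Title", 1995),
    ("A Considerably Longer Title With a Subtitle", 2010),
    ("An Even Longer Title: With a Subtitle and More", 2015),
]


def mean_title_length_by_decade(records):
    """Return the mean title length in words for each decade."""
    lengths = defaultdict(list)
    for title, year in records:
        # Whitespace-separated tokens are a rough proxy for words.
        lengths[year // 10 * 10].append(len(title.split()))
    return {decade: mean(values) for decade, values in sorted(lengths.items())}


trend = mean_title_length_by_decade(RECORDS)
for decade, length in trend.items():
    print(decade, length)
```

Counting whitespace-separated tokens is a deliberate simplification; counting characters, or treating subtitles separately, may suit some corpora better.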

Visualise and analyse your dataset in Python (see e.g. a Goodreads visualisation) to explore patterns in it. Using the New York Times Books API, examine how long a book stays on a bestseller list on average, as Scott Jones has explored in a blog post. If you compare the bestseller listings with lists of liked books on Goodreads, how do their metadata differ? Compare the publication dates of books in different lists to explore how short-lived their success is. Yet another interesting approach is to explore the relation between ratings and book prices. Building a Goodreads scraper also allows you to retrieve prices from Amazon, which you can then use to test your own hypotheses about the relation of prices to e.g. ratings, genre, or page count.
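The bestseller question can be approached roughly as follows. This sketch assumes that the New York Times Books API serves lists at `/svc/books/v3/lists/{date}/{list}.json`, that each book in `results.books` carries a `weeks_on_list` field, and that you have your own API key; check the current API documentation before relying on any of this. The function names are hypothetical, and the averaging is demonstrated offline on a hand-written payload in the assumed shape.

```python
import json
import urllib.parse
import urllib.request
from statistics import mean

# Assumed base URL of the NYT Books API list endpoint.
BASE_URL = "https://api.nytimes.com/svc/books/v3/lists"


def build_list_url(api_key, list_name="hardcover-fiction", date="current"):
    """Build the request URL for one bestseller list."""
    query = urllib.parse.urlencode({"api-key": api_key})
    return f"{BASE_URL}/{date}/{list_name}.json?{query}"


def fetch_list(api_key, list_name="hardcover-fiction", date="current"):
    """Download and decode one list (requires network access and a valid key)."""
    with urllib.request.urlopen(build_list_url(api_key, list_name, date)) as response:
        return json.load(response)


def average_weeks_on_list(payload):
    """Average the assumed `weeks_on_list` field over all books in a payload."""
    books = payload.get("results", {}).get("books", [])
    weeks = [book["weeks_on_list"] for book in books if "weeks_on_list" in book]
    return mean(weeks) if weeks else None


# Offline example: a hand-written payload with placeholder values.
sample_payload = {"results": {"books": [
    {"title": "Title A", "weeks_on_list": 2},
    {"title": "Title B", "weeks_on_list": 10},
]}}
average = average_weeks_on_list(sample_payload)
print(average)
```

With a real key, `average_weeks_on_list(fetch_list("YOUR_KEY"))` would give the figure for the current list. Note that a single snapshot only records the weeks accumulated so far, so estimating how long books stay on a list in total would require collecting lists across many dates.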

Resources

Scripts and sites

A Goodreads scraper script, repository on GitHub.
RAWGraphs, an open source data visualisation framework.
Tableau Public, a free platform for creating and sharing data visualisations.
A tutorial on analysing NYT bestseller books.
The New York Times Books API to retrieve book reviews and bestseller lists.
A list of book-related APIs.
A tutorial and analysis of book title length.
A GitHub repository for an analysis of a Goodreads dataset.

Articles

Algee-Hewitt, M., et al. (2016). Canon/Archive. Large-scale Dynamics in the Literary Field. Stanford Literary Lab Pamphlet 11. https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf
Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750-769. https://doi.org/10.1016/j.poetic.2013.08.005
Wang, X., Yucesoy, B., Varol, O., Eliassi-Rad, T., & Barabási, A. L. (2019). Success in books: predicting book sales before publication. EPJ Data Science, 8(1), 1-20. http://dx.doi.org/10.1140/epjds/s13688-019-0208-6

Revised 15.11.2022