Who decides what good literature is? Even if we trust the processes of canonization to keep the best works in collective memory to some degree, we also know that there are many standards, many biases, many individual idiosyncrasies that make the question of quality very complicated.

Computational methods have made it easier to study canonization and markets, and with online fora such as Goodreads, there are even sources for understanding what cohorts of readers think of certain works. Such metrics are not flawless, but they are also hard to ignore as indications of what has value in literary cultures. Data on translations, sales, valuations, and so on, may be useful for challenging fixed ideas about which works are regarded as canonical.

In many ways, this sociological approach to the study of quality is relatively straightforward, and presenting the impact of certain works and authors can be a valuable component of a larger study.

A more difficult question to tackle, with or without computers, is whether literary quality relates to intrinsic qualities of the writing. This is no trivial task, as many works considered masterpieces are difficult, if not impossible, to describe with a specific set of criteria. Among other institutions, the Stanford Literary Lab has done very interesting work on distinguishing canonical from noncanonical works, finding significant differences in, for example, the predictability of word usage. Others have applied sentiment analysis to narratives and found a correlation between the general valuation of works and the predictability of the narrative, with indications that there is a “Goldilocks zone,” where the narrative is neither too chaotic nor too predictable.

So, although modelling literary quality is complex, much can still be accomplished: such models help us understand larger corpora of texts, and the very attempt to build them can deepen our understanding of the attraction of literature.



Compare how different websites rank literature – both in terms of ratings and sales. Amazon, for example, has an accessible and constantly updated list that ranks how well books sell. While book sales can hardly be said to be the be-all and end-all measure of literary quality, they are certainly telling of a book’s popularity and a reminder that literature, amongst other things, is also a marketable product. Which qualities, for example, does George Orwell’s Nineteen Eighty-Four possess that catapulted it to the top of Amazon’s book sales in 2017, and what does that say about the preferences of Amazon’s book-buying customers?

Another source is Goodreads, a social media site dedicated to the reading, reviewing, and discussion of books, which means that user reviews are available for a vast selection of literature, especially literature available in English. Information from Goodreads reviews can be gathered automatically via the Goodreads API and afterwards analysed using computational approaches. However, scraping web data is no easy task and can quickly become complicated. A simpler alternative is to build your own database by manually compiling data from different sources into a spreadsheet, using e.g. Microsoft Excel or Google Sheets. The data in your spreadsheet can then be visualised with freely available web applications that support many kinds of visualisation of spreadsheet data, such as RawGraphs by Density Design.
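As a minimal sketch of the spreadsheet route: the compiled data can be saved as a CSV file (the exchange format that spreadsheet tools and RawGraphs can import) and summarised with a few lines of Python. The titles, column names, and figures below are illustrative assumptions, not real Goodreads data.

```python
import csv
import statistics

# Hypothetical, manually compiled Goodreads data; titles, column names,
# and numbers are made up for this sketch.
books = [
    {"title": "Novel A", "avg_rating": 4.47, "ratings_count": 6_600_000},
    {"title": "Novel B", "avg_rating": 3.82, "ratings_count": 120_000},
    {"title": "Novel C", "avg_rating": 4.10, "ratings_count": 45_000},
]

# Save the table as CSV, the format most spreadsheet and visualisation
# tools (e.g. RawGraphs) can import directly.
with open("goodreads_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "avg_rating", "ratings_count"])
    writer.writeheader()
    writer.writerows(books)

# Read the file back and compute a simple summary statistic.
with open("goodreads_sample.csv", newline="", encoding="utf-8") as f:
    ratings = [float(row["avg_rating"]) for row in csv.DictReader(f)]

print(round(statistics.mean(ratings), 2))  # mean rating of the sample
```

The same CSV can then be dragged into RawGraphs or opened in a spreadsheet for charting.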

Although user reviews from Goodreads should not simply be equated with literary quality either, they can be seen as one of many ways to measure and discuss it. Which qualities, for example, have led 6.6 million users to give Harry Potter and the Sorcerer’s Stone an average rating of 4.47/5.00, and what does it say about the renowned scholar and critic Harold Bloom’s understanding of literary quality that he famously wrote a strongly negative review of precisely that book in the Wall Street Journal?

See Mads Rosendahl Thomsen’s “From Data to Actual Context” for an elaborated discussion of the relationship between circulation and quality and what role websites like Goodreads and Amazon might play in such debates.


The computational study of literary quality is relatively new, and research in it is very exploratory. As algorithms do not have any prior knowledge or assumptions about literature and what “good” literature should be like, they have the potential to teach us humans something new about the characteristics of literary quality.

Pioneering work in the field has uncovered some interesting properties of literary quality. Comparisons of prominent and obscure works (see Underwood, 2016, and Algee-Hewitt et al., 2016) suggest that algorithms can successfully classify canonical works. The notion of quality is therefore not fully arbitrary, and literary language seems to have some intrinsic features that can be approximated through e.g. sentence length, vocabulary richness, and compressibility.
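The three features just named can be approximated with standard-library Python alone. The sketch below computes average sentence length, the type-token ratio as a simple proxy for vocabulary richness, and a zlib-based compressibility ratio; the sample sentence and the exact operationalisations are illustrative choices, not taken from the studies cited.

```python
import re
import zlib

def text_features(text: str) -> dict:
    """Rough approximations of three stylistic features:
    average sentence length (words per sentence),
    vocabulary richness (type-token ratio),
    and compressibility (compressed size / raw size)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\w']+", text.lower())
    raw = text.encode("utf-8")
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "compressibility": len(zlib.compress(raw)) / max(len(raw), 1),
    }

sample = ("It was a bright cold day in April, and the clocks were "
          "striking thirteen. Winston Smith hurried home.")
feats = text_features(sample)
for name, value in feats.items():
    print(f"{name}: {value:.2f}")
```

On longer texts the type-token ratio falls and the compressibility ratio drops well below 1, so these measures only become comparable across works of similar length.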

To confirm the relevance of the chosen features to our understanding of literary quality, the machine-produced measures need to be evaluated against human ratings as a “gold standard”. Such ratings of quality can come from several sources: literary critics, references in academic discussion, or crowd-sourced reader judgments, for instance Goodreads ratings or surveys.
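As a sketch of such an evaluation, a Spearman rank correlation can quantify how closely a machine-produced measure tracks the human gold standard. The model scores and reader ratings below are entirely made-up numbers for illustration.

```python
def rankdata(values):
    """Assign ranks (1 = smallest); tied values get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: a model's quality estimate vs. average reader ratings.
model_scores = [0.91, 0.40, 0.75, 0.15, 0.60]
reader_ratings = [4.5, 3.1, 4.0, 2.8, 3.6]
print(round(spearman(model_scores, reader_ratings), 2))
```

A rank correlation is preferable to a raw Pearson correlation here because human ratings and model scores live on different, arbitrary scales; only the ordering of works is compared.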

Explore ‘literariness’ as a textual feature from a data-driven approach by replicating an analysis by van Cranenburgh and Bod (2017). Run an example with a toy dataset from Project Gutenberg, using van Cranenburgh’s original code from their GitHub repository. The model predicts the number of downloads based on extracted text features. Then collect your own corpus of texts and combine it with a quality score, for instance Goodreads ratings, to see how the model performs on a new dataset. For a simpler approach, use the functions oppose() and classify() of the stylo package in R on a corpus of texts classified as “high-literary” and “low-literary” documents. Does the report reveal any interesting insights into the syntactic or lexical features of your corpus?
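Before turning to the original code, the core modelling idea can be sketched in a few lines: fit a least-squares line that predicts a popularity proxy (here, hypothetical download counts) from a single extracted text feature. All feature values and download counts below are invented for illustration; a real replication would use many features and a proper regression library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical corpus: one feature value and one popularity proxy per text.
type_token_ratio = [0.42, 0.55, 0.61, 0.48, 0.70]   # invented feature values
downloads        = [1200, 2100, 2600, 1700, 3200]   # invented download counts

slope, intercept = fit_line(type_token_ratio, downloads)
prediction = intercept + slope * 0.50  # predicted downloads for a new text
print(round(prediction))
```

With a corpus of your own, the downloads column could be swapped for Goodreads ratings, and the fitted model's errors on held-out texts would indicate how much of "quality" the chosen features actually capture.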


Scripts and sites