Textual Analytics and Corpus Linguistics: Beyond the Hypothesis

One of my modules this semester is looking at “computer-assisted approaches to text analysis”. Or at least that is what the module overview says. In reality, we are looking at research questions in the humanities and how digital methods for text analysis and corpus linguistics can help us answer them.

From the first two seminars, the idea that has stuck with me is that most humanities research does not go beyond forming a hypothesis. If we take a scientific approach (as digital humanities borrows a lot from computer science, this isn’t difficult), we can read a history or English research paper as an exploratory discussion of the topic. However, the conclusion often does not go beyond what we have discovered through the course of this discussion.

Why is textual analytics different? The main reason, and the one I am interested in, is that we can look at a larger sample of data by using computers. We can compare different elements of multiple texts, allowing us to understand them quantitatively as well as qualitatively.

For example, my chosen thesis topic is on how the presentation of Sir Gawain changes in Arthurian literature and film. I could just use a close reading approach, but this would be largely reliant on my own interpretation of the texts and on small samples of data. In other words, it would be qualitative research. If I use digital methods, I can compare the frequencies of adjectives related to Sir Gawain within each text. This will allow me to demonstrate how Sir Gawain’s identity has changed through time, making my research quantitative.
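To make the adjective comparison a little more concrete, here is a rough sketch of how such a count might look in Python. It is only an illustration under several assumptions: the adjective list, the sample sentences and the function name `adjectives_near` are all made up for the example, and a real study would use a part-of-speech tagger on the full texts rather than a hand-picked word list.

```python
import re
from collections import Counter

# Hypothetical adjective list, chosen only for this illustration.
ADJECTIVES = {"courteous", "noble", "brave", "false", "gentle"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def adjectives_near(text, name="gawain", window=3):
    """Count listed adjectives within `window` tokens of each mention of
    `name`, then divide by the number of mentions so that texts of
    different lengths can be compared."""
    tokens = tokenize(text)
    counts = Counter()
    mentions = 0
    for i, tok in enumerate(tokens):
        if tok == name:
            mentions += 1
            nearby = tokens[max(0, i - window): i + window + 1]
            counts.update(t for t in nearby if t in ADJECTIVES)
    if mentions == 0:
        return {}
    return {adj: n / mentions for adj, n in counts.items()}

# Invented sample sentences standing in for two real texts.
texts = {
    "text_a": "The courteous Gawain rode out. Noble Gawain was courteous to all.",
    "text_b": "Gawain the false knight fled. Brave Gawain returned.",
}

for title, text in texts.items():
    print(title, adjectives_near(text))
```

One caveat with this simple approach: when two mentions sit close together, their windows overlap and the same adjective is counted for both, so the figures are a rough signal rather than an exact measure.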

Why add this quantitative element to humanities research? By adding an analytical aspect to research, humanities researchers can prove their hypothesis instead of relying on their own intuition and interpretation.

DH Project Conclusion

I have handed in all four essays from last semester! So, having abandoned this blog for a few weeks, I’m now back to sharing my thoughts on the world of digital humanities once a week.

It also means that I reached a conclusion on my 1 Chronicles project for Introduction to Digital Humanities. After a lot of frustration, I ended up focusing on messy data and whether specific chapters in 1 Chronicles could be considered “messy”. It was not an easy topic by any means, but I did reach a conclusion.


As far as the genealogy lists in 1 Chronicles 1-9 are concerned, whether the data is messy or not depends on the methods and tools being used. Though it seems obvious now, it really didn’t feel obvious at the time.

It’s not a groundbreaking conclusion but it introduced me to the trials and tribulations of digital textual analysis. It also showed me the fun and interesting side (yes, I said that) to digital humanities. I’m even looking forward to the textual analysis module next term!

DH Project: Messy Data with Stylistic Impact?


Rolling Delta comparing 2 Chronicles with the whole of 1 Chronicles, 1 Chronicles 1-9 (genealogies), and 1 Chronicles 10-29 (narrative).


Dendrogram showing how the six books I have been analysing cluster according to most frequent words used. 

I have finally decided on a question for my project on 1 Chronicles. I’m still not sure how I felt about beginning my research without a question to shape it but that is a discussion for another day.

I have decided to look at whether the first nine chapters of 1 Chronicles, the genealogical lists, should be considered messy data. The reason for this is that my rolling deltas and dendrograms have suggested different interpretations. As you can see from the dendrogram, removing the genealogies from the narrative in 1 Chronicles makes little difference to the most frequent word count. In fact, going on the dendrogram alone, you could challenge the traditional assumption that the author of 1 and 2 Chronicles was the same person.
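For anyone curious what sits underneath a dendrogram like this, the sketch below shows the general idea in Python: build a profile of relative frequencies for the most frequent words, then measure how far apart each pair of texts is. It is a simplified illustration, not necessarily what the tool behind my plots does. The sample sentences, the function names and the choice of Manhattan distance are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def mfw_profiles(texts, n_words=5):
    """Return the n most frequent words across all texts, plus each
    text's relative frequency for those words."""
    tokens = {name: tokenize(t) for name, t in texts.items()}
    overall = Counter(tok for toks in tokens.values() for tok in toks)
    vocab = [word for word, _ in overall.most_common(n_words)]
    profiles = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        profiles[name] = [counts[w] / len(toks) for w in vocab]
    return vocab, profiles

def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Invented toy sentences, not real biblical text.
texts = {
    "narrative_1": "and the king went up and the people went with the king",
    "narrative_2": "and the king came down and the people came with the king",
    "list": "the sons of the sons of the sons of the house",
}

vocab, profiles = mfw_profiles(texts)
distances = {
    (a, b): manhattan(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
closest = min(distances, key=distances.get)
print("Most frequent words:", vocab)
print("Closest pair:", closest)
```

A hierarchical clustering routine would take a distance table like `distances` and repeatedly merge the closest texts, which is what produces the branching shape of a dendrogram.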

However, the rolling delta suggests that the genealogies do have an impact on the stylistic similarities of 1 Chronicles with 2 Chronicles. Where the genealogies are removed, the narrative’s similarity to 2 Chronicles increases to the extent that there are sections where the style is almost identical.
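The rolling version of this idea can be sketched in a similar way. The code below is a simplified take on Burrows's Delta applied to a sliding window. It assumes that word frequencies are standardised against the reference texts and that Delta is the mean absolute difference of those standardised scores. The function names, window sizes and sample sentences are invented for the example, and the tool that produced my plot may differ in its details.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rel_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def rolling_delta(target, references, n_words=5, window=8, step=4):
    """Slide a window over `target` and compute a Delta score between
    each window and every reference text. Lower means more similar."""
    ref_tokens = {name: tokenize(t) for name, t in references.items()}
    overall = Counter(tok for toks in ref_tokens.values() for tok in toks)
    vocab = [word for word, _ in overall.most_common(n_words)]
    ref_freqs = {n: rel_freqs(t, vocab) for n, t in ref_tokens.items()}

    # Mean and spread of each word's frequency across the references.
    columns = list(zip(*ref_freqs.values()))
    means = [mean(col) for col in columns]
    # A zero spread is replaced by 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]

    def z_scores(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, means, stds)]

    ref_z = {name: z_scores(f) for name, f in ref_freqs.items()}
    tokens = tokenize(target)
    if not tokens:
        return []
    results = []
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        win_z = z_scores(rel_freqs(tokens[start:start + window], vocab))
        deltas = {
            name: mean(abs(a - b) for a, b in zip(win_z, rz))
            for name, rz in ref_z.items()
        }
        results.append((start, deltas))
    return results

# Invented toy sentences, not real biblical text.
references = {
    "ref_a": "the king and the people of the land",
    "ref_b": "sons of the sons of the house of the king",
}
target = "the sons of the king and the people of the land went up to the house"

for start, deltas in rolling_delta(target, references):
    print(start, deltas)
```

Plotting each reference's score against the window position gives a line per reference text. A stretch where one line dips towards zero is a stretch where the target's style sits closest to that reference.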

It is the difference between these two methods that has led me to question whether these lists do have a relevant impact on 1 Chronicles’ style. And if they do have an impact, does this make them messy data?