Corpus research in linguistics and beyond

The Language of Immigration in the Victorian Press: A Historian's Perspective on Corpus Linguistics

Ruth Byrne (Lancaster University)
31 May (5.00pm – 6.30pm)
Room G/8 in the Waterloo Bridge Wing (ground floor) - Waterloo Campus

Abstract: Trends towards digitisation have generated a wealth of resources for historians. However, this abundance can very swiftly become overwhelming. A search for the term ‘alien’ or ‘refugee’ in a digitised newspaper can produce hundreds of thousands of hits. The prospect of having to click on these results one at a time, opening individual articles which then require reading in detail, is enough to dishearten even the most patient and dedicated researcher. One means of approaching this historical ‘big data’ is corpus linguistics.

This paper will introduce my research, which uses the British Library’s Nineteenth Century Newspaper Collection to explore the language surrounding immigration in the Victorian press. It will do so from the perspective of a historian who has newly embraced a corpus linguistic approach. The paper will also outline some of the methodological implications which resulted from the shift between disciplines. The re-contextualisation of historical sources as digitised corpus data raises questions about how integral a source’s materiality is to our interpretation of it. However, the shift in form also opens up exciting new research directions, some of which would be unimaginable through manual reading of the texts.

Series overview

Broadly defined, corpus linguistics is concerned with the study of fairly large collections of electronically available written and/or spoken texts (‘corpora’; the singular form is ‘corpus’) using a range of software (e.g. WordSmith Tools, Sketch Engine, AntConc). Researchers use this software to gain insights into the linguistic features of these texts which manual analysis alone would not allow. For example, it is possible to identify words that are unusually frequent in a corpus compared with a reference corpus (called ‘keywords’). In recent years, corpus linguistics has appealed to researchers from a wide range of disciplinary backgrounds.
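The notion of ‘keywords’ can be illustrated with a minimal Python sketch. It assumes a log-likelihood keyness score, which is one commonly used statistic and not necessarily the exact calculation performed by any of the tools named above. The function names and the two toy texts are invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study_text, reference_text, top_n=5):
    """Rank words that are relatively more frequent in the study text."""
    study = Counter(tokenize(study_text))
    reference = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        # Keep only words over-represented in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy texts, invented for illustration (hypothetical data).
study_text = "the refugee arrived and the refugee stayed near the port"
reference_text = "the ship arrived and the crew stayed near the port"
result = keywords(study_text, reference_text)
print(result)
```

With real corpora the same comparison would be run over thousands or millions of tokens, and a frequency threshold would usually be applied before ranking.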

Alongside its original application in lexicography and language description, corpus linguistics is now used to inform various linguistic and non-linguistic areas of research. Many linguists (e.g. cognitive and critical linguists) look to the large amounts of data afforded by corpora as a way of empirically validating and extending existing theory. Concerning the interdisciplinary applications of corpus linguistics, we find contexts as diverse as the language of psychopaths (Hancock et al. 2013), foreign land acquisition (Castañeda 2015) and political science (Beigman Klebanov et al. 2008), to list but a few. There are also applications directly relevant to practitioners working in areas such as language teaching and public health communication. For instance, the analysis of language use in specific contexts, e.g. business communication and apprentice and expert academic writing, has had considerable utility for English language teaching (Scott and Tribble 2006).

This seminar series will address a wide range of applications of corpus linguistics, including:

  • Corpus linguistic research exploring questions generated in other linguistic sub-disciplines, e.g. cognitive linguistics, systemic functional linguistics, sociolinguistics and critical discourse analysis
  • The use of corpus linguistic methods and tools to answer questions in various disciplines in the social sciences and humanities (e.g. anthropology, literature, criminology and health science)
  • Potential synergies between corpus linguistic methods and methods from other linguistic and non-linguistic sub-disciplines, e.g. corpus-assisted discourse analysis
  • Corpus-based language descriptions with a clear ‘applied’ dimension focussed on implementing changes at the level of practice and policy, e.g. public health communication, forensic linguistics and language pedagogy

The seminars will explore challenges, issues and opportunities faced by research that extends the scope of corpus linguistics. Guest speakers will present original empirical research and we will discuss key methodological issues. 


Beigman Klebanov, B., Diermeier, D., and Beigman, E. (2008). Automatic annotation of semantic fields for political science research. Journal of Information Technology & Politics 5(1):95-120.

Castañeda, R. R. (2015). Land Acquisition and the Semantic Context of Land within the Normative Construction of "Modern Development". In E. Osabuohien (Ed.), Handbook of Research on In-Country Determinants and Implications of Foreign Land Acquisitions (pp. 63-82). Hershey, PA: Business Science.

Hancock, J. T., Woodworth, M. T., and Porter, S. (2013). Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology 18(1):102-114.

Scott, M., and Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education (Vol. 22). John Benjamins Publishing.

Rigour, Relevance, Reflection: CL Methodology Through a Critical Lens

Professor Gerlinde Mautner
Wednesday 8 February at 5pm

Abstract: It’s been more than 20 years since corpus linguistics first came to be harnessed to projects in Critical Discourse Analysis (CDA). While the combination of CDA and CL has certainly been fruitful, many epistemological issues remain. The explanatory power of corpus-based evidence is still a source of concern, and the combined approach remains virtually unknown in discourse studies outside linguistics. Furthermore, in spite of the field’s long tradition, many researchers embarking on such projects for the first time struggle to get to grips with key questions of research design. They often blame themselves for this, when in fact they have simply come up against difficulties that lie in the nature of the research process and are therefore shared by more senior colleagues.

The aim of this paper is to critically examine issues such as the following:

  • Managing the fit between research questions, data and method
  • Dealing with the relationship between quantitative and qualitative evidence
  • Safeguarding against pitfalls in interpretation
  • Assessing the strength of empirical claims
  • Not letting analytical tools control the research process

It will be argued further that whether such problems are dealt with satisfactorily (rather than 'solved') depends on four critical success factors: knowledge, intuition, trial and error, and experience.

Posthumanism and Deconstructive Arguments: Corpora and Digitally-Driven Critical Analysis

Kieran O'Halloran
9 November 5pm-6:30pm

Experts or natives? Using corpus methods in a critical appraisal of the notion of ELF/A
Chris Tribble (King’s College London)

18 November 1.00pm-2.30pm

Dr Chris Tribble presented preliminary results from an investigation of current literacy practices in academic communication, leading into a broader discussion about corpus development and analysis. Abstract (PDF) 

Making corpus linguistics work for you: A case study in using corpus linguistic tools in applied research
Zsófia Demjén (The Open University) and Elena Semino (Lancaster University)

11 March 1.10pm-2.40pm

Professor Elena Semino and Dr Zsófia Demjén gave a talk on how corpus linguistic tools can be applied to identify metaphors in the context of medical communication. The data consisted of interviews with and online forum contributions by people with progressive cancer, family carers and healthcare professionals. The findings had implications for patient groups, charities and healthcare professionals. Abstract (PDF)

‘You shall know a word by the company it keeps’: Applying collocation analysis to investigate the relationship between language and gender

Charlotte Taylor (University of Sussex)
3 June 2016, 1pm-2.30pm

In this third paper in the ‘Corpus research in linguistics and beyond’ series, I looked at what we can do with a corpus through the use of two case-studies, both of which examined aspects of the relationship between language and gender. The first case-study compared the way the terms ‘girl’ and ‘boy’ are used in British newspaper discourse and the second investigated whether labels such as ‘sarcastic’ are actually gendered terms. Through discussion of these case-studies, the aim was to show a range of approaches and tools that non-linguists might use to answer research questions in their disciplines. Both case-studies continued the theme of the previous two talks in this series by examining multi-method approaches. In this case, corpus methods were combined with discourse analysis and pragmatics, and the corpus data was also used in survey construction to elicit further data.

Both case-studies also made use of the notion of collocation, a key concept in corpus linguistic work. Collocation refers to the tendency of certain words or phrases to occur together with other words and phrases. Or, to use Firth’s famous definition, the idea that ‘You shall know a word by the company it keeps’. In the case-studies, I looked at how we can investigate collocation using different tools, which give us different ‘ways in’ to our data, and how we can interpret what collocation can tell us about the topic we are investigating.
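As a rough illustration of collocation, the following Python sketch counts the words that occur within a fixed window around a node word and scores a node-collocate pair with a mutual information (MI) measure. This is a minimal sketch, not the method used in the talk: the window size, the MI formulation (one common variant among several) and the toy sentence are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count words within `window` tokens either side of each `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


def mutual_information(tokens, node, collocate, window=2):
    """MI = log2(observed / expected) co-occurrence within the window."""
    freq = Counter(tokens)
    observed = collocates(tokens, node, window)[collocate]
    if observed == 0:
        return float("-inf")
    # Expected co-occurrences if the two words were independent.
    expected = freq[node] * freq[collocate] * (2 * window) / len(tokens)
    return math.log2(observed / expected)


# Toy sentence, invented for illustration (hypothetical data).
text = ("the little girl smiled and the little girl laughed "
        "but the tall boy ran")
tokens = tokenize(text)
girl_collocates = collocates(tokens, "girl")
print(girl_collocates.most_common(3))
print(mutual_information(tokens, "girl", "little"))
```

In this toy sentence ‘little’ scores higher than ‘the’ as a collocate of ‘girl’, because ‘the’ is frequent everywhere while ‘little’ appears only next to ‘girl’. Dedicated corpus tools offer several such association measures, and the choice of measure and window affects which collocates surface.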



