Corpus research in linguistics and beyond
Broadly defined, corpus linguistics is concerned with the study of fairly large collections of electronically available written and/or spoken texts (‘corpora’; the singular form is ‘corpus’) using a range of software (e.g. WordSmith Tools, Sketch Engine, AntConc). Researchers use this software to gain insights into the linguistic features of these texts that manual analysis alone would not allow. For example, it is possible to identify words that are unusually frequent in a corpus when compared with another corpus (called ‘keywords’). In recent years, corpus linguistics has appealed to researchers from a wide range of disciplinary backgrounds.
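To give a flavour of what a keyword calculation involves, here is a minimal Python sketch. It assumes that keyness is measured with the log-likelihood statistic and that each corpus is already a list of lower-cased word tokens; the function names and the two toy corpora are made up for illustration, and the tools named above offer their own statistics and settings.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for one word.

    a, b: frequency of the word in the study and reference corpus.
    c, d: total number of tokens in the study and reference corpus.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n_study, n_ref = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only words over-represented in the study corpus.
        if a / n_study > b / n_ref:
            scored.append((word, log_likelihood(a, b, n_study, n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy corpora (invented); real corpora would contain many thousands of words.
study = "the nurse was lovely and the doctor was lovely".split()
reference = "the train was late and the bus was late".split()
print(keywords(study, reference))
```

Words such as ‘the’ and ‘was’ occur equally often in both toy corpora and so are not returned, whereas ‘lovely’ is ranked first because it appears only in the study corpus.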
Alongside its original application in lexicography and language description, corpus linguistics is now used to inform various linguistic and non-linguistic areas of research. Many linguists (e.g. cognitive and critical linguists) look to the large amounts of data afforded by corpora as a way of empirically validating and extending existing theory. Concerning the interdisciplinary applications of corpus linguistics, we find contexts as diverse as the language of psychopaths (Hancock et al. 2013), foreign land acquisition (Castañeda 2015) and political science (Beigman Klebanov et al. 2008), to list but a few. There are also applications directly relevant to practitioners working in areas such as language teaching and public health communication. For instance, the analysis of language use in specific contexts, e.g. business communication and apprentice and expert academic writing, has had considerable utility for English language teaching (Scott and Tribble 2006).
This seminar series will address a wide range of applications of corpus linguistics, including:
- Corpus linguistic research exploring questions generated in other linguistic sub-disciplines, e.g. cognitive linguistics, systemic functional linguistics, sociolinguistics and critical discourse analysis
- The use of corpus linguistic methods and tools to answer questions in various disciplines in the social sciences and humanities (e.g. anthropology, literature, criminology and health science)
- Potential synergies between corpus linguistic methods and methods from other linguistic and non-linguistic sub-disciplines, e.g. corpus-assisted discourse analysis
- Corpus-based language descriptions with a clear ‘applied’ dimension focussed on implementing changes at the level of practice and policy, e.g. public health communication, forensic linguistics and language pedagogy
The seminars will explore challenges, issues and opportunities faced by research that extends the scope of corpus linguistics. Guest speakers will present original empirical research and we will discuss key methodological issues.
Beigman Klebanov, B., Diermeier, D., and Beigman, E. (2008). Automatic annotation of semantic fields for political science research. Journal of Information Technology & Politics 5(1):95-120.
Castañeda, R. R. (2015). Land Acquisition and the Semantic Context of Land within the Normative Construction of "Modern Development". In E. Osabuohien (Ed.), Handbook of Research on In-Country Determinants and Implications of Foreign Land Acquisitions (pp. 63-82). Hershey, PA: Business Science.
Hancock, J. T., Woodworth, M. T., and Porter, S. (2013). Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology 18(1):102-114.
Scott, M., and Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education. Vol. 22. John Benjamins Publishing.
Doing cyber-trust outside the law: A linguistic approach
Speaker(s): Professor Nuria Lorenzo-Dus and Doctor Matteo Di Cristofaro
Date and Time: 23 October (5-6:30pm)
Room: Waterloo Bridge Wing LG/11
In this presentation we will discuss a research programme that uses Corpus Assisted Discourse Studies (CADS) as an interdisciplinary methodology in order to explore how individuals and groups generate trust in digital environments that operate extra-judicially. Two case studies are selected to this end: selling drugs on the Dark Net and sexually grooming children online. Whilst clearly different in terms of the illegal activities performed, both contexts centrally involve efforts to generate trust discursively. In crypto-drug markets, vendors seek to enhance their reputation within a highly competitive environment by, for instance, offering advice about avoiding being scammed by other users / providers (Lorenzo-Dus and Di Cristofaro 2018). Similarly, sexual groomers of children invest considerable discursive effort in projecting self-identities as trustworthy adults (for they do not necessarily pretend to be minors), who ‘genuinely care’ about the children they prey on (Lorenzo-Dus et al. 2016; Chiang and Grant 2018). In addition to presenting the key results of our case studies, our presentation will reflect upon the challenges and opportunities of integrating, on the one hand, methods in CADS with those used in other disciplines (specifically, public policy and machine learning) and, on the other, academic results with stakeholder needs.
If you wish to attend this event, please RSVP to firstname.lastname@example.org
Seminar 2 - Professor Paul Baker (Lancaster University) - Title tbc - Tuesday 12 February (5.00-6.30 pm room tbc)
Seminar 3 - Insa Nolte (University of Birmingham) and Clyde Ancarno (King’s College London) - Corpus linguistics, anthropology and 'big' data (working title) - Tuesday 14 May (5.00-6.30 pm room tbc)
After the seminars we tend to go to a venue nearby to have a drink and/or nibbles. Feel free to join us of course.
Using the data-driven learning approach to facilitate research writing
Speaker: Professor John Flowerdew
Date and Time: 6 June (5-6.30pm)
Abstract: In this talk, I will describe some of the work I have been doing in the last few years in the field of data-driven learning, that is, the use of corpus linguistics techniques to enhance language learning and use. In particular, I will focus on the application of data-driven learning to research writing. Doctoral (and even some Master's) students worldwide are coming under pressure to publish internationally, but they may face linguistic difficulties in getting their research published. Data-driven learning can offer support.
I will begin by briefly reviewing some of the literature on corpus-based approaches to language teaching and learning and will then describe a small-scale and a large-scale project I have been involved in, during which half-day workshops have been delivered to over 500 PhD students from a great variety of disciplines. I will conclude by arguing that the data-driven learning approach can be an effective way to help postgraduate students learn to write for publication purposes independently.
The lovely, the rude and the utterly shambolic: Exploring patient experiences in a corpus of NHS feedback
Speaker: Dr Gavin Brookes (University of Nottingham)
Date and Time: 28 March (5.00-6.30pm)
Abstract: The National Health Service gathers a great deal of user feedback on its services from patients. Much of this exists in “free text” format and so represents a rich dataset. However, the amount of text generated in the thousands of feedback forms patients fill in each year makes it unfeasible to undertake a close qualitative analysis of all of it. This talk will present findings from a recent ESRC-funded project which used corpus linguistic techniques to study a 29 million-word collection of such patient feedback. The aim of the project was to help the NHS to better understand and interpret the results of its feedback so that it can maintain and improve service standards in the future. Some of the issues considered in this talk include: identifying key areas of positive and negative feedback, distinguishing those concerns that are genuinely ‘urgent’ from those that are merely frequent, comparing how different health care organisations and staff members are evaluated, and exploring how feedback differs according to patients’ demographic backgrounds. By answering these and other questions, this talk will demonstrate the strengths and pitfalls of applying corpus linguistic methods to the analysis of this type of large body of feedback data, which include navigating the challenge of generating findings that are academically robust but also of practical, applied value to health care stakeholders.
Young people’s understandings of climate change studied through corpus and metaphor analytic techniques
Speakers: Dr Indira Banner and Professor Alice Deignan
Date and Time: 6th December (4.30pm-6pm)
Room: LG/11 in the Waterloo Bridge Wing (lower ground floor) Waterloo Campus
Abstract: Climate change is likely to have a great impact on the future lives of young people, yet research into science education suggests that the topic is not well understood by school students (Schreiner et al., 2008). As educators, we are interested in how metaphor in the texts accessed by young people supports their understanding of climate change, and in this talk we report on a study which used corpus linguistic methods to explore this.
We built a corpus of around 500,000 words of academic and policy documents on the topic of climate change (Academic Corpus). We also conducted focus group interviews with around 200 young people in secondary schools in Northern England, firstly asking how they find information about climate change. We built a corpus of around 200,000 words consisting of the texts they described (Materials Corpus). We also asked the young people various questions to probe their understanding of climate change; the discussions were transcribed to build a third corpus, of around 90,000 words (Interviews Corpus). In a series of studies, we compared the three corpora using Sketch Engine. In this talk we discuss how the metaphors used to talk about climate change in the Interviews Corpus differ from those in the other two. We found that much metaphor use in the Academic Corpus is ‘dead’, that is, highly unlikely to be considered figurative by the writers and readers of these texts (Knudsen, 2003). Some of the same metaphors are brought to life and consciously explored for pedagogic purposes in the Materials Corpus. For example, the greenhouse metaphor, which is highly conventionalised in the technical scientific texts in the Academic Corpus, is often used as a simile in the media texts, encouraging the reader to process it creatively. The Interviews Corpus contained evidence that young people further extend metaphors, on several occasions resulting in inaccurate understandings of climate science. Our detailed study of young people’s language use also produced evidence that many young people are not adept at handling the specialist scientific uses of polysemous words such as release, impact, record and feedback, and are thus hindered in accessing scientific concepts in the secondary school curriculum.
If you wish to attend, please RSVP to email@example.com.
Knudsen, S. (2003) Scientific metaphors going public. Journal of Pragmatics 35; 1247-1263.
Schreiner, C., Henriksen, E. K. & Kirkeby Hansen, P. J. (2008) Climate education: Empowering today's youth to meet tomorrow's challenges. Studies in Science Education 41/1; 3-49.
The Language of Immigration in the Victorian Press: A Historian's Perspective on Corpus Linguistics
Ruth Byrne (Lancaster University)
31 May (5.00pm – 6.30pm)
Room G/8 in the Waterloo Bridge Wing (ground floor) - Waterloo Campus
Abstract: Trends towards digitisation have generated a wealth of resources for historians. However, this abundance can very swiftly become overwhelming. A search for the term ‘alien’ or ‘refugee’ in a digitised newspaper can produce hundreds of thousands of hits. The prospect of having to click on these results one at a time, opening individual articles which then require reading in detail, is enough to dishearten even the most patient and dedicated researcher. One means of approaching this historical ‘big data’ is corpus linguistics.
This paper will introduce my research, which uses the British Library’s Nineteenth Century Newspaper Collection to explore the language surrounding immigration in the Victorian press. It will do so from the perspective of a historian who has newly embraced a corpus linguistic approach. The paper will also outline some of the methodological implications which resulted from the shift between disciplines. The re-contextualisation of historical sources as digitised corpus data raises questions about how integral a source’s materiality is to our interpretation of it. However, the shift in form also opens up exciting new research directions, some of which are unimaginable via a manual reading of the texts.
Rigour, Relevance, Reflection: CL Methodology Through a Critical Lens
Posthumanism and Deconstructive Arguments: Corpora and Digitally-Driven Critical Analysis
Professor Gerlinde Mautner
Wednesday 8 February at 5pm
Abstract: It’s been more than 20 years since corpus linguistics first came to be harnessed to projects in Critical Discourse Analysis (CDA). While the combination of CDA and CL has certainly been fruitful, many epistemological issues remain. The explanatory power of corpus-based evidence is still a source of concern, and the combined approach remains virtually unknown in discourse studies outside linguistics. Furthermore, in spite of the field’s long tradition, many researchers embarking on such projects for the first time often struggle to come to grips with key questions of research design. They often blame themselves for this, when in fact they have simply come up against difficulties that lie in the nature of the research process and are therefore shared by more senior colleagues.
The aim of this paper is to critically examine issues such as the following:
- Managing the fit between research questions, data and method
- Dealing with the relationship between quantitative and qualitative evidence
- Safeguarding against pitfalls in interpretation
- Assessing the strength of empirical claims
- Not letting analytical tools control the research process
It will be argued further that whether such problems are dealt with satisfactorily (rather than 'solved') depends on four critical success factors: knowledge, intuition, trial and error, and experience.
Experts or natives? Using corpus methods in a critical appraisal of the notion of ELF/A
9 November 5pm-6:30pm
Chris Tribble (King’s College London)
Making corpus linguistics work for you: A case study in using corpus linguistic tools in applied research
18 November 1.00pm-2.30pm
Dr Chris Tribble presented preliminary results from an investigation of current literacy practices in academic communication, leading into a broader discussion about corpus development and analysis.
Zsófia Demjén (The Open University) and Elena Semino (Lancaster University)
‘You shall know a word by the company it keeps’: Applying collocation analysis to investigate the relationship between language and gender
11 March 1.10pm-2.40pm
Professor Elena Semino and Dr Zsófia Demjén gave a talk on how corpus linguistic tools can be applied to identify metaphors in the context of medical communication. The data consisted of interviews with and online forum contributions by people with progressive cancer, family carers and healthcare professionals. The findings had implications for patient groups, charities and healthcare professionals.
Charlotte Taylor (University of Sussex)
3 June 2016, 1pm-2.30pm
In this third paper in the ‘Corpus research in linguistics and beyond’ series, I looked at what we can do with a corpus through the use of two case-studies, both of which examine aspects of the relationship between language and gender. The first case-study compared the way the terms ‘girl’ and ‘boy’ are used in British newspaper discourse and the second investigated whether labels such as ‘sarcastic’ are actually gendered terms. Through discussion of these case-studies, the aim was to show a range of approaches and tools that non-linguists might use to answer research questions in their disciplines. Both case-studies continue the theme of the previous two talks in this series by examining multi-method approaches. In this case, corpus methods are combined with discourse analysis and pragmatics, and the corpus data is also used in survey construction to elicit further data. Both case-studies also make use of the notion of collocation, a key concept in corpus linguistic work. Collocation refers to the tendency of certain words or phrases to occur together with other words and phrases. Or, to use Firth’s famous definition, the idea that ‘You shall know a word by the company it keeps’. In the case-studies, I look at how we can investigate collocation using different tools, which give us different ‘ways in’ to our data, and how we can interpret what collocation can tell us about the topic we are investigating.
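For readers who have not met collocation before, the basic idea can be sketched in a few lines of Python. The sketch below simply counts the words found within a fixed window around a node word; the function name, the window size and the toy sentence are made up for illustration, and it is not the procedure used in the case-studies. Corpus tools typically go further and rank collocates with an association measure instead of raw counts.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - span)
        end = min(len(tokens), i + span + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Toy example (invented text); a real study would use a full corpus.
text = "the little girl smiled and the little boy laughed at the little girl"
tokens = text.split()
print(collocates(tokens, "girl", span=1).most_common())
print(collocates(tokens, "boy", span=1).most_common())
```

Comparing the two lists shows, in miniature, how the ‘company’ kept by ‘girl’ and ‘boy’ can be set side by side.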