
Please note: this event has passed


To RSVP, please email: clyde.ancarno@kcl.ac.uk

Spoken data collected for ethnographic study or similar qualitative analyses may often be usefully re-operationalised as a corpus sensu stricto to allow corpus-linguistic methodologies to be applied.

This presentation explores the authors’ efforts at such a re-operationalisation, targeted at a dataset of 82 patient-provider interactions in hospital Emergency Departments in Australia. The dataset was originally collected by Slade and colleagues (Slade et al. 2015) and is more than 1.4 million tokens in extent.

Slade and colleagues' original use of the data may fairly be described as qualitative in that their study did not use corpus-based or other statistical methods. Such qualitative data collection is typical not only in various areas in the linguistic study of discourse but also in many other fields where transcriptions are collected as one source of data in what we may characterise broadly as ethnographic research. But any such collection that is of substantial extent invites analysis with corpus methods – especially since spoken corpus analysis normally uses orthographic transcription, that is, exactly what is produced in such ethnographic data collection.

What, then, is involved in getting a dataset from a bundle of transcriptions intended for qualitative analysis to a corpus-oriented format suitable for use with modern corpus analysis software?

In their work on the Emergency Departments data, Hardie and Collins mapped the transcriptions automatically from Word documents to XML following Hardie’s (2014) “modest” approach, with structured markup being derived from layout information embedded in the original document, largely by search-and-replace. Unfortunately, such automatic conversion was not wholly sufficient, as many features of the original data resisted conversion by such relatively naïve measures. More complex steps, combined with manual effort, were needed to address the two central problems of ambiguity and inconsistency in the transcriptions; while these pose no difficulty for the human analyst, they are major stumbling blocks for a computer.
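To make the idea of layout-driven conversion concrete, the following is a minimal sketch in Python, not the authors' actual procedure. It assumes a hypothetical transcription convention in which each utterance sits on its own line as "SPEAKER: text"; the element and attribute names (`text`, `u`, `who`) and the function name `transcript_to_xml` are illustrative assumptions. Lines that do not fit the pattern are set aside for manual review, which mirrors the ambiguity and inconsistency problems described above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed (hypothetical) layout convention: "SPEAKER: utterance text" per line.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z0-9]*):\s*(.+)$")


def transcript_to_xml(lines):
    """Convert speaker-labelled lines to simple XML.

    Returns the XML string and a list of (line_number, text) pairs for
    lines that did not match the assumed layout and need manual attention.
    """
    converted = []
    unmatched = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(text)
        if match:
            speaker, utterance = match.groups()
            # Escape markup-significant characters such as & and <.
            converted.append(
                f"<u who={quoteattr(speaker)}>{escape(utterance)}</u>"
            )
        else:
            unmatched.append((number, text))
    xml = "<text>\n" + "\n".join(converted) + "\n</text>"
    return xml, unmatched


sample = [
    "D1: how are you feeling today",
    "P: not great & a bit dizzy",
    "(nurse enters)",
]
xml, unmatched = transcript_to_xml(sample)
print(xml)
print(unmatched)
```

In this sketch the third sample line carries no speaker label, so it is reported rather than converted. A transcription that used several different layouts for such contextual notes would yield many unmatched lines of this kind, each needing a human decision.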

Reflection on the difficulties they encountered in this undertaking led them to devise a series of simple recommendations for transcription practice which investigators collecting data for ethnographic or other qualitative research would be well advised to follow if they wish to leave open the possibility of accessing the data via corpus-based methods and software. Adhering to these recommendations, then, “future-proofs” non-corpus datasets of spoken language by preventing inadvertent erection of barriers to such a future re-use of the data.

Speakers

Andrew Hardie and (in absentia) Luke Collins are both academics in the Centre for Corpus Approaches to Social Science at Lancaster University.

Bibliography

Hardie, A. 2014. ‘Modest XML for Corpora: Not a standard but a suggestion’, ICAME Journal 38, pp 73-103.

Slade, D., Manidis, M., McGregor, J., Scheeres, H., Chandler, E., Stein-Parbury, J., Dunston, R., Herke, M. and Matthiessen, C. M. I. M. 2015. Communicating in Hospital Emergency Departments. Heidelberg: Springer.