DISCORE (DIScourse and COrpus REsearch Group) is an interdisciplinary research group based in the Department of Foreign Language Education at Middle East Technical University, bringing together researchers working at the intersection of language, discourse, spoken and written interaction and corpus research.

In this seminar, we will showcase our diverse and innovative uses of corpus methodology by focusing on three projects: 1) the compilation of the first spoken corpus of Turkish youth language (CoTY), which adopts a corpus-assisted discourse studies (CADS) approach to identify topical and lexical characteristics and, specifically, genre-specific interactional markers in contemporary Turkish youth spoken interaction; 2) the use of a multimodal Social Media Influencer Corpus (SMIC) to analyse translanguaging practices of Turkish vloggers; and 3) the use of corpus linguistics to develop evidence-based insights into English-medium instruction through an analysis of 180 hours of classroom interactional data in the English-Medium Instruction Corpus (EMIC), collected from EMI learning-teaching environments in Türkiye.

Full details of the projects are listed below the speaker biographies.


  • Dr Hale Işık-Güler is Associate Professor of Linguistics in the Department of Foreign Language Education (FLE) at METU and leader of the DISCORE research group.
  • Dr Esranur Efeoğlu-Özcan is a visiting research fellow in the Centre for Language, Discourse & Communication at King’s College London and DISCORE coordinator.
  • Dr Hülya Mısır is a visiting research fellow in the School of English, Drama and Creative Studies at the University of Birmingham.
  • Pınar Turan is a research assistant and Ph.D. candidate in Language Studies in the Department of Foreign Language Education at METU.

Full details of the projects

1) The Corpus of Turkish Youth Language (CoTY) by Dr. Esranur Efeoğlu-Özcan

The defining linguistic characteristics of Turkish youth interaction have so far remained invisible within both Turkish linguistics and cross-linguistic studies. To fill this gap and provide baseline data for further research on contemporary spoken Turkish and for cross-linguistic youth language studies, Dr. Efeoğlu-Özcan compiled the first corpus of youth language for Turkish. The Corpus of Turkish Youth Language (CoTY) is a specialized spoken corpus of 168,748 tokens constructed using EXMARaLDA software. It covers a single domain: informal conversation exclusively among friends aged 14 to 18 from various socio-economic backgrounds in Türkiye. The data consist of naturally occurring, spontaneous interaction in Turkish, with occasional code-switches to English and some words or expressions from French, Russian and Japanese. While CoTY is designed to encompass various modes and mediums of youth interaction and to expand over the years, the current version focuses on spoken data. The corpus has 123 unique speakers (62 females and 61 males) and consists of 49 conversations, which correspond to 26 hours 11 minutes of interaction.

In the first phase of investigating Turkish youth language, CoTY was explored in terms of topical and lexical characteristics, and the analysis identified genre-specific interactional markers (Ruhi, 2013) frequently used in Turkish youth interaction. Four categories of interactional markers, namely (i) response tokens, (ii) vocatives, (iii) vague expressions, and (iv) intensifiers, were the main foci of further analysis in the project. For each category of markers, the types, patterns and salient pragmatic functions were examined using a corpus-assisted discourse analysis approach (Partington, 2004). It is hoped that this project will provide baseline data for future studies on contemporary spoken Turkish and cross-linguistic youth language research.

2) The Social Media Influencer Corpus (SMIC) by Dr. Hülya Mısır

Vlogging is a mode of social media which has surged in popularity over the past decade with the rise of platforms like YouTube. As a multimodal format, vlogs are semiotically complex, featuring multi-party interactions, concurrent oral and written communication, and diverse linguistic codes or modes. Analyzing language datasets derived from vlogs poses significant challenges due to this complexity. Unorthodox language use, including translanguaging practices, often remains overlooked; to address this, a specialized corpus approach is proposed as a beneficial method. Along these lines, Dr. Mısır created a specialized social media corpus of vlogs on YouTube to investigate linguistic practices and multimodal communication among Turkish social media influencers.

In this presentation, I discuss how a Social Media Influencer Corpus (SMIC) can be used to analyze translanguaging phenomena within the (Turkish) influencer vlogging context (Mısır and Işık-Güler, 2023). The corpus consists of 30 videos from 6 macro influencers, encompassing 12 hours and 37 minutes of content (120,928 tokens) posted between 2020 and 2021. Using ELAN software, I preprocessed the vlogs, including metadata entry, time-aligned segmentation, and manual transcription. Through ad hoc annotation, I identified patterns of translanguaging practices where influencers seamlessly blended languages and created hybrid linguistic repertoires. The findings illustrate the co-occurrence of standardized linguistic codes and non-standardized forms, as well as the organic evolution of lexical innovations, such as net neologisms and genre-related digital lexis, phonetic transliterations, idiosyncratic expressions, and marketing terminology. Rigid linguistic boundaries are therefore consistently challenged by social media influencers, and these practices suggest a broader disconnect between actual language use and the nationalist monolingualism (Li, 2016, 2018) promoted by state media and political actors in Türkiye.

3) The construction of the English Medium Instruction Corpus (EMIC) by Pınar Turan (PhD Candidate)

English-medium instruction (EMI) is a booming educational language policy in countries where English is not the primary language of communication. Apart from the realities brought about by this model, new unknowns that require adaptive planning for (a) face-to-face, (b) hybrid, and (c) online courses continue to emerge in higher education. Still, despite the call for “evidence-based and data-led” research into what is happening in EMI classrooms, there seems to be a dearth of research exploring these novel and relevant learning environments. Consequently, our ongoing corpus project sets out to create a sphere to explore EMI classrooms from an emic perspective.

EMIC (English-Medium Instruction Corpus) consists of 180 hours of video-recorded classroom interactional data gathered from EMI learning-teaching environments in Türkiye. Covering 65 programs at four universities, including theoretical and applied life sciences and humanities, EMIC encompasses three course types: (1) lecture/direct lecture, (2) interactive seminar, and (3) laboratory/studio (i.e., active learning environments). As the corpus aims to offer a multifaceted database lending itself to both corpus-driven and interactional analyses (e.g., Conversation Analysis) and to fully utilize the CL-CA interface, the transcription and annotation processes posed unique challenges. Transcribed with Transana software, the database is nearly complete, while issues of annotation (e.g., standards, reusability) in relation to database development (e.g., adaptability, expandability, user interface) continue to be addressed.

EMIC aims to be a nexus for the growing field of EMI through rigorously developed data-led insights for a diverse body of stakeholders, including practitioners and educational policy-makers. Our preliminary engagement with the data appears to identify divergence amongst disciplines; distinct linguistic, paralinguistic, and multimodal features of face-to-face, online, and hybrid lectures; footprints of sociolinguistic phenomena in contemporary EMI classrooms, including translanguaging and other approaches to multilingualism; as well as pedagogical strategies regarding teacher talk. In further steps of the project, the findings are to be developed into a data-driven learning (DDL) module for EMI university professors.

