
Evaluating the Reliability and Acceptability of AI Evaluation and Feedback of Medical School Course Work


GKT students at KCL submit over 2000 essays annually. Marking timelines are tight, especially for NHS-employed clinicians. As with all written assignments, marking is subject to "hawk and dove" effects, where some markers lean towards severity while others are lenient. These factors contribute to student dissatisfaction, an issue echoed in NSS responses.

Emerging technology holds potential to mitigate these challenges. As outlined in a recent systematic review (González-Calatayud et al., Appl. Sci. 2021; 11: 5467), machine learning models can be trained on marking rubrics using Natural Language Processing (NLP) algorithms. Such models offer an opportunity to improve marking speed and consistency. However, applying these advances in medical education requires careful study. The primary goal of this project is to evaluate the practicality and acceptability of incorporating AI into the marking system for MBBS written essays at KCL.


The initiative will use a mixed-methods approach: the first and third phases are qualitative, and the second phase is quantitative.

Phase 1: Initial Stakeholder Engagement - A focus group will explore perspectives on AI marking. Participants will be purposively sampled so that all major stakeholders (students and faculty) are represented. The session will be held via MS Teams and audio-recorded to ensure fidelity to participants' perspectives. We will use a grounded theory approach: we will start without preconceived theories and instead allow the data (the transcript) to guide the development of themes. The iterative process of coding and categorising data will help identify recurring patterns, insights, and concerns voiced by participants, which will inform our understanding of their perceptions and expectations of introducing AI into the marking system.

Phase 2: Comparison of Methods 

  • Design and Data Source: The design for phase 2 is cross-sectional. 
  • Comparisons: Three marking models will be compared: AI-only, Hybrid (AI plus human), and Human-only.
  • Outcomes: Project mark (%), marking time (minutes) 
  • Sample Size: 1200 essays, each 1500 words, from three years of quality improvement and elective essay submissions. 
  • Statistical Approach:
      • Reliability and Agreement: The intraclass correlation coefficient and Cohen's kappa will be used to assess inter-marker reliability and agreement.
      • Threshold Effects: ROC curves will be constructed to study threshold effects at different grade boundaries.
      • Performance Across Dimensions: Factor analysis will help characterise differences in AI model performance across individual marking domains.
      • Marking Time: Analysed only in the subset of essays for which marking time is available. Mean differences between marking models will be compared using ANOVA.
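
The two agreement statistics named above can be sketched in a few lines of Python. This is a minimal illustration only, not the project's analysis code. The function names, the pass/fail categories and the example marks are all hypothetical. The choice of the two-way random-effects, absolute-agreement, single-rater form of the intraclass correlation, ICC(2,1), is an assumption, since the project description does not say which ICC form will be used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two markers assigning categorical grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both markers must grade the same non-empty set of essays")
    n = len(rater_a)
    # Proportion of essays on which the two markers agree exactly.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each marker's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both markers used one identical category throughout
    return (observed - expected) / (1 - expected)


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` holds one row per essay and one column per marker
    (for example: human marker, AI marker).
    """
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 essays and 2 markers, no missing marks")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Two-way ANOVA sums of squares: essays (rows), markers (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all marks are identical")
    return (ms_rows - ms_err) / denom


if __name__ == "__main__":
    # Hypothetical marks (%) for three essays: column 0 = human, column 1 = AI.
    marks = [[50, 60], [60, 70], [70, 80]]
    print("ICC(2,1):", round(icc_2_1(marks), 3))
    # Hypothetical pass/fail decisions from the same two markers.
    human = ["pass", "pass", "fail", "fail"]
    ai = ["pass", "fail", "fail", "fail"]
    print("Cohen's kappa:", round(cohens_kappa(human, ai), 3))
```

In the hypothetical marks above, the second marker is consistently ten points more generous than the first. Because the absolute-agreement form counts systematic marker differences against reliability, this "dove" offset lowers ICC(2,1) even though the two markers rank the essays identically. A consistency-type ICC would ignore the offset.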

Phase 3: Final Stakeholder Engagement - Building on the findings of Phases 1 and 2, this face-to-face session will present the results and seek insights from students and senior KCL faculty to plan future steps.

Findings will be reported through the Assessment Board to the School (Faculty) Education Committee. They will also be shared, as appropriate, with KCL’s Education Executive for institutional benefit and disseminated via conferences and peer-reviewed publications. This project isn't just about AI; it's a strategic shift that could influence essay assessment across KCL. Our goal is clear: faster grade delivery, reduced bias, and better, more timely feedback for students.

Project status: Ongoing
