Automated essay scoring (AES) has undergone fundamental change and continuous improvement within the natural language processing community. Modern AES systems correlate strongly with human scores, yet they still meet resistance because of validity concerns and poor alignment with teaching values. Our analysis shows major limitations in their consistency and reliability when compared to human raters, who continue to demonstrate high intraclass consistency among themselves.
The machine learning landscape behind automated essay scoring has transformed over the last several years.
Defining the Scope of Automated Essay Scoring in 2025

Understanding what automated essay scoring is asked to do in educational settings helps define its scope.
Task Definition and Scoring Objectives
AES operates within several assessment frameworks. Early systems relied mostly on surface features, such as word frequency and sentence length, to evaluate essays.
Feature-based AES has been developed and refined over almost 50 years, beginning with early systems such as Ellis Page's Project Essay Grade.
Common Datasets: ASAP, TOEFL11, and PERSUADE
Three main datasets form the foundation of automated essay scoring research:
ASAP (Automated Student Assessment Prize) – This widely used collection contains 12,978 essays written by students in grades 7-10 across eight prompts, spanning persuasive, source-dependent response, and narrative writing. ASAP became the standard benchmark for AES systems after the 2012 Kaggle competition.
TOEFL11 – This dataset contains 12,100 essays by non-native English writers, with 1,100 essays for each of the 11 native-language backgrounds represented. Unlike ASAP, TOEFL11 provides only coarse overall ratings of ‘low,’ ‘medium,’ and ‘high.’
PERSUADE – PERSUADE 2.0 contains about 13,000 essays scored by 20 expert raters using double-blind rating and full adjudication. Essays receive scores from 1 to 6, where 6 indicates excellent writing. Before final adjudication, the expert raters agreed strongly, with a weighted κ of 0.745 and r = 0.750.
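For orientation, here is a minimal sketch of loading the ASAP release with pandas; the file name and column names reflect the public Kaggle download, so verify them against your own copy:

```python
import pandas as pd

# The Kaggle ASAP release ships as a tab-separated file; the file and
# column names below match the public download but should be verified.
df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Each row is one essay: its prompt (essay_set, 1-8), the text, and the
# resolved human score for the main scoring domain.
essays = df[["essay_id", "essay_set", "essay", "domain1_score"]]

# Score ranges differ per prompt, so scores are commonly normalized to
# [0, 1] within each essay_set before model training.
essays = essays.assign(
    norm_score=essays.groupby("essay_set")["domain1_score"]
    .transform(lambda s: (s - s.min()) / (s.max() - s.min()))
)
print(essays.head())
```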
Evaluation Metrics: Quadratic Weighted Kappa and Pearson Correlation
Researchers rely on a small set of standard metrics to judge how well AES systems work. Quadratic weighted kappa (QWK) is the most common: it measures chance-corrected agreement between automated and human scores, penalizing larger disagreements more heavily. Pearson correlation (r) complements it by measuring the strength of the linear relationship between the two sets of scores.
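Both metrics are straightforward to compute with standard libraries; here is a minimal sketch using scikit-learn and SciPy on a toy pair of score lists:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = [2, 3, 4, 4, 1, 3, 2, 4]   # human rater scores
model = [2, 3, 3, 4, 2, 3, 2, 4]   # automated scores on the same essays

# Quadratic weighted kappa: chance-corrected agreement in which a
# disagreement of 2 points is penalized four times as heavily as 1 point.
qwk = cohen_kappa_score(human, model, weights="quadratic")

# Pearson r: strength of the linear relationship between the score sets.
r, _ = pearsonr(human, model)

print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f}")
```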
From Feature-Based to Neural Approaches in AES

Automated essay scoring systems have undergone a radical shift in their underlying natural language processing approach: development has moved away from hand-encoded human expertise toward machine-learned feature extraction.
Manual Feature Engineering in Early AES Systems
Early AES systems relied on feature engineering: experts hand-designed textual features and fed them to statistical models that predicted scores.
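A minimal sketch of this style of system appears below; the three surface features are illustrative stand-ins rather than the feature set of any particular historical system:

```python
import re
import numpy as np
from sklearn.linear_model import LinearRegression

def surface_features(essay: str) -> list:
    """Handcrafted surface features of the kind early AES systems used."""
    words = essay.split()
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
    avg_sent_len = len(words) / len(sentences) if sentences else 0.0
    return [len(words), avg_word_len, avg_sent_len]

# Fit a simple linear model mapping features to human-assigned scores.
train_essays = ["First training essay ...", "Second training essay ..."]
train_scores = [3.0, 4.0]

X = np.array([surface_features(e) for e in train_essays])
reg = LinearRegression().fit(X, train_scores)

print(reg.predict(np.array([surface_features("A new unseen essay ...")])))
```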
A Neural Approach to Automated Essay Scoring
To eliminate the need for feature engineering, researchers turned to deep neural networks that learn feature extraction automatically.
Neural networks automatically learn hierarchical features from raw data.
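A minimal PyTorch sketch in the spirit of early neural scorers, such as Taghipour and Ng's LSTM model with mean-over-time pooling, follows; the layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class NeuralScorer(nn.Module):
    """Embedding -> LSTM -> mean pooling -> sigmoid score in [0, 1]."""
    def __init__(self, vocab_size: int, embed_dim: int = 50, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)      # (batch, seq, embed_dim)
        out, _ = self.lstm(x)          # (batch, seq, hidden)
        pooled = out.mean(dim=1)       # average hidden state over time
        return torch.sigmoid(self.head(pooled)).squeeze(-1)

model = NeuralScorer(vocab_size=10_000)
dummy_batch = torch.randint(1, 10_000, (4, 300))   # 4 essays, 300 tokens
print(model(dummy_batch).shape)                    # torch.Size([4])
```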
Prompt-Specific vs Prompt-Agnostic Models
Cross-prompt automated essay scoring has become crucial for building practical AES systems. Prompt-specific models are trained and tested on essays from a single prompt, which makes them accurate but costly to build for every new assignment, whereas prompt-agnostic models aim to score essays from prompts never seen during training.
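The standard protocol for testing prompt-agnostic behavior is leave-one-prompt-out evaluation: train on every prompt except one, then score the held-out prompt. A minimal sketch, reusing the ASAP-style column names from the loading example above:

```python
from sklearn.metrics import cohen_kappa_score

def leave_one_prompt_out(df, train_and_predict):
    """Train on all prompts but one; report QWK on the held-out prompt.

    `train_and_predict(train_df, test_df)` wraps any scoring model as a
    function that returns predicted scores for the rows of test_df.
    """
    results = {}
    for prompt in sorted(df["essay_set"].unique()):
        train = df[df["essay_set"] != prompt]
        test = df[df["essay_set"] == prompt]
        preds = train_and_predict(train, test)
        results[prompt] = cohen_kappa_score(
            test["domain1_score"].astype(int),
            [int(round(p)) for p in preds],
            weights="quadratic",
        )
    return results   # QWK per held-out prompt reveals generalization gaps
```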
Breakthroughs in Language Models and Automated Essay Scoring
Transformer architectures have taken automated essay scoring to new levels of accuracy and reliability. Unlike older models, transformer-based approaches offer deeper contextual understanding and adapt more readily to new prompts and tasks.
Transformer-based Models for Essay Understanding
Transformer models have reshaped how AES systems understand student writing.
On German-language essays, for example, German BERT variants substantially outperform traditional regression methods.
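The basic recipe is to treat scoring as single-output regression on top of a pretrained encoder. A minimal sketch with the Hugging Face transformers API; the model choice and the normalized labels are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# num_labels=1 with problem_type="regression" puts an MSE-trained
# regression head on top of the encoder.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression"
)

batch = tokenizer(
    ["An example student essay ...", "Another essay ..."],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
labels = torch.tensor([[0.6], [0.8]])   # normalized human scores

# One training-style forward pass; passing labels makes the model
# return the regression loss alongside the predicted scores.
out = model(**batch, labels=labels)
print(out.loss, out.logits.squeeze(-1))
```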
Fine-tuning Pretrained Language Models for Trait Scoring
Customizing pretrained language models to specific assessment criteria is a breakthrough: modern systems no longer produce only a holistic score but also rate individual traits such as organization, support for the main idea, and language use.
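One common way to realize trait scoring is a shared encoder with a small regression head per trait. A minimal PyTorch sketch; the trait names are examples, not a fixed standard:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

TRAITS = ["organization", "support", "language_use"]   # example traits

class TraitScorer(nn.Module):
    """Shared pretrained encoder with one regression head per trait."""
    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in TRAITS})

    def forward(self, **batch):
        cls_vec = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS]
        return {t: head(cls_vec).squeeze(-1) for t, head in self.heads.items()}

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["An example essay ..."], return_tensors="pt", truncation=True)
print({t: v.shape for t, v in TraitScorer()(**batch).items()})
```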
Cross-Prompt Generalization with Domain Adaptation
Domain adaptation techniques help score essays from prompts the system hasn’t seen before.
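One widely used option is adversarial training with a gradient reversal layer, following the DANN idea: a prompt classifier trains on the encoder's features while reversed gradients push those features toward prompt invariance. A minimal sketch of the reversal layer itself; this is one of several adaptation techniques, not the method of any specific study cited here:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates scaled gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient for lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: essay features feed two heads. The score head trains normally,
# while the prompt-ID head sees reversed gradients, pushing the shared
# encoder to discard prompt-specific cues yet stay predictive of score.
features = torch.randn(4, 128, requires_grad=True)
prompt_logits = torch.nn.Linear(128, 8)(grad_reverse(features))
```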
Achieving 98% Human-Level Accuracy: Empirical Evidence
Studies show remarkable progress in automated essay scoring: recent benchmarks reveal unprecedented agreement between machine and human grading. These developments deserve a closer look to understand what they mean for educational assessment.
Benchmarking GPT-4 and Gemini on 4,819 Essays
Large language models' essay scoring abilities were tested against human raters: GPT-4 and Gemini each scored the same set of 4,819 essays that humans had graded.
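A minimal sketch of how such a benchmark can query a model for a score, using the OpenAI Python client; the rubric wording and the 1-6 scale here are illustrative, not the benchmark's actual protocol:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the essay from 1 (poor) to 6 (excellent) for overall writing "
    "quality. Respond with the integer score only."
)

def score_essay(essay: str, model: str = "gpt-4") -> int:
    response = client.chat.completions.create(
        model=model,
        temperature=0,   # reduce run-to-run variation in scoring
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay},
        ],
    )
    return int(response.choices[0].message.content.strip())

# print(score_essay("An example student essay ..."))
```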
Calibration Frameworks for Human-AI Agreement
Calibration techniques have proven the most direct way to bring AI scores into line with human judgments.
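One simple calibration step is to fit a monotonic mapping from raw model scores to the human scale on a small set of essays scored by both; a minimal sketch with scikit-learn's isotonic regression, which is one of several reasonable calibration choices:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Anchor set: essays scored by both the model and human raters.
model_scores = np.array([0.21, 0.35, 0.48, 0.62, 0.70, 0.88])
human_scores = np.array([2, 3, 3, 4, 5, 6])

# Fit a monotonic mapping from the model's scale onto the human scale.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(model_scores, human_scores)

new_raw = np.array([0.30, 0.55, 0.95])
calibrated = np.rint(calibrator.predict(new_raw)).astype(int)
print(calibrated)   # raw model outputs mapped to human-style grades
```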
Scoring Consistency and Rater Bias Analysis
Current systems show both strengths and weaknesses on consistency metrics: overall agreement can be high while still varying across demographic groups and essay types.
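Consistency audits typically compare agreement statistics overall and within subgroups; a minimal sketch, where the human, model, and grouping column names are illustrative:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def agreement_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """QWK and exact-match agreement, overall and per subgroup."""
    def stats(sub):
        return pd.Series({
            "n": len(sub),
            "qwk": cohen_kappa_score(sub["human"], sub["model"],
                                     weights="quadratic"),
            "exact": (sub["human"] == sub["model"]).mean(),
        })
    overall = stats(df).rename("overall")
    by_group = df.groupby(group_col).apply(stats)
    # Large gaps between subgroups flag potential scoring bias.
    return pd.concat([overall.to_frame().T, by_group])
```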
Challenges and Future Directions in AES Research
Automated essay scoring has made huge strides, but some big hurdles still stand in the way of truly reliable systems. These challenges will shape how research moves forward in this field.
Data Scarcity and Labeling Problems
Few-shot learning runs into trouble when a new domain offers only a handful of labeled essays, limiting how well models generalize to unfamiliar prompts and populations.
Handling Different Types of Document Reviews
Beyond the standard persuasive essay, systems must evaluate many kinds of writing and review tasks consistently, and current models still struggle with this variety of content types.
Using Chain-of-Thought to Explain Scores
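Chain-of-thought prompting asks the model to lay out its reasoning before committing to a grade, which makes scores easier to audit and to explain to students. A minimal sketch of such a prompt template; the rubric steps and output format are illustrative:

```python
COT_SCORING_PROMPT = """You are an essay rater. Work step by step:
1. Summarize the essay's thesis in one sentence.
2. Assess organization, evidence, and language use, quoting the text.
3. Only then output a final line of the form: SCORE: <integer 1-6>

Essay:
{essay}
"""

def build_prompt(essay: str) -> str:
    return COT_SCORING_PROMPT.format(essay=essay)

# The rationale preceding "SCORE:" can be shown to students as feedback
# and audited by human raters; only the final line is parsed as the grade.
```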
Conclusion
Automated essay scoring has seen remarkable changes over the years. What started as simple feature-engineering has now grown into sophisticated neural network architectures. The path to reaching 98% human-level accuracy marks a huge milestone in educational technology. Early AES systems didn’t work as well as human evaluators. However, recent transformer-based models have made this gap much smaller.
A major breakthrough came when AES moved from manual feature engineering to neural approaches. Old systems depended on pattern matching and statistics. Modern systems now use deep learning to find meaningful features on their own. BERT and RoBERTa transformer architectures have shown they can capture both lexical features and semantic properties. This improves writing quality assessment beyond basic metrics.
The rise of fine-tuned pretrained language models has changed how traits are scored. Today’s systems review organization, main idea support, and language usage with great precision. Domain adaptation techniques help systems apply what they learned from scored essays to new prompts. This solves one of the biggest problems in AES.
Claims of 98% human-level accuracy need careful review. Studies show GPT-4 and Gemini match human evaluations well, but some differences exist. Calibration frameworks help optimize how AI and humans agree on scores. Yet scoring consistency changes based on demographic groups and essay types. These differences show we need to keep improving these systems.
AES research faces several hurdles. Data scarcity and labeling costs create bottlenecks: few-shot learning struggles with limited samples, which makes it hard for models to adapt to new situations. Systems also need to get better at handling different types of content, and chain-of-thought validation must improve so that scores can be explained clearly and accurately.
The future looks bright for automated essay scoring, but we must stay watchful. These systems now work almost as well as humans and offer new ways to speed up assessments while giving quick feedback. They must balance speed with fairness to evaluate all students equally. Automated essay scoring stands at a crucial point – ready to change educational assessment while tackling basic questions about validity, reliability, and teaching alignment.