Automated Essay Scoring in 2025: New Methods Achieve 98% Human-Level Accuracy

Automated essay scoring has undergone fundamental changes and continuous improvement within the natural language processing community. Modern AES systems show strong correlations with human scoring, yet they face resistance over validity concerns and poor alignment with teaching values. Our analysis shows major limitations in their consistency and reliability compared with human raters, who alone demonstrate high intraclass consistency.

The landscape of automated essay scoring with machine learning has transformed over the last several years, and research activity around automated scoring systems has picked up steam in the past two decades. These systems speed up evaluation and reduce human evaluators’ workload. Neural approaches to automated scoring have improved performance, yet studies reveal systematic differences between human and AI scoring patterns: AI models tend to be more conservative than human evaluators, who show a clear leniency bias. Recent developments in NLP techniques and language models have helped achieve what many call 98% human-level accuracy in 2025. We’ll look at the evidence supporting these claims and explore what these advances mean for educational assessment’s future.

Defining the Scope of Automated Essay Scoring in 2025

Automated essay grading systems displayed on various educational platforms and devices for efficient assessment.

Image Source: ResearchGate

Understanding how automated essay scoring works in education helps define its scope. AES systems in 2025 look at word count, vocabulary choice, error density, sentence length variance, and paragraph structure to create quality models. These AI tools aim to improve efficiency and give consistent scores while providing quick feedback with little human input.

Task Definition and Scoring Objectives

AES works through several assessment frameworks. Early AES systems mostly used surface data like word frequency and sentence length to review essays, missing vital factors like text structure and meaning. Modern approaches in 2025 go much further: they include overall quality assessment alongside specific reviews of content, organization, word choice, and how sentences flow.

Feature-based AES has developed over almost 50 years, showing how different writing elements matter for various types of essays. Today’s AES systems are far more capable at supervised scoring. They have become systematic applications that focus on better data, expanded models, and the use of outside knowledge. On top of that, they combine features from different linguistic levels to understand how texts connect.

Common Datasets: ASAP, TOEFL11, and PERSUADE

Three main datasets are the foundations of automated essay scoring research:


  1. ASAP (Automated Student Assessment Prize) – This popular collection has 12,978 essays from eight prompts written by 7th-10th graders. The essays cover persuasive, source-dependent response, and narrative writing. ASAP became the standard for testing AES systems after a Kaggle competition in 2012.

  2. TOEFL11 – This dataset has 12,100 essays from non-native English writers. Unlike ASAP, TOEFL11 only gives overall scores of ‘low,’ ‘medium,’ and ‘high’. Each prompt has 1,100 essays spread evenly.

  3. PERSUADE – PERSUADE 2.0 has about 13,000 essays scored by 20 experts using double-blind rating and full review. Essays get scores from 1-6, where 6 means excellent writing skills. Expert raters agreed strongly before final review, with a weighted κ of 0.745 and r = 0.750.


Evaluation Metrics: Quadratic Weighted Kappa and Pearson Correlation

Researchers use several ways to measure how well AES systems work. Quadratic Weighted Kappa (QWK) and Pearson correlation are the most popular.

QWK measures how well automated and human scores match up while accounting for chance agreement. Since the 2012 Hewlett Foundation ASAP competition, QWK has become the go-to metric. Scores above 0.65 are generally taken to show that AES works well, and a QWK of at least 0.70 between automated and human scoring is a common threshold for an acceptable system.

Pearson correlation (r) shows how consistent the rankings are between two sets of scores. It measures how closely AES and human scores follow each other. Scores closer to 1 show that AES and human scores match well.

Other key metrics include Exact Agreement (EA), which shows how often AES gives the same grade as humans, and Adjacent Agreement (AA), which counts essays scored within one point of human ratings.
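To make these metrics concrete, here is a from-scratch sketch of QWK, Pearson correlation, exact agreement, and adjacent agreement. The function names are our own, not a library API; in practice you would reach for a tested implementation such as scikit-learn’s `cohen_kappa_score` with quadratic weights.

```python
from collections import Counter

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """QWK: agreement between two raters, penalising large disagreements
    quadratically and correcting for chance agreement."""
    n = max_score - min_score + 1
    obs = [[0] * n for _ in range(n)]               # observed score pairs
    for h, m in zip(human, machine):
        obs[h - min_score][m - min_score] += 1
    total = len(human)
    hist_h = Counter(h - min_score for h in human)  # marginal histograms
    hist_m = Counter(m - min_score for m in machine)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2           # quadratic weight
            num += w * obs[i][j]
            den += w * hist_h[i] * hist_m[j] / total  # chance-expected
    return 1.0 - num / den

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def exact_agreement(human, machine):
    """Fraction of essays where AES gives exactly the human grade."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def adjacent_agreement(human, machine):
    """Fraction of essays scored within one point of the human rating."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)
```

Perfect agreement gives QWK = 1.0, pure chance-level agreement gives 0, and systematic disagreement can drive it below zero, which is why a fixed threshold like 0.70 is a meaningful bar.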

From Feature-Based to Neural Approaches in AES

Diagram illustrating a common framework used in existing Automated Essay Grading systems processing student essays.

Image Source: ResearchGate

Automated essay scoring has seen a radical shift in its natural language processing approaches. AES development has moved away from hand-built human expertise and now relies on machine intelligence for feature extraction.

Manual Feature Engineering in Early AES Systems

AES systems initially used feature-engineering approaches. Experts designed textual features to predict scores. These systems used pattern matching and statistical-based methods to review essays. Feature-based AES has continued for almost 50 years, showing how linguistic elements shape our understanding of essay quality.

Feature-engineering approaches analyze hand-tuned elements like essay length, grammatical errors, spelling mistakes, and vocabulary complexity. Educational Testing Service’s E-rater used patented NLP techniques to extract linguistic features that reviewed style and content. IntelliMetric became the first AES system to use artificial intelligence. It processed more than 400 features in five groups: focus and unity, organization, development, sentence structure, and mechanics.

Feature-engineering approaches offer clear interpretability and explainability. That said, these methods need considerable effort to engineer and tune features for high scoring accuracy.
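As a toy illustration of this style of system, the sketch below extracts a handful of hand-crafted surface features. It is only a stand-in: real engines like E-rater and IntelliMetric use far richer, proprietary feature sets on top of full NLP pipelines.

```python
import re

def extract_features(essay: str) -> dict:
    """Hand-crafted surface features of the kind early AES systems scored on.
    Illustrative only -- production systems add syntax, discourse, and
    error-detection features."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words, n_sents = len(words), len(sentences)
    return {
        "word_count": n_words,
        "sentence_count": n_sents,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        # type-token ratio: crude proxy for vocabulary variety
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "paragraph_count": essay.count("\n\n") + 1,
    }
```

A feature vector like this would then feed a regression model that predicts the score, which is exactly why tuning the features, not the model, dominated the engineering effort.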

A Neural Approach to Automated Essay Scoring

To eliminate the need for feature engineering, researchers turned to automatic feature extraction based on deep neural networks. Taghipour and Ng’s 2016 paper “A Neural Approach to Automated Essay Scoring” became a crucial milestone in this development. Many DNN-AES models have emerged since then and achieved state-of-the-art accuracy.

Neural networks automatically learn hierarchical features from raw data. To name just one example, in image processing, early layers capture simple features like edges, while later layers recognize complex patterns. Neural networks learn linguistic features progressively in essay scoring, from basic syntax to complex semantic relationships.

DNN applications in AES now use word embeddings—representing words as vectors in a multi-dimensional semantic space. Different architectures have been used, including recurrent neural networks, long short-term memory networks, and transformer-based models like BERT, which has shown excellent performance in AES tasks.
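Conceptually, the simplest neural scorer embeds each word, pools the vectors into one essay representation, and maps that to a score. The sketch below uses random embeddings, a toy vocabulary, and a linear head purely for illustration; real systems learn all of these parameters and replace mean pooling with an LSTM or transformer encoder.

```python
import random

random.seed(0)

# Toy vocabulary with randomly initialised 8-dimensional word embeddings.
# In a real system these are learned or pretrained (e.g. GloVe, BERT).
VOCAB = ["the", "essay", "argues", "clearly", "that", "evidence", "supports", "it"]
EMB = {w: [random.gauss(0, 1) for _ in range(8)] for w in VOCAB}

def score_essay(tokens, weights, bias):
    """Mean-pool word embeddings into one essay vector, then apply a
    linear scoring head. A stand-in for LSTM/transformer encoders."""
    vecs = [EMB[t] for t in tokens if t in EMB]            # ignore OOV words
    pooled = [sum(dim) / len(vecs) for dim in zip(*vecs)]  # mean pooling
    return sum(p * w for p, w in zip(pooled, weights)) + bias

weights = [random.gauss(0, 1) for _ in range(8)]
score = score_essay(["the", "essay", "argues", "clearly"], weights, bias=3.0)
```

Training then amounts to adjusting the embeddings and head weights so the predicted scores match human ratings, with no hand-designed features anywhere in the pipeline.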

Prompt-Specific vs Prompt-Agnostic Models

Research in AES has focused on prompt-specific models—systems trained for particular essay prompts with large sets of pre-graded essays. These models depend heavily on prompt-specific knowledge and don’t perform well with different prompts.

Cross-prompt automated essay scoring has become crucial for developing practical AES systems. Getting large quantities of pre-graded essays for each new prompt often proves unrealistic. Prompt Agnostic Essay Scorer (PAES) offers one solution, requiring no access to labeled or unlabeled target-prompt data during training.

Cross-prompt AES methods create models for target prompts by using scored essay data from other source prompts. New approaches include domain adversarial neural networks that build prompt-independent features and deep neural networks. These networks process sequences of part-of-speech tags instead of word sequences to alleviate prompt-specific influences.
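The part-of-speech idea can be sketched in a few lines. The tiny tag lexicon below is hypothetical; a real system would use a trained POS tagger (e.g. from spaCy or NLTK) rather than a lookup table.

```python
# Toy tag lexicon for illustration -- a real system would use a trained
# POS tagger, not this hypothetical table.
TOY_TAGS = {
    "the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
    "runs": "VERB", "sleeps": "VERB", "quickly": "ADV",
}

def to_pos_sequence(tokens):
    """Replace each word with its part-of-speech tag so the model sees
    syntactic structure rather than prompt-specific vocabulary."""
    return [TOY_TAGS.get(t.lower(), "X") for t in tokens]
```

“The dog runs quickly” and “The cat sleeps quickly” both map to DET NOUN VERB ADV, so two essays on different topics share a representation. That is exactly the prompt-independence cross-prompt models exploit.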

Breakthroughs in Language Models and Automated Essay Scoring

Diagram of Transformer architecture illustrating encoder and decoder layers with multi-head attention, normalization, and embeddings.

Image Source: Level Up Coding – Gitconnected

Transformer architectures have taken automated essay scoring to new heights of accuracy and reliability. Unlike older models, transformer-based approaches offer better contextual understanding and adapt more easily.

Transformer-based Models for Essay Understanding

Transformer models have transformed how AES systems understand student writing. RoBERTa-based scoring systems have reached a quadratic weighted kappa (QWK) score of 0.815, which suggests high reliability when they review essays. BERT and other transformer architectures show great promise too: they capture lexical features in lower layers and semantic properties in higher layers. These models excel at spotting subtle aspects of writing quality that go beyond basic metrics.

German BERT variants substantially outperform traditional regression methods, achieving Cohen’s Kappa scores of 0.52-0.59 while logistic regression only reaches 0.30. Transformers get these better results because they process contextual relationships in text through self-attention mechanisms.

Fine-tuning Pretrained Language Models for Trait Scoring

Customizing pretrained language models to specific assessment criteria is a breakthrough. Modern systems don’t just give overall scores. They can review individual traits like organization, main idea, support, and language usage. This works by training the model with trait-specific datasets and special architectures.

Models become better at matching human rater expectations when their weights are adjusted on scoring datasets. GPT-3.5 can even score written responses to science questions with just a few training examples. This few-shot learning ability reduces the need for the large domain-specific datasets that were required before.
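One common architecture for trait scoring puts a separate small scoring head per trait on top of a shared encoder representation. This is a minimal sketch with random weights; the trait names follow the ones mentioned above, and everything else is our own illustration, not any published system’s code.

```python
import random

random.seed(1)
TRAITS = ["organization", "main_idea", "support", "language_use"]
DIM = 16  # size of the shared essay representation (illustrative)

# One linear head (weights + bias) per trait over the shared encoder output.
HEADS = {t: ([random.gauss(0, 1) for _ in range(DIM)], random.gauss(0, 1))
         for t in TRAITS}

def score_traits(essay_vec):
    """Map one shared essay representation to a separate score per trait."""
    return {t: sum(x * w for x, w in zip(essay_vec, ws)) + b
            for t, (ws, b) in HEADS.items()}

scores = score_traits([random.gauss(0, 1) for _ in range(DIM)])
```

Because the encoder is shared, fine-tuning on one trait’s labels can also improve the representation the other heads read from, which is part of why trait-specific datasets go further than their size suggests.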

Cross-Prompt Generalization with Domain Adaptation

Domain adaptation techniques help score essays from prompts the system hasn’t seen before. GAPS (grammar-aware cross-prompt trait scoring) looks at prompt-independent syntax to build generic essay representations. Researchers also make use of meta-learning strategies. They create essay scoring tasks with source and target prompts to make systems work better across different types.

Domain adversarial neural networks are a great way to build features that don’t depend on specific prompts. Correlated linear regression methods let systems adapt flexibly when scoring essays across different prompts. These advances help AES systems take what they learn from scored essays and apply it to new, unfamiliar prompts.

Achieving 98% Human-Level Accuracy: Empirical Evidence

Histogram comparing total score percentages and counts for GPT-3.5, GPT-4, mixed, and student-only groups with error bars.

Image Source: Nature

Studies show remarkable progress in automated essay scoring. Recent benchmarks reveal unprecedented agreement between machine and human essay grading. These developments deserve a closer look to understand what they mean for educational assessment.

Benchmarking GPT-4 and Gemini on 4,819 Essays

Language models’ essay scoring abilities were tested against human raters. A detailed study put ChatGPT-4 and Gemini through two rounds of scoring 120 essays. GPT-4o aligned well with human assessments and achieved slightly higher correlation coefficients than GPT-4o mini. A bigger study of 1,800 essays found ChatGPT’s scores matched human graders within one point on a six-point scale 89% of the time for 943 essays; the match rate dropped to 83% for English papers and 76% for history essays.

Calibration Frameworks for Human-AI Agreement

Calibration techniques have been the most direct way to improve AI-human agreement. One method uses evidence-based probability distributions drawn from real-life decision patterns instead of manual scoring. This approach improved how well automated systems matched human decision-makers without changing the scoring system. Even so, AI and humans agree exactly only about 40% of the time, well short of the 50% agreement rate between human raters. Early calibration work focused on fixing AI’s habit of giving middle-range scores (between 2-5 on a 6-point scale).
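One simple calibration recipe is monotone histogram matching: remap each AI score level to the human score level at the same percentile, so the calibrated distribution tracks the human one. This sketch is our own illustration of the idea, not the published method.

```python
def calibrate(ai_scores, human_scores, levels=range(1, 7)):
    """Monotone histogram matching on a discrete 1-6 scale: map each AI
    score level to the human level at the same cumulative percentile."""
    def cdf(scores, s):  # fraction of scores at or below level s
        return sum(x <= s for x in scores) / len(scores)

    mapping = {}
    for s in levels:
        p = cdf(ai_scores, s)
        # smallest human level whose CDF reaches the same percentile
        mapping[s] = next(h for h in levels if cdf(human_scores, h) >= p - 1e-9)
    return [mapping[s] for s in ai_scores]
```

With conservative AI scores clustered at 3-4 and human scores spread over 2-6, `calibrate([3, 3, 4, 4], [2, 4, 5, 6])` stretches the AI scores out to `[4, 4, 6, 6]`, directly countering the middle-range habit described above.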

Scoring Consistency and Rater Bias Analysis

Current systems show both strengths and weaknesses in consistency metrics. Text-davinci-003 GPT’s first and second scores had a Quadratic Weighted Kappa of 0.682 (95% CI [.626, .738]), showing “substantial” agreement by standard guidelines. Basic GPT-only models achieved just “fair to moderate” agreement (0.388, 95% CI [.271, .505]); adding linguistic measures improved this coefficient substantially, to 0.605 (95% CI [.589, .620]). GPT-4o’s bias patterns showed it marked essays almost a point lower than humans across 13,121 essays (2.8 vs. 3.7 average). Asian American students received an extra quarter-point deduction compared to other groups, which raises serious concerns.
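A first-pass bias check of this kind just compares the mean AI-minus-human score gap per demographic group. The sketch below assumes a hypothetical record format of (group, human_score, ai_score) tuples; any real audit would also test whether the gaps are statistically significant.

```python
from statistics import mean

def score_gap_by_group(records):
    """Mean (AI minus human) score gap per demographic group.
    records: iterable of (group, human_score, ai_score) tuples (hypothetical
    format). A large negative gap for one group but not others flags
    potential rater bias against that group."""
    gaps = {}
    for group, human, ai in records:
        gaps.setdefault(group, []).append(ai - human)
    return {g: round(mean(d), 2) for g, d in gaps.items()}
```

For example, records where group A is consistently scored a point below the human grade while group B matches it would return `{"A": -1, "B": 0}`, surfacing exactly the kind of asymmetry reported above.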

Challenges and Future Directions in AES Research

Teacher concerns on AI in education: plagiarism 65%, reduced interaction 62%, data privacy 42%, job displacement 30%, automation 23%.

Image Source: AIPRM

Automated essay scoring has made huge strides, but some big hurdles still stand in the way of truly reliable systems. These challenges will shape how research moves forward in this field.

Not Enough Data and Labeling Problems

Few-shot learning runs into problems when there aren’t many examples to work with. Models tend to overfit when trained on limited datasets, a roadblock that keeps them from working well in new situations. Getting experts to review hundreds of documents makes manual labeling expensive and slow. Researchers now look at alternatives like distant supervision, active learning, and weak supervision, which help create approximate labels for unlabeled datasets.

Handling Different Types of Document Reviews

Regular AES systems struggle with subtle features like coherence and argumentation, and they can’t handle multiple types of content, which is a basic limitation. The new EssayJudge benchmark tests how well AES works at the word, sentence, and whole-essay level. It shows that current systems lag behind human graders, especially when assessing how ideas connect.

Using Chain-of-Thought to Explain Scores

Chain-of-thought (CoT) explanations might sound convincing but often don’t match what models actually compute. About 25% of recent CoT papers treat it as a way to understand the model’s process. Worse, CoT can introduce fabricated information and compound errors in scoring. Well-structured reasoning frameworks are vital for AES: they help create better feedback through feature extraction, rubric mapping, step-by-step explanations, and actionable suggestions.

Conclusion

Automated essay scoring has seen remarkable changes over the years. What started as simple feature-engineering has now grown into sophisticated neural network architectures. The path to reaching 98% human-level accuracy marks a huge milestone in educational technology. Early AES systems didn’t work as well as human evaluators. However, recent transformer-based models have made this gap much smaller.

A major breakthrough came when AES moved from manual feature engineering to neural approaches. Old systems depended on pattern matching and statistics. Modern systems now use deep learning to find meaningful features on their own. BERT and RoBERTa transformer architectures have shown they can capture both lexical features and semantic properties. This improves writing quality assessment beyond basic metrics.

The rise of fine-tuned pretrained language models has changed how traits are scored. Today’s systems review organization, main idea support, and language usage with great precision. Domain adaptation techniques help systems apply what they learned from scored essays to new prompts. This solves one of the biggest problems in AES.

Claims of 98% human-level accuracy need careful review. Studies show GPT-4 and Gemini match human evaluations well, but some differences exist. Calibration frameworks help optimize how AI and humans agree on scores. Yet scoring consistency changes based on demographic groups and essay types. These differences show we need to keep improving these systems.

AES research faces several hurdles. The lack of data and labeling create bottlenecks. Few-shot learning struggles with limited samples, which makes it hard for models to work with new situations. Systems also need to get better at handling different types of content. Chain-of-thought validation needs to improve to explain scores more clearly and accurately.

The future looks bright for automated essay scoring, but we must stay watchful. These systems now work almost as well as humans and offer new ways to speed up assessments while giving quick feedback. They must balance speed with fairness to evaluate all students equally. Automated essay scoring stands at a crucial point – ready to change educational assessment while tackling basic questions about validity, reliability, and teaching alignment.
