Introduction

Feedback is an essential component of educational instruction. Formative feedback is an important way to help students improve, especially if they are performing poorly or only about average. However, to enable such improvements, it must go beyond basic feedback forms such as points, grades or pass/fail decisions. The feedback must be highly informative so that students can improve and meet the educational standards. Otherwise, the combination of poor feedback and a lack of study success can initiate unproductive learning processes that are demotivating and leave students discouraged and hopeless (Nachtigall et al., 2020). The chance of a drop-out therefore increases with each failure.

Extensive evidence shows that such processes can be effectively broken through highly informative feedback (Hattie, 2009). According to Hattie (2009), feedback strongly affects learning success, with a mean effect size of d = 0.75. Wisniewski et al. (2020) even report a mean effect of d = 0.99 for highly informative feedback. In their understanding, highly informative feedback includes information on correctness, the correct solution, the type of processing, possibilities for improvement, and hints on self-regulation and learning strategies. Such feedback provides suitable conditions for self-directed learning (Winne & Hadwin, 2008) and effective metacognitive control of the learning process (Nelson & Narens, 1994).

Constructed response items such as essays are especially useful for testing the active knowledge of learners (Livingston, 2009). Moreover, essays offer learners the possibility to reflect on and consolidate knowledge. However, until a few years ago, providing highly informative feedback for text-based tasks within large university courses was simply impossible in terms of available personnel. Nowadays, computers, and especially developments in natural language processing, promise to make such feedback systems feasible by automating what previously required manual scoring labour.

Neural networks, particularly pre-trained transformer language models (Devlin et al., 2019), have demonstrated high performance in analysing student responses in past studies (Bai & Stede, 2022; Ke & Ng, 2019). However, studies which combine the implementation of textual response scoring systems with feedback provision and, on top of that, a practical evaluation of the resulting system with authentic learners are comparably rare.

An essential first step towards highly informative feedback is the design of digital learning environments and activities that provide the process and textual data needed to give feedback on learning processes. With this case study, we provide and evaluate such a learning activity that can deliver highly informative feedback to its learners, namely a system which provides students with feedback on the content of their essays within an introductory lecture on educational psychology for teacher students at Goethe University, Frankfurt, Germany.

For this learning activity, learners must first watch a video in which a German high school student presents ten learning tips. Following this, they must write short argumentative essays judging whether the provided learning tips are helpful based on the lecture contents. After the submission deadline, learners are shown feedback texts informing them about the quality of their essays and which lecture content they likely need to revisit.

For this purpose, we implemented a pipeline of multiple natural language processing components for assessing the essays automatically. Most previous work on essay scoring aimed for more generalisable solutions but mainly focused on holistically assessing the overall writing qualities of the essays at hand (Bai & Stede, 2022; Ke & Ng, 2019). In contrast, our system is distinguished from other essay scoring systems because it focuses on evaluating essays' content rather than the writing style.

For scoring the essays, following basic preprocessing, our system first segments the text into sections addressing the different learning tips. The resulting sections are coded individually following a scoring rubric. We collected and created datasets to train systems for both tasks. We evaluated two transformer-based architectures, GBERT and T5, and compared their performance against multiple bag-of-word baseline models. In the final step, the predicted scores are matched against a ground truth table offered by the course instructors, and the discrepancies are used to trigger rules populating a feedback template. Following the implementation, the feedback was evaluated with another cohort of learners using a randomised controlled trial.

This paper makes the following contributions:

  • A learning activity for formative scenarios that provides learners with content-related feedback.

  • A general architecture for the automated analytical assessment of essay content in combination with feedback generation.

  • A first evaluation of the quality of the provided feedback with an authentic cohort of students.

  • A novel dataset for analytic essay scoring in German.

Background

Feedback Generation

Feedback is arguably one of the most important components of educational instruction. Especially weaker students can profit from highly informative, constructive feedback. It is meant to help students reach a desired goal by pointing out the discrepancies between their current state and the goal (Hattie & Timperley, 2007; Nicol & Macfarlane-Dick, 2006). Hattie and Timperley (2007) distinguish four different levels of feedback:

  1. Task level: how well is a given task solved?

  2. Process level: how well does solving this task work?

  3. Self-regulation level: how well can a learner self-regulate while solving the task?

  4. Self-level: personal evaluations and affect of the learner.

Feedback can address one or multiple of these levels. Moreover, according to Hattie and Timperley (2007) and Wisniewski et al. (2020), feedback—at its core—needs to address three key questions: what are the goals (feed-up), how is it going for a student (feed-back) and where to go next (feed-forward). This is in line with Narciss (2008) who proposes that feedback should both reflect on learners’ mistakes and guide them on improving.

Feedback can come in many forms and be delivered through various modes, as its concrete form is usually highly dependent on the domain and task it is provided for (Di Mitri et al., 2022). In line with this, research on automated feedback systems has been comparably diverse, as the literature review by Cavalcanti et al. (2021) reveals. The earliest attempts to provide automated feedback were made in the context of intelligent tutoring systems (ITSs) (Deeva et al., 2021; Mousavinasab et al., 2021). For example, in terms of text-based ITSs, McDonald et al. (2013) implemented a system which provides immediate feedback on short answers using a rule-based chatbot that communicates with classifiers; these classifiers predict various analytic scores that trigger rules used to assemble feedback texts. Alobaidi et al. (2013) implemented a system which uses a knowledge base in combination with rules and similarity-based scoring to generate questions. Feedback on student answers is generated by indicating whether they were right or wrong together with the respective sample solution.

Generally, feedback systems are tailored towards one specific task or domain model, and the feedback is generated from templates by a combination of rules (Cavalcanti et al., 2021). A popular choice for implementing template-based feedback through a generic system is the OnTask system (Pardo et al., 2018), which delivers feedback texts via e-mail. With OnTask, feedback is assembled and sent according to flexible pre-defined rules determining which parts of a given template are rendered and to whom the feedback is sent. This allows for flexible feedback on all four levels addressed by Hattie and Timperley (2007), but it also requires the manual implementation of feedback rules and templates for each task and course.

Work aimed at providing automated feedback on essay writing has been comparably rare. Horbach et al. (2022) implemented a system for scoring learner essays according to an analytic rubric using bag-of-words classifiers based on gradient boosting. The predicted codes were then used for generating feedback, which was given to students periodically. The students could revise their essays to receive feedback on the revised version.

As this poses problems in terms of scaling feedback provision across different tasks and domains, there have been first attempts at completely automating the generation of feedback without the reliance on pre-made templates. Filighera et al. (2022) created a dataset for training models to generate feedback on short answers. They used it to train the T5 string-to-string transformer (Raffel et al., 2020) to dynamically generate feedback texts for unseen responses. However, while this seems like the most promising approach regarding flexibility and scalability, the remaining problem with this work is that their dataset targets three narrow domains, and the domain transfer capabilities of this approach have not been assessed. Moreover, the quality of the provided feedback is below that of human-provided feedback.

Automated Essay Scoring

Scoring and evaluating learners’ essays is a widely researched topic in educational natural language processing (Bai & Stede, 2022; Ke & Ng, 2019). Technology for scoring essays has been actively researched for decades, and the earliest work in the area was published by Page (1967). Past work often focused on students' writing skills rather than the content and predicted holistic scores or grades, indicating the overall quality of writing (Bai & Stede, 2022; Ke & Ng, 2019).

Most of the earlier work in this research area applied statistical machine-learning techniques combined with hand-crafted feature sets to the problem. Typical techniques include multiple variants of linear regression (Crossley et al., 2011, 2015; Landauer, 2003; Page, 1967), support vector machines/regression (Yannakoudakis et al., 2011; Yannakoudakis & Briscoe, 2012; Horbach et al., 2017; Cozma et al., 2018; Vajjala, 2018), or linear discriminant analysis (McNamara et al., 2015). Typical features include text length (Crossley et al., 2011), lexical features such as n-grams (Chen & He, 2013; Crossley et al., 2011; Page, 1967; Phandi et al., 2015; Zesch et al., 2015), syntactic features (Yannakoudakis & Briscoe, 2012), argumentation structure (Nguyen & Litman, 2018), the presence of keywords from curated lists (Crossley et al., 2011), readability (Zesch et al., 2015), semantic features and discourse structure.

Most of the more recent work in essay scoring is based on various neural network architectures and chooses an end-to-end learning approach that skips the step of defining specific features. Taghipour and Ng (2016) introduced one of the first neural essay-scoring systems. It encodes an essay text as a sequence of word bigrams using a chain of convolutional and LSTM layers; the centroid of all LSTM outputs is then used as input for a sigmoid regression layer. Subsequent neural models often focused on improving the weaknesses of this approach while using the same general architecture.

In line with the recent success of pre-trained transformer language models such as BERT (Devlin et al., 2019) in natural language processing, Mayfield and Black (2020) tested whether the generally high performance of such models also translates to the use case of holistic essay scoring. However, they reported that their models performed worse than previously published feature-based approaches, such as the one by Cozma et al. (2018), and similarly to a simple naive Bayes classifier they trained on a word n-gram feature set. Similar experiments by Rodriguez et al. (2019) suggest that pre-trained transformer language models can reach performance levels on par with the best feature-based models but perform worse than the best neural ones in essay scoring. Generally, this also aligns with results from Beseiso and Alzahrani (2020). Their study compares the appropriateness of hand-crafted features, static word embeddings and BERT-based contextual word embeddings for automated essay scoring, revealing that models based on these different features can achieve similar results. Results by Firoozi et al. (2023) suggest that using BERT-based models trained via active learning in a human-in-the-loop setup can significantly reduce manual coding labour in essay scoring.

While holistic grade prediction can help automate the scoring procedures of summative assessments, it is not a reasonable basis for providing learners with highly informative feedback (Horbach et al., 2017; Horbach et al., 2022). Such feedback ideally addresses multiple dimensions of learner performance, essay quality and content. This requires implementing more specialised systems that predict labels or scores for multiple scoring dimensions (Horbach et al., 2017). Accordingly, some works went beyond the prediction of holistic scores and addressed aspects of writing quality such as organisation or persuasiveness (Ke & Ng, 2019). Horbach et al. (2017) trained models for various scoring dimensions addressing different aspects of form, structure and language use. However, this work focused on the writing style of the essays.

Related Work in Content Scoring

Work focused on scoring the content of student responses dealt mostly with shorter answer formats (Bai & Stede, 2022; Burrows et al., 2015; Zesch et al., 2023). Supervised state-of-the-art approaches in this area try to detect whether a sample solution is entailed in a student response (Dzikovska et al., 2013) or use the different codes from pre-defined coding rubrics as labels for text classification.

Many earlier feature-based approaches aimed at this task used lexical features, semantic similarity, and lexical overlap scores as input for various statistical learners (Bai & Stede, 2022; Burrows et al., 2015). Other approaches tried to measure the semantic similarity of responses and sample solutions directly (Andersen & Zehner, 2021; Andersen et al., 2022; Bexte et al., 2022) using semantic vector spaces produced by, for example, SBERT (Reimers & Gurevych, 2019) or latent semantic analysis (Landauer et al., 1998) to allow for unsupervised or semi-supervised scoring.

However, these approaches are mostly outperformed by transformer-based models such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) in recent work, which could be shown to demonstrate superior performance compared to previous feature-based approaches (Bexte et al., 2022; Camus & Filighera, 2020; Gombert et al., 2022; Sung et al., 2019). This is in stark contrast to the recent results in automated essay scoring where, so far, results are more mixed.

Semantic Segmentation

As discussed in the previous section, past work in essay scoring focused on providing either holistic or dimension-specific scores for different skills related to writing and argumentation style. In contrast, work on content scoring was mostly focused on shorter answer formats. Essays have a guiding topic and usually consist of multiple segments addressing various sub-topics of the guiding topic. How these sub-topics are structured might differ from essay to essay and task to task, making it hard to implement heuristics for thematically segmenting them. Nonetheless, in essay content scoring tasks, the validity of these different segments likely needs to be evaluated separately (Horbach et al., 2022). For this reason, essays might require segmentation before different sub-topics can be addressed differently.

In natural language processing, segmentation is a widely researched topic with a long history (Choi, 2000). The most obvious examples of this are tokenisation and sentence splitting. However, semantic segmentation procedures go beyond these basic preprocessing steps. They aim to segment texts into distinct units based on content rather than form or function. This is mainly conducted as a preprocessing step in automatic text summarisation to identify semantically coherent text sections, which are then summarised individually to acquire a final summary.

Early work aimed to split texts into sections according to the different topics they addressed. This was first conducted by measuring the semantic similarity between sentences or paragraphs (Hearst, 1997). Subsequent work used techniques from topic modelling built upon Latent Dirichlet Allocation (Blei et al., 2003) to split the texts (Misra et al., 2011; Riedl & Biemann, 2012). In line with the general success of neural networks, more recent work applied LSTM-based neural network architectures (Li & Hovy, 2014; Li & Jurafsky, 2017) or pre-trained transformer language models (Zehe et al., 2021) to the problem of classifying sentences as segment borders.

Natural Language Processing and the German Language

German is, in line with a number of other languages, one of the more established languages in the context of natural language processing. Basic preprocessing and major problems such as constituency parsing or named entity recognition have been dealt with for decades in the context of German-focused NLP, and there exists a wide range of models and resources (Ortmann et al., 2019; Barbaresi, n.d.). Nonetheless, there is still a discrepancy in terms of available resources compared to English. Especially more niche problems that have been addressed for English remain unsolved for German.

German is an Indo-European language from the West-Germanic branch of the Germanic languages (Durrell, 2006). It is an inflected language with verbal and nominal inflection depending on four cases, three genders, and six tenses with a comparably high number of irregular forms (Cahill & Gazdar, 1999; Haiden, 1997). Moreover, German is highly productive in terms of word formation through compounding (Clahsen et al., 1996). This makes it more likely, compared to English, that inputs to NLP systems contain words unencountered during training. While this problem was dealt with in the past using linguistically motivated approaches, i.e., in the form of morpheme-level word splitting (Riedl & Biemann, 2016), modern NLP has often adopted more statistically motivated approaches, such as using character-level n-grams as vocabulary instead of linguistically motivated units (Song et al., 2021).

In recent years, transformer language models achieved major successes in German-focused NLP. Transformer language models usually implement one of three architectures: encoder, decoder, or the encoder-decoder, which combines the previous two (Vaswani et al., 2017). Encoders are usually aimed at providing vector representations of textual input. This makes them appropriate for text-, sentence- and token-level classification and regression tasks. Decoders are trained to predict the next token for a sequence of input tokens. As a consequence, decoders and encoder-decoders can also be used for text generation. The core strength of transformers is their learned, task-specific vector representations of text input.

These representations are calculated on the level of sub-word units, statistically frequent character n-grams from a respective training data set (Song et al., 2021). While these partially overlap with morphemes, they don’t carry a meaning by themselves. Through this, German word forms and compounds can be represented flexibly without the need for complex preprocessing. The attention mechanism in transformer language models encodes vector representations of these sub-words as weighted means of the representations of surrounding sub-words. Through this, the contextual semantics of these sub-words can be flexibly represented within respective vectors. As transformer-based models achieved state-of-the-art results for essential German NLP tasks such as Named Entity Recognition (Chan et al., 2020) or dependency parsing (Grünewald et al., 2021) without the need for language-specific preprocessing beyond sentence splitting and sub-word-level tokenisation, they are able to mitigate many traditional problems in processing German.

Method

Research Questions

This study aims to implement and evaluate an essay content scoring system that provides students with automated feedback on their performance. We aim to illustrate the creation of such a system with a case study to provide orientation for practitioners by implementing such a system for one particular task. This results in two concrete research goals. On the one hand, the components we implement for the system need to be evaluated. On the other hand, the reception of the feedback by learners also needs evaluation. Accordingly, we address the following research questions in this study.

  • RQ1: To what degree can we automate the analytic scoring procedures (segmentation and labelling) needed to provide highly informative feedback to learners concerning the content of their essays?

  • RQ2: How do students perceive highly informative feedback?

The Learning Activity

The learning activity we used to implement and evaluate our system gives students the task of writing essays. It was developed for educational psychology lectures at Goethe University, Frankfurt, Germany. These lectures are mainly attended by teacher students. In their lecture, students were taught the basic concepts of learning, assessment and diagnostics from the perspective of educational psychology. This involves, for example, concepts related to learning such as long-term memory, motivation or learning strategies. For the learning activity, they were tasked to evaluate the learning tips given in a video by a German high school student. For the essays, they were asked to connect the learning tips presented in the video to these concepts and explain their choices. The individual learning tips presented in the video are shown in Table 1.

Table 1 The ten learning tips presented in the video the learning activity is using

The exact task description given to students (translated from its original German version into English using ChatGPT):

The psychological foundations of learning are an important starting point for systematically supporting students' learning. Please watch the video by It’s Leena (2017) with the player available here and analyse the ten tips provided there. Please evaluate each of the tips one by one in terms of the following aspects:

  1. Does the tip actually promote learning, in your opinion?

  2. Please justify your statement and, where possible, assign it to one or more of the following lecture contents:

     a. Learning strategies (repeating, organising, elaboration)

     b. Metacognition, attention, working memory, long-term memory

     c. Motivation

  3. In view of the lecture contents, is there an important tip that you believe should be added? Please state this if you see one.

In your explanations, please use a separate paragraph for each tip. Please refer to at least two sources in your explanations. These can be scientific sources mentioned in the previous lectures or other scientific sources. Please cite exclusively according to APA guidelines; see Excursus: APA guidelines [LINK]. It is expected that you write the text yourself. Therefore, it must not have any significant textual similarity to texts from other students or sources (slides, textbook text).

Datasets & Coding Rubrics

The dataset used to train the machine learning components is a novel one and was collected from two cohorts of students. The data stems from teacher students who had to solve the task as a compulsory assignment during the lecture. It consists of N = 698 essays written in German. The learners had to hand in their essays as PDF documents on Moodle. The PDFs were downloaded from Moodle, and the text was then extracted via PyPDF. The title page and reference list were removed using heuristics. The texts were then further segmented into sentences and tokens via the SoMaJo tokeniser using the de_CMC model (Proisl & Uhrig, 2016).

Table 2 and Fig. 1 show the general distributional properties of the resulting corpus of texts. It was used to generate training corpora for segmentation and the assignment of final codes, conducted by two separate teams of coders.

Table 2 The distributional properties of the essay corpus
Fig. 1
figure 1

Violin plots depicting the distributional properties of the essay corpus

Coding Sentence to Learning Tip Correspondence

While the students were advised to write a paragraph for each learning tip they addressed, a closer inspection of the provided data revealed that many did not follow this advice. Consequently, it was impossible to simply split the essays into paragraphs for downstream processing. For this reason, we decided to introduce an additional segmentation step for preprocessing the texts. This step was operationalised as a sentence classification problem where the correspondence between given input sentences and learning tips is determined.

INCEpTION (Klie et al., 2018) was used to code the data accordingly. For this purpose, all essays were pre-tokenized and segmented into sentences with the help of SoMaJo (Proisl & Uhrig, 2016) and then imported into the annotation tool. Following this step, three coders were trained to annotate text sections with the corresponding learning tips (see Table 3 for the exact codes). The internal recommender feature of INCEpTION was activated.

Table 3 The coding rubric applied for coding the sentence to learning tip correspondence

First, in a pilot phase, all three coders coded the same 100 essays. Following this, the average agreement between all coders was measured. For this purpose, we calculated the agreement between all pairs of coders using Cohen’s Kappa (McHugh, 2012) and then averaged these values, yielding a global agreement of 0.96, which indicates an ‘almost perfect’ agreement between the three coders. For this reason, it was decided that no further parallel coding was needed, and the remaining essays were distributed between the coders.
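The agreement computation can be sketched as follows. This is a minimal illustration, assuming the codes of all coders are available as identically ordered label lists; the dictionary structure and labels are hypothetical, not the script used in the study.

```python
from itertools import combinations

from sklearn.metrics import cohen_kappa_score


def average_pairwise_kappa(codes_by_coder):
    """Average Cohen's Kappa over all coder pairs.

    codes_by_coder maps a coder name to their label sequence for the
    same, identically ordered set of sentences (hypothetical structure).
    """
    kappas = []
    for coder_a, coder_b in combinations(codes_by_coder, 2):
        kappas.append(cohen_kappa_score(codes_by_coder[coder_a],
                                        codes_by_coder[coder_b]))
    return sum(kappas) / len(kappas)


# Toy example with three coders labelling five sentences:
print(average_pairwise_kappa({
    "coder_1": ["tip_1", "tip_1", "tip_2", "none", "tip_3"],
    "coder_2": ["tip_1", "tip_1", "tip_2", "none", "tip_3"],
    "coder_3": ["tip_1", "tip_2", "tip_2", "none", "tip_3"],
}))
```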

Only 434 of the 698 essays were entirely coded during this step, which resulted in a satisfactory number of 22,581 sentences. The remainder of the essays were not coded for the segmentation step. Finally, the coded essays were exported from INCEpTION as UIMA XMI files and then processed with the help of DKPro-Cassis (Klie & Eckart de Castilho, n.d.) and Pandas to transform them into a tabular data format. Table 3 also shows the distribution of codes for the resulting data set.

Coding Lecture Topic to Learning Tip Correspondence

The coding of the correspondences between learning tips and lecture topics within essays was conducted by a team of seven coders on a shared Excel sheet. This sheet contained 100 columns, one for each combination of a learning tip and a topic from the lecture (see Table 4), and rows for each essay. The coders were tasked to read the essays individually and enter a “1” into the respective cell if a lecture concept was linked to a given learning tip within a given essay and a “0” otherwise.

Table 4 The coding rubric applied for coding topic to learning tip correspondence

In a pilot phase, 100 essays were coded in parallel with each of them being coded by two out of the seven coders. Following this, Cohen’s Kappa (McHugh, 2012) was calculated. We then used these individual values to determine the overall agreement (0.87). After this step, fundamental disagreements found during coding were resolved through discussion. As the interrater reliability was determined to be high, no additional parallel coding was conducted, and the remaining 598 essays were distributed among the coders.

After the coding, the sentences corresponding to the same learning tips within the same essays were joined into overall text sections. This resulted in a total of 6789 individual segments. These were combined with the codes from the shared Excel sheet using Pandas, resulting in ten unique codes per section, to acquire a final tabular dataset for multilabel classification.

System Architecture

The system was implemented as a pipeline composed of four components: 1) preprocessing, 2) segmentation, 3) analytic scoring and 4) feedback generation. In the first step, the input essays are extracted from PDF files uploaded to Moodle by the learners and segmented into sentences. Following this, the essays are segmented into sections addressing the different learning tips. These sections are then scored with regard to the lecture content learners associated a respective tip with. The predicted scores then trigger feedback rules to assemble the final feedback. Finally, the feedback texts are displayed in a custom Moodle plugin. Figure 2 depicts this pipeline schematically.

Fig. 2
figure 2

A schematic illustration of the pipeline steps of our systems

Preprocessing

The preprocessing procedures are identical to the steps described in the Datasets & Coding Rubrics section. Learners upload their essays as PDF files, from which the text is extracted using PyPDF. Following this, the title page and the reference section are removed using heuristics so that only the essay text remains. The text is then segmented into sentences and tokens using SoMaJo (Proisl & Uhrig, 2016). The choice of this tokeniser addresses a comparably novel challenge in processing German texts: dealing with gender-sensitive language. Within our data set, many students used gender-sensitive language. This practice has resulted from academic and societal debates on the role of the grammatical gender of German nouns. It is suspected that “grammatical gender doesn't just refer to gender per se, it does much more: It refers to social expectations to the sexes (gender) and thus on gender in a broader sense” (Nübling, 2018, p. 49). Moreover, empirical results suggest that the grammatical gender of nouns referring to people can evoke stereotypical thinking if no explicit reference to the gender of the described person is provided (Gygax et al., 2008).

For this reason, gender-sensitive language has become an established practice in academic writing to explicitly highlight gender diversity in cases where nouns are intended to address people of multiple genders. It is usually realised by adding a so-called “gender asterisk” in combination with the suffix “-innen” to gendered nouns. This is meant to highlight that women and diverse people are included in these nouns. For example, following this interpretation, the noun “Lehrer*innen” would refer to teachers of all genders, while “Lehrer” would refer only to male teachers.

When testing different tokenisers for preprocessing the data, we observed problems in correctly processing such nouns. A non-systematic sample analysis revealed that SoMaJo was the only tokeniser that showed no problems processing gender-sensitive language in German. Other tokenisers, such as those provided by OpenNLP or Stanza (Qi et al., 2020), often interpreted the asterisk as a sentence border, which resulted in badly split sentences.
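A minimal sketch of these preprocessing steps, assuming the pypdf package for text extraction and the SoMaJo tokeniser with the de_CMC model; the file path and the title-page/reference-list heuristic are simplified placeholders, not the exact rules used in our pipeline.

```python
from pypdf import PdfReader
from somajo import SoMaJo

# Extract the raw text from an uploaded essay PDF (path is illustrative).
reader = PdfReader("essay_submission.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)


def strip_title_and_references(text: str) -> str:
    """Placeholder for the heuristic removal of title page and reference list."""
    return text.split("Literaturverzeichnis")[0]  # simplified heuristic


body_text = strip_title_and_references(raw_text)

# Sentence splitting and tokenisation with SoMaJo's de_CMC model, which
# handles gender asterisks such as "Lehrer*innen" without breaking sentences.
tokenizer = SoMaJo("de_CMC", split_sentences=True)
sentences = [
    " ".join(token.text for token in sentence)
    for sentence in tokenizer.tokenize_text([body_text])
]
```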

Semantic Segmentation

Following the preprocessing step, the essays are segmented into multiple sections by the learning tips these individual sections address. This is conducted in the form of a sentence-in-context classification task, where, for each sentence of an essay, it is determined whether it corresponds to a learning tip and, if yes, to which one; if it does not correspond to a tip at all, or if it introduces an additional tip instead (see Table 3).

A sentence may not contain any direct information about which tip it corresponds to; however, valuable information is likely hidden in the surrounding sentences. Many of the essays addressed the learning tips in a similar order, which means that surrounding sentences referring to other tips can give essential hints on the tip a sentence of interest refers to. In such cases, sliding windows can provide this contextual information to a model. For this reason, we adopted the approach of Zehe et al. (2021) and use a sentence-level sliding window classifier for semantic segmentation so that models can access contextual information from neighbouring sentences during classification. The following equation illustrates the respective input for a sentence \({s}_{t}\) at position \(t\) for a context window size of \(m\):

$$inp=s_{t-m}\oplus\dots\oplus s_t\oplus\dots\oplus s_{t+m}$$
(1)

To acquire the resulting final text segments, all sentences corresponding to a particular learning tip are joined into individual text segments in chronological order. The following equation illustrates this step for a set of sentences with \({s}_{{t}_{k},n}\) being the \(n\)-th sentence referring to the \(k\)-th tip \({t}_{k}\):

$$inp=s_{t_k,1}\oplus\dots\oplus s_{t_k,n}$$
(2)
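The two equations translate into the following sketch, which first builds the windowed classifier inputs (Eq. 1) and then joins the sentences predicted for each tip into per-tip segments (Eq. 2); the separator string, window size and the "no_tip" label are illustrative choices, not our exact configuration.

```python
from collections import defaultdict

SEP = " [SEP] "  # separator placed between context sentences


def build_window_input(sentences, t, m=2):
    """Concatenate sentence s_t with m sentences of left and right context (Eq. 1)."""
    left = max(0, t - m)
    right = min(len(sentences), t + m + 1)
    return SEP.join(sentences[left:right])


def join_segments(sentences, predicted_tips):
    """Join all sentences predicted for the same tip in their original order (Eq. 2)."""
    segments = defaultdict(list)
    for sentence, tip in zip(sentences, predicted_tips):
        if tip != "no_tip":  # hypothetical label for sentences unrelated to any tip
            segments[tip].append(sentence)
    return {tip: " ".join(parts) for tip, parts in segments.items()}
```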

GBERT-based Models

We implemented classifiers based on the pre-trained GBERT transformer language model (Chan et al., 2020) to segment our learner essays. GBERT is a BERT model (Devlin et al., 2019) specifically pre-trained for processing German data. BERT is a discriminative transformer model which uses a transformer encoder to provide contextual embeddings of the tokens within a given input sentence. BERT models are first pre-trained with a large text corpus for masked language modelling and next-sentence prediction, which equips the model with an abstract understanding of the distributional properties of language. For GBERT, this step was carried out with exclusively German data. Depending on the task, the model is then fine-tuned in a second step using a smaller corpus. For this purpose, input texts are fed into the model to acquire output embeddings. Afterwards, these are pooled in various ways and fed into a final prediction layer, e.g., a linear maximum entropy layer for multiclass classification tasks. The layers learn to encode different formal and semantic aspects of languages (Tenney et al., 2019). This can be leveraged by end users who fine-tune the model for specific tasks.

In our approach, we combined GBERT with a sliding window, similar to Zehe et al. (2021). The classifier is implemented via the BertForSequenceClassification class of the Huggingface Transformers framework (Wolf et al., 2020) and fine-tuned with the regular Trainer class this framework provides, with AdamW (Loshchilov & Hutter, 2019) as the optimisation algorithm. As input, the model is given \(m\) sentences to the left and right of the sentence of interest, all separated by [SEP] tokens. This allows the model to use signals from surrounding sentences when classifying the sentence of interest.
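A condensed sketch of this fine-tuning setup with Huggingface Transformers; the checkpoint identifier, hyperparameters, placeholder training data and dataset wrapper are assumptions and not the exact configuration used in the study.

```python
import torch
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "deepset/gbert-base"  # assumed GBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=12)  # one output per segmentation code


class WindowDataset(torch.utils.data.Dataset):
    """Wraps windowed sentence inputs (see Eq. 1) and their integer tip labels."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


# Tiny placeholder data; in practice these are the windowed inputs and codes.
train_texts = ["Der erste Tipp betrifft Pausen. [SEP] Pausen helfen beim Lernen."]
train_labels = [0]

args = TrainingArguments(output_dir="gbert-segmenter",
                         num_train_epochs=3,  # as in our evaluation setup
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=WindowDataset(train_texts, train_labels))
trainer.train()  # the Trainer uses AdamW by default
```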

BoW Baseline Models

We implemented multiple baseline models combining a tf-idf-encoded bag-of-words feature set with different classification algorithms. For these baseline classifiers, each sentence, including its context sentences, was represented by a tf-idf vector. These vectors were concatenated to form an overall feature vector. We trained three models on top of this feature encoding using three different algorithms: Random Forests (Ho, 1995), Gradient Boosting (Mason et al., 1999) and Logistic Regression. We used Scikit-Learn (Pedregosa et al., 2011) and CatBoost (Prokhorenkova et al., 2018) to implement these systems.
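A sketch of one such baseline (logistic regression on tf-idf features); for brevity, the target sentence and its context are merged into a single string here, whereas the actual setup concatenates per-sentence tf-idf vectors. The example sentences and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for windowed sentence inputs and tip codes.
train_texts = ["Der erste Tipp rät zu regelmäßigen Pausen beim Lernen.",
               "Der zweite Tipp behandelt die Wiederholung des Stoffes."]
train_labels = ["tip_1", "tip_2"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)
print(baseline.predict(["Regelmäßige Pausen helfen beim Lernen."]))
```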

Automatic Scoring

During this step, we predict scores for each learning tip addressed in given input essays. This step consists of two sub-steps. First, we predict which concepts from the lecture are connected to a given learning tip. This is carried out in the form of a multi-label classification problem, as learners might connect multiple concepts to a learning tip. The models are trained to predict the encountered codes for a given input segment corresponding to a learning tip (see Table 4). The inputs for training are generated during the preceding segmentation step. Following this, the predictions are used to calculate the final scores.

Code Prediction: GBERT-based Models

One of the architectures we implemented for predicting codes for input segments is based on GBERT, implemented via the BertForSequenceClassification class from Huggingface Transformers (Wolf et al., 2020) which was fine-tuned for this purpose using the regular Trainer class from the same framework. It was trained in multilabel mode using the AdamW optimiser (Loshchilov & Hutter, 2019). As input, the model used the text segments produced from the previous step. The input sentences of a given text segment were separated by [SEP] tokens.
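The main difference to the segmentation model is the multi-label configuration; a minimal sketch, with the checkpoint name, label count and decision threshold being assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")  # assumed checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "deepset/gbert-base",
    num_labels=10,                              # one output per lecture-topic code
    problem_type="multi_label_classification")  # sigmoid outputs, BCE loss in training

# At inference time, each code is assigned independently via a threshold.
inputs = tokenizer("Tipp 3 fördert die Motivation, weil ...", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)
predicted_codes = (probs > 0.5).int()
```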

Code Prediction: T5-based Models

Inspired by the findings from Filighera et al. (2022) that string-to-string mapping encoder-decoder transformers such as T5 (Raffel et al., 2020) are capable of educational scoring tasks, we also evaluated T5 for this problem. In contrast to discriminative transformer models such as BERT, T5 is an encoder-decoder transformer model consisting of both an encoder and a decoder component. The model was trained to generate output strings for a given input string.

For this purpose, a given input string is tokenised and transformed into contextualised embeddings using the encoder component. The autoregressive decoder component then uses these to generate an output string while attending to the embeddings of the input string. This architecture is usually aimed at tasks such as machine translation or summarisation. However, fitting T5-based models for multi-label classification is possible by simply letting them generate output strings containing labels corresponding to given input strings.

For this reason, in our case, the model was trained to translate an input segment into an output string listing all detected codes in an arbitrary order, i.e., if a given input segment connects a learning tip to the topics of motivation and long-term memory, the resulting output string of the model would be ‘motivation long-term memory</s>’. We used the implementation provided by Huggingface Transformers (Wolf et al., 2020) in the form of the T5ForConditionalGeneration class. We trained the model using the AdaFactor optimiser (Shazeer & Stern, 2018).
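A sketch of this string-to-string formulation; the checkpoint name, the example segment and the exact label vocabulary are assumptions.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

MODEL_NAME = "google/flan-t5-base"  # assumed checkpoint

tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# A training pair maps a text segment to the space-separated list of its codes.
segment = "Tipp 4 hilft, weil er das Langzeitgedächtnis und die Motivation anspricht."
target = "motivation long-term memory"  # order of the codes is arbitrary

# During training, `labels` would be tokenizer(target).input_ids; at inference
# time, the generated string lists the predicted codes, e.g. "motivation long-term memory".
inputs = tokenizer(segment, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
predicted = tokenizer.decode(generated[0], skip_special_tokens=True)
```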

Code Prediction: BoW Baseline Models

As for the segmentation step, we implemented baseline systems based on tf-idf scores. The rationale is that a range of keywords and phrases signifies the final codes assigned to a respective text segment, and tf-idf scores allow models to identify these signals and their connection to the codes automatically. We used the full input text segments to calculate a feature vector. This was fed to three machine learning algorithms we evaluated for this purpose: Random Forests (Ho, 1995), Gradient Boosting (Mason et al., 1999) (in the form of one-vs-rest classifiers) and Logistic Regression. We used Scikit-Learn (Pedregosa et al., 2011) and CatBoost (Prokhorenkova et al., 2018) to implement these systems.

Calculation of Scores

For calculating the scores, the codes predicted for a given input segment are matched against a ground truth table denoting the correct concept assignments for each learning tip. This matching process is conducted separately for the three general topic areas addressed by the learning task:

  A. Learning strategies (repeating, organising, elaboration)

  B. Metacognition, attention, working memory, long-term memory

  C. Motivation

This process can be illustrated by the following formulas, which calculate each score as the proportion of correctly assigned codes among all codes assigned within a content area. \(A\) refers to one of the three content areas, \(T\) to the set of all learning tips, and \(t\) to a learning tip. \(C'(A, t)\) is the set of codes from content area \(A\) that were correctly assigned to a given learning tip \(t\), with \(c'\) denoting a given code from this set. Conversely, \(C''(A, t)\) denotes the set of codes from content area \(A\) that were incorrectly assigned to a given learning tip, with \(c''\) denoting one of these codes:

$$Corr(A) = \sum_{t \in T} \, \sum_{c' \in C'(A, t)} 1$$
(3)
$$Incorr(A) = \sum_{t \in T} \, \sum_{c'' \in C''(A, t)} 1$$
(4)
$$Score(A) = \frac{Corr(A)}{Corr(A) + Incorr(A)}$$
(5)

This results in scores in the range of \(0\le Score(A)\le 1\).
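Equations 3–5 can be translated directly into code; the dictionary-based representation of the predicted codes and the ground truth table below is an assumption for illustration.

```python
def area_score(predicted, ground_truth, area_codes):
    """Compute Score(A) for one content area (Eqs. 3-5).

    predicted and ground_truth map each learning tip to the set of codes
    assigned to it; area_codes is the set of codes belonging to area A.
    """
    correct = incorrect = 0
    for tip, codes in predicted.items():
        for code in codes & area_codes:
            if code in ground_truth.get(tip, set()):
                correct += 1    # contributes to Corr(A), Eq. 3
            else:
                incorrect += 1  # contributes to Incorr(A), Eq. 4
    if correct + incorrect == 0:
        return None             # area not addressed at all
    return correct / (correct + incorrect)  # Eq. 5


# Hypothetical example for content area C (motivation):
predicted = {"tip_1": {"motivation"}, "tip_2": {"motivation", "elaboration"}}
ground_truth = {"tip_1": {"repeating"}, "tip_2": {"motivation"}}
print(area_score(predicted, ground_truth, {"motivation"}))  # 0.5
```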

Feedback Generation

Similar to the approach of the OnTask system (Pardo et al., 2018), the feedback is generated from templates following hand-crafted rules. The reasoning here is that the feedback we provide addresses a narrow domain and needs to be reliable, which ruled out more experimental approaches such as using generative language models for feedback provision. We implemented a custom Moodle plugin for this purpose. The feedback texts were implemented using the Mustache templating engine packaged with the LMS and assembled by rules.

Depending on the scores for the three content areas, students are shown differing messages concerning which parts of the lecture content they understood well and which parts they need to revisit. Students are presented with one of four general texts, depending on their performance in the content areas. The performance is measured in three levels (low/medium/high), which depend on threshold values limiting the allowed discrepancies between the ground truth table and the predicted labels. A performance in a given content area is counted as low if the respective score is below 0.5, as medium if it lies between 0.5 and 0.8, and as high if it lies above 0.8. The feedback texts address the following cases (a sketch of this mapping follows the list):

  1. Overall negative feedback: A learner demonstrated a low performance in all three content areas. The learner is advised to revisit all lecture topics.

  2. Overall medium feedback: A learner demonstrated a medium performance in all three content areas. The learner is advised to revisit all lecture topics.

  3. Overall positive feedback: A learner performed well in all three content areas.

  4. Mixed feedback: A learner showed a high performance in one or two content areas and a low performance in one or two content areas. The learner is advised to revisit the contents of the content areas for which they demonstrated a low or medium performance.
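The sketch below illustrates the threshold mapping and rule selection described above; in the deployed system, the selected case determines which Mustache template the Moodle plugin renders. The function and case names are illustrative, not identifiers from our implementation.

```python
def performance_level(score: float) -> str:
    """Map a content-area score (0..1) to a performance level."""
    if score < 0.5:
        return "low"
    if score <= 0.8:
        return "medium"
    return "high"


def feedback_case(scores: dict) -> str:
    """Select one of the four feedback cases from the three area scores."""
    levels = {area: performance_level(s) for area, s in scores.items()}
    values = set(levels.values())
    if values == {"low"}:
        return "overall_negative"
    if values == {"medium"}:
        return "overall_medium"
    if values == {"high"}:
        return "overall_positive"
    return "mixed"  # advise revisiting areas with low or medium performance


print(feedback_case({"A": 0.9, "B": 0.4, "C": 0.7}))  # "mixed"
```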

Figure 3 shows an example of a generated mixed feedback text.

Fig. 3
figure 3

An example of mixed feedback generated by our method as displayed by our custom Moodle plugin

Evaluation

RQ1: Evaluation of the Machine Learning Models

In the first step, we evaluated the different segmentation models based on GBERT-base and GBERT-large, which were fine-tuned for three epochs, against the three TF-IDF baseline models. The evaluation was done via 5 × 5 cross-validation, with the F1 score, measured for each of the 12 labels, as the evaluation metric. Table 5 lists the individual results.
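One common reading of 5 × 5 cross-validation is five-fold cross-validation repeated five times; the following sketch illustrates this protocol with per-label F1 scores. It assumes integer-encoded labels and a generic model factory rather than our actual training code.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold


def cross_validate_f1(texts, labels, build_model, n_labels=12):
    """5 x 5 cross-validation with per-label F1 scores (labels encoded as 0..n_labels-1)."""
    texts, labels = np.array(texts), np.array(labels)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
    per_label_f1 = []
    for train_idx, test_idx in cv.split(texts, labels):
        model = build_model()  # fresh, untrained model for each fold
        model.fit(texts[train_idx], labels[train_idx])
        preds = model.predict(texts[test_idx])
        per_label_f1.append(f1_score(labels[test_idx], preds,
                                     labels=list(range(n_labels)),
                                     average=None, zero_division=0))
    return np.mean(per_label_f1, axis=0)  # mean F1 per label over all folds
```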

Table 5 The F1 evaluation results for the different segmentation approaches

As is visible, all approaches achieved overall high F1 scores (> 80.00) for the segmentation step, demonstrating the general feasibility of our segmentation approach. The Additional Tip category is an exception, for which all models achieved significantly worse results than for the other categories. A possible explanation is that this category covers additional learning tips introduced by participants. Compared to the other categories, which address pre-defined learning tips, it is therefore expected to be more diverse in terms of the observed input, as the learning tips brought up can vary widely. In addition, this category contains by far the fewest examples of all.

The performance of the baseline models was still significantly worse than that of the GBERT-based ones. This result is expected, given the general developments in natural language processing, where BERT-based models achieved state-of-the-art results for various tasks, and given that the baseline models operate solely on TF-IDF-encoded bag-of-word features. The model based on GBERT-large achieved the best results for all 12 labels, although they are very close to the ones achieved by the GBERT-base model.

In the next step, we evaluated the approaches for the final coding procedure. For this purpose, we again used 5 × 5 cross-validation and measured the performance of the models using F1 scores. For this evaluation step, the essays were segmented using the best model from the previous step, which was based on GBERT-large. Table 6 shows the respective results.

Table 6 The F1 evaluation results for the different approaches explored for the final coding step

The baseline based on logistic regression was significantly outperformed by all other models, which reached overall satisfactory F1 scores between 80 and 90. As for the segmentation step, the strongest approach is based on GBERT-large. It is followed by the approach based on FLAN-T5-large, which performed slightly worse overall but outperformed GBERT-large in two categories. The next best approaches, based on GBERT-base and FLAN-T5-base, showed a similar performance with an overall F1 difference of only 0.10 points.

The gap between the transformer-based models and the baseline models based on Random Forests and CatBoost was small for the categories with many examples. However, the baselines fell short in the categories with fewer examples, such as Organisation, Elaboration and Metacognition. Because the transformers operate on embeddings, i.e. word vectors that encode different task-dependent properties, and have been pre-trained on different corpora, they can better model the semantic relationships between words and phrases, which bag-of-words models cannot represent because they rely on indicator words and phrases observed during training. Consequently, transformers generalise better to unseen examples, which was likely the factor that put them ahead in categories with fewer examples in our case.

It is also visible that the predictions for the code Learning Strategies deviate from the other results, which suggest that this code was the most difficult for the models to predict. We could not find a clear reason for this, as the code is not less frequent than the other codes within the training set. A possible explanation could be that learning strategies as a general concept is broader than most of the other concepts addressed by the codes. While other codes address comparably narrow concepts, such as Organisation, or are predominantly signified by a small range of keywords and phrases, as is the case for the code Motivation, Learning Strategies refers to a broader range of concepts.

RQ2: Evaluation of the Feedback

To evaluate the quality of the provided feedback, we carried out a randomised controlled trial with an additional cohort of students (N = 257) who were given the task as a compulsory assignment for the lecture Introduction to Educational Psychology at Goethe University Frankfurt in the winter term 2022/2023. Participants were divided into two randomised groups to measure the quality of the feedback. The treatment group (N = 148) received highly informative feedback generated by our system as well as essential feedback on formalities (the correct usage of the APA citation style, the quality of the orthography, and the quality of the format) provided manually by student assistants. In contrast, the control group (N = 109) received only this essential feedback on formalities but no feedback related to the content of their assignments.

The feedback was provided with the help of a combination of the strongest models for segmentation and coding determined during the previous evaluation steps, which were the GBERT-large models. Both models were retrained on the complete data set for this purpose. All participants received feedback two days after the assignment submission deadline. Table 7 lists the distribution of feedback the participants from the treatment group received. As shown, most students from the treatment group received mixed feedback, i.e., they scored high in some content areas and low in others. No student who handed in an essay failed the content-related part of the task as scored by our system.

Table 7 The distribution of feedback types that were generated for the participants from the treatment group

For assessing which factors influenced the perception of the feedback, we applied a survey instrument consisting of six different items. These measure comprehensiveness, helpfulness, progress, opportunities for reflection, opportunities for regulation and motivational effects of the feedback. The items reflect the general criteria for highly informative feedback as specified by Wisniewski et al. (2020). All six items used four-point scales (strongly disagree, disagree, agree, strongly agree), and the survey results for the individual items are used as dependent variables. To reduce noise in the data and to only take serious answers into account, we excluded all participants who filled out the questionnaire in under 10 s (N = 9).

In the next step, we defined a set of independent variables whose relationship to the feedback perception we aimed to test. The central treatment variable is a binary one that encodes whether a participant received the treatment with highly informative feedback. In addition, we included three variables encoding whether participants passed the different formal requirements, namely APA, format and writing quality, as these also contributed to the feedback texts. Table 8 lists all variables together with their means and standard deviations. All variables except for the dependent ones are scaled to a range between 0 and 1.

Table 8 An overview of the six perception items used to evaluate the provided feedback

We supplemented the items with control variables measured by the FEMOLA instrument. This instrument was introduced by Pohlmann and Möller (2010). The rationale behind this is that reasons that motivate participants to choose their programme of study might also influence how they perceive the feedback. For example, it is imaginable that participants who chose the teaching profession primarily for financial security might perceive the provided feedback differently than participants who chose it rather for reasons of academic interest.

We then applied ordinary least squares regression analyses to assess the relationship between the dependent and independent variables. Collinearity was assessed by calculating the variance inflation factor of all independent variables. Variables that caused high collinearity (VIF > 1.5) were Pass Writing, Belief in own Abilities, and Easiness. As a consequence, we excluded these variables from the final analysis. Following this, we conducted the main analysis with the help of the Statsmodels library (Seabold & Perktold, 2010). Table 9 lists the summative results.
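A sketch of this analysis pipeline with Statsmodels; the file name and column names are placeholders and not the exact variable names used in the study.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One row per participant with independent and dependent variables
# (placeholder file and column names).
survey = pd.read_csv("feedback_survey.csv")
independent = ["treatment", "pass_apa", "pass_format", "femola_interest"]

# Collinearity check: drop independent variables with a high variance inflation factor.
X = sm.add_constant(survey[independent])
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
kept = [col for col, value in vif.items() if value <= 1.5]

# One ordinary least squares model per perception item (two items shown here).
for item in ["helpfulness", "motivation"]:
    model = sm.OLS(survey[item], sm.add_constant(survey[kept])).fit()
    print(model.summary())
```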

Table 9 The results of the ordinary least squares analysis

The treatment with highly informative feedback affected the perception of Helpfulness and Motivation. For Helpfulness, the results show a comparably strong positive effect, meaning that the highly informative feedback was perceived as significantly more helpful than the basic feedback. For Motivation, however, the results show a significant negative effect, meaning that the highly informative feedback was perceived as less motivating than the basic one.

However, the strongest effects were not observed for the treatment variable. Instead, they were visible for the variable encoding whether a student passed the APA requirements. Passing the APA requirements strongly affected the perception of Comprehensiveness, Motivation, Regulation, Progress and Helpfulness. These effects were all positive and stronger than those of the treatment variable, i.e., participants were likely more influenced by whether they passed the APA requirements than by the treatment.

Discussion

RQ1

Concerning RQ 1 (To what degree can we automate the analytic scoring procedures needed to provide highly informative feedback to learners concerning the content of their essays?), it can be stated that the approaches tested performed well in nearly all evaluation categories. The best approaches for the segmentation and coding steps were based on GBERT-large, an expected result given that transformer-based models are state-of-the-art for many NLP tasks. A more interesting result is that the encoder-decoder-based T5 architecture, which, in our case, translates an input string containing an essay segment into an output string containing a list of the predicted codes, performed close to the purely discriminative BERT-based models.

As Filighera et al. (2022) showed that T5 is, to a certain degree, able to generate individual feedback texts for short responses, it is imaginable to use it for such a feedback generation step in future work. For instance, one could use T5 to generate short texts explaining why linking certain concepts from the lecture to a specific tip was wrong. These could then be combined with the overall feedback to provide participants with more detailed information on their performance. However, before purely automatically generated feedback can be used in a practical context, an open question needs to be answered: the question of alignment. Lim et al. (2021) claim that feedback should clarify performance, facilitate self-assessment, deliver high-quality information, encourage dialogue, promote motivational beliefs and illustrate the gap between achieved and desired performance. Consequently, it needs to be assessed for purely generative feedback models what is needed in terms of architecture, prompting and training data to meet these requirements.

RQ2

Concerning RQ2 (How is the provided highly informative feedback perceived by the learners?), it can be stated that our template-based feedback was perceived positively overall, as indicated by the results in Table 9. We conducted a randomised controlled trial to identify significant effects contributing to the perception of the feedback, which was measured through six single items used as dependent variables. For this purpose, we conducted an ordinary least squares analysis. In our case, the treatment variable was whether or not a participant received highly informative feedback. The provision of highly informative feedback had a positive effect on Helpfulness and a negative effect on Motivation.

Regarding motivation, the highly informative feedback is perceived as less motivating than the basic feedback. A possible explanation could lie in the self-determination theory (SDT) (ten Cate, 2013; Fong & Schallert, 2023). Self-determination theory was introduced by Ryan and Deci (2000). It states that the intrinsic motivation of humans is linked to three core psychological needs: feelings of competence, autonomy, and relatedness. As feedback points out discrepancies between the current performance of a learner and a desired goal, “SDT predicts that it will not help in developing a feeling of competence, pride, and consequently, intrinsic motivation. At best, the addition of strengths has a balancing effect to make trainees not feel depressive about their overall competence” (ten Cate, 2013).

The highly informative feedback gives participants more details on task performance than the minimal feedback. This also bears the possibility that more discrepancies between participants' actual and perceived performance will be pointed out. If these discrepancies are too large, and the resulting feedback is unexpectedly negative, this might hurt participants’ feelings of competence, decreasing their intrinsic motivation. This assumption is also in line with empirical results. Fong et al. (2018), who conducted a meta-analysis on the effect of feedback on intrinsic motivation, found that, in many cases, feedback highlighting improvement areas can hurt learners' intrinsic motivation.

Interestingly, the question of whether participants passed the formal APA requirements of the assignment had a stronger overall effect on the different dependent variables compared to the highly informative feedback. While we cannot causally explain this effect, this likely stems from our participants being predominately first-semester students. This might have been one of the first assignments requiring many of them to cite sources properly. Since plagiarism is highly sanctioned throughout their studies, positive reinforcement regarding whether they cited properly might lead to relief, which might explain the stronger effects.

Conclusion

To summarise, we presented and evaluated an essay writing learning activity coupled with an automated feedback system that provided learners with feedback on the correctness of the content of their writing. For this purpose, we also collected and coded a custom data set composed of 698 essays written in German. This dataset was used for evaluating the performance of the individual machine-learning components constituting this pipeline. This evaluation was highly successful, demonstrating the feasibility of the proposed approaches. The results are in line with the general success of transformer language models.

Moreover, we evaluated how learners perceived the generated feedback by setting up a randomised controlled study with a cohort of learners. This revealed overall positive effects of the highly informative feedback, but also that our participants were, on average, more influenced by whether they passed the formal APA requirements. Given that the participants were students from an introductory lecture who do not yet have extensive experience in academic writing, this is an expected result. Nonetheless, the highly informative feedback positively affected helpfulness and reflection.

Limitations and Future Work

A clear limitation of our approach is that the feedback is generated solely through templates. While this guarantees stable feedback texts compared to purely generative models as used, for example, by Filighera et al. (2022), it also limits the possibilities for personalising the feedback. Ideally, feedback should address learners' mistakes in more detail and give them individual explanations of what they did wrong instead of just directing them to the appropriate lecture content. This is only feasible to a certain degree with purely template-based feedback and requires practitioners to consider all kinds of possible mistakes to prepare appropriate templates. This, in turn, hinders the scalability of such feedback. For this reason, our future work will mainly address how we can stabilise transformer-based feedback generation methods to allow for more individualised feedback. These could then be used to either populate placeholders within feedback templates or completely generate the feedback from scratch.

To release the full potential of highly informative feedback, the teaching concept, the course's intended learning outcomes and the design of the digital learning environment need to be closely aligned (Biggs, 1996). That means that after specifying a course's learning outcomes, it is not only necessary to define the assessment of these outcomes but also think about potential learning indicators from the data that provide valuable insights into the learner's state (Schmitz et al., 2022). Examples of well-designed learning environments that provide learning activities that directly send relevant indicators of learning are manifold (Ahmad et al., 2022). A future direction of work could thus also involve research on integrating these indicators into feedback generation models to provide feedback on learning processes in combination with feedback on the outcome. Integrating learning indicators into transformer-based feedback systems could provide additional information that could be leveraged for feedback generation.