---
library_name: transformers
tags:
- trl
- sft
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
datasets:
- IsmaelMousa/books
metrics:
- accuracy
- f1
- precision
- recall
- cohen_kappa
- rmse
model-index:
- name: Qwen2.5-1.5B-Instruct-Books-19K
  results:
  - task:
      name: Text Generation
      type: text-generation
    dataset:
      name: IsmaelMousa/books
      type: IsmaelMousa/books
      config: IsmaelMousa/books
      split: train
      args: IsmaelMousa/books
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.1200
    - name: F1
      type: f1
      value: 0.0973
    - name: Precision
      type: precision
      value: 0.1739
    - name: Recall
      type: recall
      value: 0.1198
    - name: Cohen Kappa
      type: cohen_kappa
      value: -0.1549
    - name: RMSE
      type: rmse
      value: 1.6248
language:
- en
pipeline_tag: text-generation
---
# Qwen2.5-1.5B-Instruct-Books-19K
This model is a fine-tuned version of [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on the Books dataset for Essay Grading.
- **Workflow:** GitHub Repository: [https://github.com/IsmaelMousa/automatic-essay-grading](https://github.com/IsmaelMousa/automatic-essay-grading).
- **Base Model:** Qwen2.5-1.5B-Instruct: [https://doi.org/10.48550/arXiv.2412.15115](https://doi.org/10.48550/arXiv.2412.15115).
- **Fine-tuning Dataset:** Books-19K: [https://github.com/IsmaelMousa/Books/19K](https://github.com/IsmaelMousa/automatic-essay-grading/blob/main/data/books/clean/entries/train/200_entries.csv).
- **Task:** Automatic Essay Grading (Text Generation).
[Visualize in Weights & Biases](https://api.wandb.ai/links/ismael-amjad/783p4r3l)
## Dataset
The Books dataset is a synthetic collection of essay-style data points generated using public domain literature and
large language model prompting. The dataset comprises 300 entries built from six classic books. Four
of these (*The Life of James Watt*, *The Life of Julius Caesar*, *The Moonstone*, and *North and South*) were used
during the training phase, while the remaining two (*The Life of Napoleon* and *Sense and Sensibility*) were held out
for benchmarking. Each book contributed exactly 50 entries, giving a split of 200 training
samples and 100 benchmark samples.
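
The split arithmetic above can be sketched in a few lines. This is a minimal illustration, not the repository's loader: the `book` key and the `split_by_book` helper are hypothetical names, and the entries are placeholders rather than real dataset rows.

```python
# Hypothetical sketch: partition entries by source book.
# The "book" field name is an assumption, not the dataset's confirmed schema.
TRAIN_BOOKS = [
    "The Life of James Watt",
    "The Life of Julius Caesar",
    "The Moonstone",
    "North and South",
]
BENCHMARK_BOOKS = ["The Life of Napoleon", "Sense and Sensibility"]
ENTRIES_PER_BOOK = 50


def split_by_book(entries):
    """Return (train, benchmark) lists according to each entry's source book."""
    train = [e for e in entries if e["book"] in TRAIN_BOOKS]
    benchmark = [e for e in entries if e["book"] in BENCHMARK_BOOKS]
    return train, benchmark


# Placeholder entries standing in for the real CSV rows.
entries = [
    {"book": book, "id": i}
    for book in TRAIN_BOOKS + BENCHMARK_BOOKS
    for i in range(ENTRIES_PER_BOOK)
]
train, benchmark = split_by_book(entries)
print(len(entries), len(train), len(benchmark))  # 300 200 100
```

Splitting by book rather than by row keeps every entry from a held-out title out of training, so benchmark questions come from texts the model did not see during fine-tuning.
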
All entries were generated using Le Chat Mistral, a model developed by Mistral AI. A carefully crafted prompt was used
to ensure each generated entry included a question, a reference answer written by an expert, a student answer meant to
simulate a real-world response, a mark scheme outlining the grading criteria, a score between 1 and 4, and a rationale
explaining why the score was assigned. The prompt enforced strict quality control: no duplicate questions or answers
were allowed, all required fields had to be present, and the scoring range was strictly limited to valid values. The
final output was formatted as CSV files to maintain consistency and ensure compatibility with downstream processing.
For more details, the metadata can be accessed at: [metadata](https://github.com/IsmaelMousa/automatic-essay-grading/blob/main/data/books/metadata.py).
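
The quality-control rules described above (all fields present, scores limited to 1 through 4, no duplicate questions) could also be checked after generation. The following is a rough sketch using only the standard library; the column names and the `validate_entries` helper are assumptions for illustration, and only question duplicates are checked here, not answer duplicates.

```python
import csv
import io

# Column names are assumptions based on the description above,
# not the repository's confirmed schema.
REQUIRED_FIELDS = [
    "question", "reference_answer", "student_answer",
    "mark_scheme", "score", "rationale",
]
VALID_SCORES = {1, 2, 3, 4}


def validate_entries(rows):
    """Return (row_index, problem) tuples; an empty list means every check passed."""
    problems = []
    seen_questions = set()
    for index, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            problems.append((index, f"missing fields: {missing}"))
            continue
        try:
            score = int(row["score"])
        except ValueError:
            score = None
        if score not in VALID_SCORES:
            problems.append((index, f"invalid score: {row['score']!r}"))
        key = row["question"].strip().lower()
        if key in seen_questions:
            problems.append((index, "duplicate question"))
        seen_questions.add(key)
    return problems


# Made-up rows for demonstration only.
sample_csv = """question,reference_answer,student_answer,mark_scheme,score,rationale
Q1,Ref 1,Ans 1,Scheme 1,3,Mostly correct
Q1,Ref 2,Ans 2,Scheme 2,2,Partially correct
Q3,Ref 3,Ans 3,Scheme 3,7,Out of range
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = validate_entries(rows)
print(problems)
```
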
## Modeling
The modeling approach was designed to compare large language models (LLMs) of different sizes on the automated essay grading task. We selected the Qwen2.5 architecture at three model sizes: 0.5B, 1.5B, and 3B. Each model was instruction-tuned on the Books dataset in varying sizes, with hyperparameters chosen to balance computational efficiency and performance. The experiments were conducted on GPU-accelerated hardware, using gradient checkpointing, flash attention, and mixed-precision training to maximize resource utilization.
## Evaluation
The evaluation methodology employed both quantitative metrics and qualitative analysis. For quantitative assessment on a held-out test set of 100 samples, we computed accuracy, precision, recall, F1 score, root mean squared error (RMSE), and Cohen's kappa score (CKS) for the scoring task, and BERTScore precision, recall, and F1 for rationale evaluation. Qualitative examination of the models' outputs revealed cases where most of the models correctly identified key aspects of student answers but sometimes failed to align their scoring with the rubric criteria.
### Evaluation results for `score` and `rationale` outputs:
| **Aspect** | **F1** | **Precision** | **Recall** | **Accuracy** | **CKS** | **RMSE** |
|:----------:|:------:|:-------------:|:----------:|:------------:|:-------:|:--------:|
| Score | 0.0973 | 0.1739 | 0.1198 | 0.1200 | -0.1549 | 1.6248 |
| Rationale | 0.4921 | 0.4872 | 0.5001 | -- | -- | -- |
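
To make the score metrics concrete, here is a minimal plain-Python sketch of accuracy, RMSE, and Cohen's kappa. It is not the author's evaluation code: the card lists scikit-learn, whose `accuracy_score` and `cohen_kappa_score` would normally be used, and the toy labels below are invented for illustration.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between predicted and reference scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
y_true = [1, 2, 3, 4, 2, 3]
y_pred = [1, 2, 4, 4, 3, 3]

print(f"accuracy={accuracy(y_true, y_pred):.3f}")
print(f"rmse={rmse(y_true, y_pred):.3f}")
print(f"kappa={cohen_kappa(y_true, y_pred):.3f}")
```

A negative kappa, as in the Score row of the table above, means the predicted scores agree with the reference scores less often than chance agreement would predict.
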
## Usage
Below is an example of how to use the model with the Hugging Face Transformers library:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "IsmaelMousa/Qwen2.5-1.5B-Instruct-Books-19K"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned checkpoint and wrap it in a text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant = pipeline("text-generation", tokenizer=tokenizer, model=model, device=device)

# Collect the four grading inputs interactively.
question = input("Question: ")
reference_answer = input("Reference Answer: ")
student_answer = input("Student Answer: ")
mark_scheme = input("Mark Scheme: ")

# The system prompt constrains the reply to a JSON object with 'score' and 'rationale'.
system_content = "You are a grading assistant. Evaluate student answers based on the mark scheme. Respond only in JSON format with keys 'score' (int) and 'rationale' (string)."
user_content = (
    "Provide both a score and a rationale by evaluating the student's answer strictly within the mark scheme range,"
    " grading based on how well it meets the question's requirements by comparing the student answer to the reference answer.\n"
    f"Question: {question}\n"
    f"Reference Answer: {reference_answer}\n"
    f"Student Answer: {student_answer}\n"
    f"Mark Scheme: {mark_scheme}"
)
messages = [{"role": "system", "content": system_content}, {"role": "user", "content": user_content}]

# Render the chat messages to a prompt string, then generate greedily.
inputs = tokenizer.apply_chat_template(messages, tokenize=False)
output = assistant(inputs, max_new_tokens=128, do_sample=False, return_full_text=False)[0]["generated_text"]

print(output)
```
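
Because the system prompt asks for a JSON object with `score` and `rationale`, the printed `output` usually needs to be parsed before it can be used. The sketch below uses only the standard library; `parse_grade` is a hypothetical helper, not part of the repository. The card lists `json-repair` among its frameworks, which suggests the author's pipeline handles malformed JSON more leniently than this strict version does.

```python
import json
import re


def parse_grade(raw_output):
    """Extract {'score': int, 'rationale': str} from raw model text, or None on failure."""
    # Take the span from the first "{" to the last "}" in case of surrounding text.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "score" not in data or "rationale" not in data:
        return None
    try:
        score = int(data["score"])
    except (TypeError, ValueError):
        return None
    return {"score": score, "rationale": str(data["rationale"])}


# Invented example strings, not real model outputs.
example = 'Sure. {"score": 3, "rationale": "Covers most key points."}'
print(parse_grade(example))
print(parse_grade("no json here"))
```

Returning `None` instead of raising lets a batch evaluation count unparseable generations separately from wrong scores.
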
### Frameworks
- `datasets-3.6.0`
- `torch-2.7.0`
- `transformers-4.51.3`
- `trl-0.17.0`
- `scikit-learn-1.6.1`
- `bert-score-0.3.13`
- `json-repair-0.46.0`