Files changed (1)
  1. README.md +182 -168
README.md CHANGED
@@ -1,168 +1,182 @@
1
- ---
2
- library_name: transformers
3
- base_model: Qwen/Qwen2.5-7B-Instruct
4
- license: apache-2.0
5
- datasets:
6
- - shibing624/chinese_text_correction
7
- language:
8
- - zh
9
- metrics:
10
- - f1
11
- tags:
12
- - text-generation-inference
13
- widget:
14
- - text: "文本纠错:\n少先队员因该为老人让坐。"
15
- ---
16
-
17
-
18
-
19
- # Chinese Text Correction Model
20
- `chinese-text-correction-7b` is a Chinese text correction model for spelling and grammar correction.
21
-
22
- Evaluation of `shibing624/chinese-text-correction-7b` on test data:
23
-
24
- The overall performance of CSC **test**:
25
-
26
- |input_text|predict_text|
27
- |:--- |:--- |
28
- |文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
29
-
30
- # Models
31
-
32
- | Name | Base Model | Download |
33
- |-----------------|-------------------|-----------------------------------------------------------------------|
34
- | chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
35
- | chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
36
- | chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
37
- | chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
38
-
39
-
40
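- ### Evaluation results
41
- - Metric: F1
42
- - CSC (Chinese Spelling Correction): a spelling correction model; it handles length-aligned errors such as phonetically similar characters, visually similar characters, and grammar errors
43
- - CTC (Chinese Text Correction): a text correction model; it supports length-aligned corrections such as spelling and grammar, and also handles length-changing errors such as redundant or missing characters
44
- - GPU: Tesla V100, 32 GB memory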
45
-
46
- | Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
47
- |:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
48
- | Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
49
- | Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
50
- | ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
51
- | MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
52
- | ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
53
- | Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
54
- | Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
55
-
56
- ## Usage (pycorrector)
57
-
58
- This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports fine-tuning large language models for text correction. Call it as follows:
59
-
60
- Install package:
61
- ```shell
62
- pip install -U pycorrector
63
- ```
64
-
65
- ```python
66
- from pycorrector.gpt.gpt_corrector import GptCorrector
67
-
68
- if __name__ == '__main__':
69
-     error_sentences = [
70
-         '真麻烦你了。希望你们好好的跳无',
71
-         '少先队员因该为老人让坐',
72
-         '机七学习是人工智能领遇最能体现智能的一个分知',
73
-         '一只小鱼船浮在平净的河面上',
74
-         '我的家乡是有明的渔米之乡',
75
-     ]
76
-     m = GptCorrector("shibing624/chinese-text-correction-7b")
77
-
78
-     batch_res = m.correct_batch(error_sentences)
79
-     for i in batch_res:
80
-         print(i)
81
-         print()
82
- ```
83
-
84
- ## Usage (HuggingFace Transformers)
85
- Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
86
-
87
- First, you pass your input through the transformer model, then you get the generated sentence.
88
-
89
- Install package:
90
- ```
91
- pip install transformers
92
- ```
93
-
94
- ```python
95
- # pip install transformers
96
- from transformers import AutoModelForCausalLM, AutoTokenizer
97
- checkpoint = "shibing624/chinese-text-correction-7b"
98
-
99
- device = "cuda" # for GPU usage or "cpu" for CPU usage
100
- tokenizer = AutoTokenizer.from_pretrained(checkpoint)
101
- model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
102
-
103
- input_content = "文本纠错:\n少先队员因该为老人让坐。"
104
-
105
- messages = [{"role": "user", "content": input_content}]
106
- input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
107
-
108
- print(input_text)
109
-
110
- inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
111
- outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
112
-
113
- print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))  # decode only the newly generated tokens
114
- ```
115
-
116
- output:
117
- ```shell
118
- 少先队员应该为老人让座。
119
- ```
120
-
121
-
122
- Model files:
123
- ```
124
- shibing624/chinese-text-correction-7b
125
- |-- added_tokens.json
126
- |-- config.json
127
- |-- generation_config.json
128
- |-- merges.txt
129
- |-- model.safetensors
130
- |-- model.safetensors.index.json
131
- |-- README.md
132
- |-- special_tokens_map.json
133
- |-- tokenizer_config.json
134
- |-- tokenizer.json
135
- `-- vocab.json
136
- ```
137
-
138
- #### Training parameters:
139
-
140
- - num_epochs: 8
141
- - batch_size: 2
142
- - steps: 36000
143
- - eval_loss: 0.12
144
- - base model: Qwen/Qwen2.5-7B-Instruct
145
- - train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
146
- - train time: 10 days
147
- - eval_loss: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/eval_loss_7b.png)
148
- - train_loss: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/train_loss_7b.png)
149
-
150
- ### Training dataset
151
- #### Chinese text correction dataset
152
-
153
- - Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
154
-
155
-
156
- To train a Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT).
157
-
158
- ## Citation
159
-
160
- ```latex
161
- @software{pycorrector,
162
- author = {Xu Ming},
163
- title = {pycorrector: Implementation of language model finetune},
164
- year = {2024},
165
- url = {https://github.com/shibing624/pycorrector},
166
- }
167
- ```
168
-
1
+ ---
2
+ library_name: transformers
3
+ base_model: Qwen/Qwen2.5-7B-Instruct
4
+ license: apache-2.0
5
+ datasets:
6
+ - shibing624/chinese_text_correction
7
+ language:
8
+ - zho
9
+ - eng
10
+ - fra
11
+ - spa
12
+ - por
13
+ - deu
14
+ - ita
15
+ - rus
16
+ - jpn
17
+ - kor
18
+ - vie
19
+ - tha
20
+ - ara
21
+ metrics:
22
+ - f1
23
+ tags:
24
+ - text-generation-inference
25
+ widget:
26
+ - text: '文本纠错:
27
+
28
+ 少先队员因该为老人让坐。'
29
+ ---
30
+
31
+
32
+
33
+ # Chinese Text Correction Model
34
+ `chinese-text-correction-7b` is a Chinese text correction model for spelling and grammar correction.
35
+
36
+ Evaluation of `shibing624/chinese-text-correction-7b` on test data:
37
+
38
+ The overall performance of CSC **test**:
39
+
40
+ |input_text|predict_text|
41
+ |:--- |:--- |
42
+ |文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
43
+
44
+ # Models
45
+
46
+ | Name | Base Model | Download |
47
+ |-----------------|-------------------|-----------------------------------------------------------------------|
48
+ | chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
49
+ | chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
50
+ | chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
51
+ | chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
52
+
53
+
54
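+ ### Evaluation results
55
+ - Metric: F1
56
+ - CSC (Chinese Spelling Correction): a spelling correction model; it handles length-aligned errors such as phonetically similar characters, visually similar characters, and grammar errors
57
+ - CTC (Chinese Text Correction): a text correction model; it supports length-aligned corrections such as spelling and grammar, and also handles length-changing errors such as redundant or missing characters
58
+ - GPU: Tesla V100, 32 GB memory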
59
+
60
+ | Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
61
+ |:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
62
+ | Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
63
+ | Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
64
+ | ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
65
+ | MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
66
+ | ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
67
+ | Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
68
+ | Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
69
+
70
+ ## Usage (pycorrector)
71
+
72
+ This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports fine-tuning large language models for text correction. Call it as follows:
73
+
74
+ Install package:
75
+ ```shell
76
+ pip install -U pycorrector
77
+ ```
78
+
79
+ ```python
80
+ from pycorrector.gpt.gpt_corrector import GptCorrector
81
+
82
+ if __name__ == '__main__':
83
+     error_sentences = [
84
+         '真麻烦你了。希望你们好好的跳无',
85
+         '少先队员因该为老人让坐',
86
+         '机七学习是人工智能领遇最能体现智能的一个分知',
87
+         '一只小鱼船浮在平净的河面上',
88
+         '我的家乡是有明的渔米之乡',
89
+     ]
90
+     m = GptCorrector("shibing624/chinese-text-correction-7b")
91
+
92
+     batch_res = m.correct_batch(error_sentences)
93
+     for i in batch_res:
94
+         print(i)
95
+         print()
96
+ ```
97
+
98
+ ## Usage (HuggingFace Transformers)
99
+ Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
100
+
101
+ First, you pass your input through the transformer model, then you get the generated sentence.
102
+
103
+ Install package:
104
+ ```
105
+ pip install transformers
106
+ ```
107
+
108
+ ```python
109
+ # pip install transformers
110
+ from transformers import AutoModelForCausalLM, AutoTokenizer
111
+ checkpoint = "shibing624/chinese-text-correction-7b"
112
+
113
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
114
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
115
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
116
+
117
+ input_content = "文本纠错:\n少先队员因该为老人让坐。"
118
+
119
+ messages = [{"role": "user", "content": input_content}]
120
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
121
+
122
+ print(input_text)
123
+
124
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
125
+ outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
126
+
127
+ print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))  # decode only the newly generated tokens
128
+ ```
129
+
130
+ output:
131
+ ```shell
132
+ 少先队员应该为老人让座。
133
+ ```
134
+
135
+
136
+ Model files:
137
+ ```
138
+ shibing624/chinese-text-correction-7b
139
+ |-- added_tokens.json
140
+ |-- config.json
141
+ |-- generation_config.json
142
+ |-- merges.txt
143
+ |-- model.safetensors
144
+ |-- model.safetensors.index.json
145
+ |-- README.md
146
+ |-- special_tokens_map.json
147
+ |-- tokenizer_config.json
148
+ |-- tokenizer.json
149
+ `-- vocab.json
150
+ ```
151
+
152
+ #### Training parameters:
153
+
154
+ - num_epochs: 8
155
+ - batch_size: 2
156
+ - steps: 36000
157
+ - eval_loss: 0.12
158
+ - base model: Qwen/Qwen2.5-7B-Instruct
159
+ - train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
160
+ - train time: 10 days
161
+ - eval_loss: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/eval_loss_7b.png)
162
+ - train_loss: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/train_loss_7b.png)
163
+
164
+ ### Training dataset
165
+ #### Chinese text correction dataset
166
+
167
+ - Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
168
+
169
+
170
+ To train a Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT).
171
+
172
+ ## Citation
173
+
174
+ ```latex
175
+ @software{pycorrector,
176
+ author = {Xu Ming},
177
+ title = {pycorrector: Implementation of language model finetune},
178
+ year = {2024},
179
+ url = {https://github.com/shibing624/pycorrector},
180
+ }
181
+ ```
182
+
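
The README in this diff distinguishes CSC models, which only make length-aligned corrections, from CTC models, which can also fix redundant or missing characters. As a rough illustration of that distinction, the sketch below uses Python's standard `difflib` to list the character-level edits between an input sentence and its correction and to check whether the pair is length-aligned. The helper names `diff_edits` and `is_length_aligned` are hypothetical and are not part of pycorrector, and the second sentence pair is invented for illustration.

```python
import difflib


def diff_edits(source: str, corrected: str) -> list[tuple[str, str, str]]:
    """List (operation, original_span, corrected_span) for every changed span.

    Operations are difflib opcode tags: 'replace', 'delete', or 'insert'.
    """
    # autojunk=False turns off difflib's "popular element" heuristic so that
    # every character takes part in the alignment.
    matcher = difflib.SequenceMatcher(None, source, corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, source[i1:i2], corrected[j1:j2]))
    return edits


def is_length_aligned(source: str, corrected: str) -> bool:
    """True when the correction keeps the character count (the CSC-style case)."""
    return len(source) == len(corrected)


if __name__ == "__main__":
    # Sentence pair taken from the README's sample prediction.
    src, tgt = "少先队员因该为老人让坐。", "少先队员应该为老人让座。"
    print(is_length_aligned(src, tgt), diff_edits(src, tgt))

    # Invented pair with a redundant character (a length-changing error).
    src2, tgt2 = "我喜欢吃苹苹果", "我喜欢吃苹果"
    print(is_length_aligned(src2, tgt2), diff_edits(src2, tgt2))
```

For the README's sample pair this reports two single-character substitutions (因 to 应, 坐 to 座); the invented pair is reported as a deletion, which is the kind of edit a length-aligned CSC model cannot express.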