English-to-Traditional Chinese Translator
This model is a fine-tuned version of Helsinki-NLP/opus-mt-en-zh, trained on the agentlans/en-zhtw-google-translate dataset.
It is optimized to produce Traditional Chinese translations by default, enhancing the naturalness and fluency of the output.
Model Description
- Input: English text only
- Output: Traditional Chinese translation
英文至繁體中文翻譯模型
本模型為 Helsinki-NLP/opus-mt-en-zh 的微調版本,使用 agentlans/en-zhtw-google-translate 資料集進行訓練。
模型已針對輸出繁體中文進行最佳化,提升了翻譯結果的自然度與流暢性。
模型說明
- 輸入: 僅支援英文文本
- 輸出: 繁體中文翻譯
How to use / 如何使用
```python
from transformers import pipeline

# Load the translation model
# 載入翻譯模型
model_checkpoint = "agentlans/en-zhtw"
translator = pipeline("translation", model=model_checkpoint)

# Correct English punctuation marks in the output to Traditional Chinese punctuation.
# 將輸出中的英文標點符號校正為繁體中文標點符號。
def en_to_zh_punct(text):
    punct = {
        '!': '!', '?': '?', ',': ',', '.': '。',
        ':': ':', ';': ';', '(': '(', ')': ')',
        '[': '【', ']': '】', '{': '{', '}': '}'
    }
    result, in_dq, in_sq = [], False, False
    for ch in text:
        if ch == '"':
            # Alternate between opening 「 and closing 」 for double quotes
            result.append("」" if in_dq else "「")
            in_dq = not in_dq
        elif ch == "'":
            # Alternate between opening 『 and closing 』 for single quotes
            result.append("』" if in_sq else "『")
            in_sq = not in_sq
        else:
            result.append(punct.get(ch, ch))
    return "".join(result)

# The main function for translating English to Traditional Chinese
# 將英文翻譯成繁體中文的主要函式
def translate(en_text):
    return [en_to_zh_punct(x["translation_text"]) for x in translator(en_text)]

# Example
# 範例
translate(
    [
        "Trump announces new tariffs on penguin islands. The penguins plan to tax U.S. imports in retaliation.",
        "We now return to the White House for the latest developments on the trade war.",
    ]
)
# ['川普宣佈對企鵝島徵收新關稅,企鵝打算對美國進口產品徵稅報復。', '我們現在回到白宮尋找貿易戰的最新發展。']
```
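For larger workloads, the same pipeline can run on a GPU and translate inputs in batches. Below is a minimal sketch that reuses `en_to_zh_punct` from the snippet above; the device index and batch size are arbitrary choices, not settings from this model card:

```python
from transformers import pipeline

# device=0 selects the first GPU; omit it (or use device=-1) to stay on the CPU.
translator = pipeline("translation", model="agentlans/en-zhtw", device=0)

sentences = [
    "The weather will be sunny tomorrow.",
    "Please bring an umbrella anyway.",
]

# batch_size controls how many sentences are translated per forward pass.
results = translator(sentences, batch_size=8)
print([en_to_zh_punct(r["translation_text"]) for r in results])
```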
Limitations / 限制
Limitations
- Effectively handles only short inputs of one or two English sentences; longer passages should be split into sentences first (see the sketch after this section).
- Struggles with English spelling, proper names, abbreviations, and especially technical terminology.
- Sometimes uses incorrect punctuation, such as the English comma instead of the Chinese comma.
- Has difficulty understanding context.
- As a result, it may generate inaccurate information or omit important details.
- Sometimes uses incorrect or unnatural wording because the base model was trained primarily on Simplified Chinese, whose vocabulary does not always map directly to Traditional Chinese.
限制
- 僅適用於處理一至兩句英文句子的輸入,處理較長段落時效果有限。
- 難以準確掌握英語拼字、專有名詞及縮寫,尤其在處理技術術語時表現不佳。
- 常出現標點符號使用不當的情況,例如以英文逗號取代中文逗號。
- 對語境的理解能力有限。
- 可能導致資訊不準確或遺漏重要細節。
- 由於基礎模型主要以簡體中文語料訓練,有時會使用不自然或錯誤的詞語,簡體與繁體用語之間也未必能精確對應。
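Since the model works best on one or two sentences at a time, longer English text can be split into sentences before translation and the results rejoined. A minimal sketch follows, assuming the `translate` function defined above; the regex splitter is a rough heuristic and not part of the model:

```python
import re

def translate_long(text):
    # Rough heuristic: split on sentence-ending punctuation followed by whitespace.
    # This mis-splits abbreviations such as "U.S."; a proper sentence segmenter
    # (e.g. nltk's sent_tokenize) gives better results.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Translate sentence by sentence and join the Traditional Chinese output.
    return "".join(translate(sentences))

translate_long("We now return to the White House. More on the trade war is expected later today.")
```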
Training procedure / 訓練過程
Training hyperparameters
The following hyperparameters were used during training (an equivalent configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 5.0
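For reference, the settings above roughly correspond to the following Hugging Face `Seq2SeqTrainingArguments`. This is a sketch rather than the actual training script; the output directory and the evaluation/logging settings are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="en-zhtw-finetune",        # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",                  # AdamW defaults: betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    num_train_epochs=5.0,
    eval_strategy="epoch",                # assumption: matches the per-epoch validation loss below
    include_num_input_tokens_seen=True,   # assumption: logs the "Input Tokens Seen" column
)
```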
Training results
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|---|---|---|---|---|
| 1.3993 | 1.0 | 99952 | 1.2487 | 54454616 |
| 1.2801 | 2.0 | 199904 | 1.1701 | 108935048 |
| 1.1728 | 3.0 | 299856 | 1.1232 | 163424808 |
| 1.1001 | 4.0 | 399808 | 1.0871 | 217911400 |
| 1.0243 | 5.0 | 499760 | 1.0584 | 272407288 |
Framework versions
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
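To check that a local environment matches the versions listed above, a quick sanity check:

```python
import transformers, torch, datasets, tokenizers

# Print the installed versions for comparison with the list above.
print(transformers.__version__, torch.__version__, datasets.__version__, tokenizers.__version__)
```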