---
language:
- ja
- en
tags:
- translation
- machine-translation
- academic
- japanese
- english
- marianmt
datasets:
- aspec
library_name: transformers
pipeline_tag: translation
model-index:
- name: ywc1/marian-finetuned-ja-en
  results:
  - task:
      name: Translation
      type: translation
      args:
        source_language: ja
        target_language: en
    dataset:
      name: ASPEC Japanese-English
      type: aspec
      split: test
    metrics:
    - name: BLEU
      type: bleu
      value: 32.73
    - name: METEOR
      type: meteor
      value: 0.66
    - name: COMET
      type: comet
      value: 0.85
---

# Model Card: Japanese-English Academic Translator [Sentence-Level]

## Model Details

Model name: ywc1/marian-finetuned-ja-en

Developed by: Susie Xu and Kenneth Zhang

Languages: Japanese → English

Finetuned from: Helsinki-NLP/opus-mt-ja-en

Architecture: MarianMT (Transformer encoder–decoder)

## Model Description

This model is fine-tuned from MarianMT for sentence-level translation of Japanese academic text into English.
It is trained on ASPEC (the Asian Scientific Paper Excerpt Corpus) and is designed to preserve technical vocabulary, proper nouns, and factual accuracy in the scientific domain.
Because it was trained on single sentences, it may underperform on multi-sentence or paragraph inputs. For longer academic passages, use ywc1/mbart-finetuned-ja-en-para.

## Intended Uses & Limitations

### Intended uses

- Translating individual sentences from Japanese academic papers into English.
- Helping researchers quickly understand scientific literature written in Japanese.

### Limitations

- Not optimized for conversational, literary, or informal text.
- May produce less fluent results for multi-sentence inputs.
- Sentence-level fluency can sometimes lag behind factual accuracy.

## How to Use

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = MarianMTModel.from_pretrained("ywc1/marian-finetuned-ja-en")
tokenizer = MarianTokenizer.from_pretrained("ywc1/marian-finetuned-ja-en")

# Translate a single Japanese sentence
text = "DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
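
Because the model was fine-tuned on single sentences, longer passages generally translate better when split into sentences first. A minimal sketch of that approach, reusing the model and tokenizer loaded above and assuming a naive split on the Japanese full stop 「。」 (a dedicated sentence segmenter may give better boundaries):

```python
def translate_passage(passage: str) -> str:
    """Split a Japanese passage on 。 and translate each sentence separately."""
    sentences = [s + "。" for s in passage.split("。") if s.strip()]
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return " ".join(tokenizer.decode(o, skip_special_tokens=True) for o in outputs)

print(translate_passage("第一の文を入力します。第二の文を入力します。"))
```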

### Web interface: Hugging Face Spaces

## Training Details

### Training Data

Dataset: ASPEC (Asian Scientific Paper Excerpt Corpus) – Japanese-English subset.

Size used: 300,000 sentence pairs (~30% of the total dataset).

Domain: Academic papers in science and technology (pre-2010).

### Preprocessing

- Removed dataset IDs.
- Kept only Japanese-English aligned pairs.
- Tokenized the source text (Japanese) and target text (English) for the encoder-decoder input, as sketched below.
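
A minimal sketch of the tokenization step, assuming the aligned pairs have already been loaded into parallel lists (the variable names and max_length value are illustrative, not taken from the original pipeline):

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

# Illustrative aligned pairs after ID removal and filtering
ja_sentences = ["線量率はDERSソフトウェアを用いて詳細に計算できる。"]
en_sentences = ["The dose rate can be calculated in detail using the DERS software."]

# Tokenize Japanese sources and English targets for the encoder-decoder
batch = tokenizer(
    ja_sentences,
    text_target=en_sentences,
    max_length=128,      # illustrative cap; the actual value is not reported
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["labels"].shape)
```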

### Training Procedure

Compute: Google Cloud Vertex AI, NVIDIA L1 GPU (40 GB RAM) for most training; occasional NVIDIA T4 on Colab.

### Hyperparameters

- Learning Rate: 0.0003
- Batch Size: 2
- Weight Decay: 1.0
- Gradient Accumulation Steps: 4
- Epochs: 6
- Training time: ~15 hours
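
A minimal sketch of how these hyperparameters could be expressed as Hugging Face Seq2SeqTrainingArguments, assuming the standard Seq2SeqTrainer workflow; the output directory and mixed-precision setting are illustrative, not taken from the original run:

```python
from transformers import Seq2SeqTrainingArguments

# Values from the hyperparameter list above; other settings are assumptions
training_args = Seq2SeqTrainingArguments(
    output_dir="marian-finetuned-ja-en",  # illustrative path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size of 8
    weight_decay=1.0,
    num_train_epochs=6,
    predict_with_generate=True,
    fp16=True,                            # assumption: mixed precision enabled
)
```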

## Evaluation

### Testing Data

Official ASPEC Japanese-English test split (sentence level).

### Metrics

| Metric | Base | Fine-tuned | Improvement |
|--------|------|------------|-------------|
| BLEU   | 12.19 | 32.73 | +169% |
| METEOR | 0.42 | 0.66 | +58% |
| COMET  | 0.77 | 0.85 | +11% |
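
The metrics above can be computed with the Hugging Face evaluate library. The exact evaluation script behind the reported numbers is not included here, so the snippet below is a minimal sketch on a single illustrative example:

```python
import evaluate

sources = ["DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"]
predictions = ['Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.']
references = ['Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.']

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
comet = evaluate.load("comet")  # downloads a large scoring model on first use

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(comet.compute(predictions=predictions, references=references, sources=sources)["mean_score"])
```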

### Example Outputs

Input:

DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。

Reference:

Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.

Model Output:

Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.

## Environmental Impact

Hardware: 1× NVIDIA L1 GPU (40 GB RAM)

Training time: ~15 hours

Cloud provider: Google Cloud Vertex AI

## Citation

BibTeX:

```bibtex
@misc{xu2025marianmtacademic,
  title={Japanese-English Academic Translator [Sentence-Level]},
  author={Xu, Yifan and Zhang, Kenneth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/susiexyf/marian-finetuned-ja-en}}
}
```