---
language:
- ja
- en
tags:
- translation
- machine-translation
- academic
- japanese
- english
- marianmt
datasets:
- aspec
library_name: transformers
pipeline_tag: translation
model-index:
- name: ywc1/marian-finetuned-ja-en
  results:
  - task:
      name: Translation
      type: translation
      args:
        source_language: ja
        target_language: en
    dataset:
      name: ASPEC Japanese-English
      type: aspec
      split: test
    metrics:
    - name: BLEU
      type: bleu
      value: 32.73
    - name: METEOR
      type: meteor
      value: 0.66
    - name: COMET
      type: comet
      value: 0.85
---

# Model Card: Japanese-English Academic Translator [Sentence-Level]

## Model Details

Model name: ywc1/marian-finetuned-ja-en

Developed by: Susie Xu and Kenneth Zhang

Languages: Japanese → English

Finetuned from: Helsinki-NLP/opus-mt-ja-en

Architecture: MarianMT (Transformer encoder–decoder)

## Model Description

This model is fine-tuned from MarianMT for sentence-level translation of Japanese academic text into English.
It is trained on ASPEC (the Asian Scientific Paper Excerpt Corpus) and is designed to preserve technical vocabulary, proper nouns, and factual accuracy in the scientific domain.
Because it was trained on single sentences, it may underperform on multi-sentence or paragraph inputs. For longer academic passages, use ywc1/mbart-finetuned-ja-en-para.

## Intended Uses & Limitations

### Intended uses

- Translating individual sentences from Japanese academic papers into English.
- Helping researchers quickly understand scientific literature written in Japanese.

### Limitations

- Not optimized for conversational, literary, or informal text.
- May produce less fluent results for multi-sentence inputs.
- Sentence-level fluency can sometimes lag behind factual accuracy.

## How to Use

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = MarianMTModel.from_pretrained("ywc1/marian-finetuned-ja-en")
tokenizer = MarianTokenizer.from_pretrained("ywc1/marian-finetuned-ja-en")

# Translate a single Japanese sentence
text = "DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
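
Because the model was fine-tuned on single sentences, longer passages generally translate better when split into sentences first. A minimal sketch of that approach, reusing the model and tokenizer loaded above and assuming a naive split on the Japanese full stop 「。」 (a dedicated sentence segmenter may give better boundaries):

```python
def translate_passage(passage: str) -> str:
    """Split a Japanese passage on 。 and translate each sentence separately."""
    sentences = [s + "。" for s in passage.split("。") if s.strip()]
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return " ".join(tokenizer.decode(o, skip_special_tokens=True) for o in outputs)

print(translate_passage("第一の文を入力します。第二の文を入力します。"))
```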

### Web interface: Hugging Face Spaces

## Training Details

### Training Data

Dataset: ASPEC (Asian Scientific Paper Excerpt Corpus) – Japanese-English subset.

Size used: 300,000 sentence pairs (~30% of the total dataset).

Domain: Academic papers in science and technology (pre-2010).

### Preprocessing

- Removed dataset IDs.
- Kept only Japanese-English aligned pairs.
- Tokenized the source text (Japanese) and target text (English) for the encoder-decoder input, as sketched below.
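
A minimal sketch of the tokenization step, assuming the aligned pairs have already been loaded into parallel lists (the variable names and max_length value are illustrative, not taken from the original pipeline):

```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ja-en")

# Illustrative aligned pairs after ID removal and filtering
ja_sentences = ["線量率はDERSソフトウェアを用いて詳細に計算できる。"]
en_sentences = ["The dose rate can be calculated in detail using the DERS software."]

# Tokenize Japanese sources and English targets for the encoder-decoder
batch = tokenizer(
    ja_sentences,
    text_target=en_sentences,
    max_length=128,      # illustrative cap; the actual value is not reported
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["labels"].shape)
```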

### Training Procedure

Compute: Google Cloud Vertex AI, NVIDIA L1 GPU (40 GB RAM) for most training; occasional NVIDIA T4 on Colab.

### Hyperparameters

- Learning Rate: 0.0003
- Batch Size: 2
- Weight Decay: 1.0
- Gradient Accumulation Steps: 4
- Epochs: 6
- Training time: ~15 hours
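
A minimal sketch of how these hyperparameters could be expressed as Hugging Face Seq2SeqTrainingArguments, assuming the standard Seq2SeqTrainer workflow; the output directory and mixed-precision setting are illustrative, not taken from the original run:

```python
from transformers import Seq2SeqTrainingArguments

# Values from the hyperparameter list above; other settings are assumptions
training_args = Seq2SeqTrainingArguments(
    output_dir="marian-finetuned-ja-en",  # illustrative path
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch size of 8
    weight_decay=1.0,
    num_train_epochs=6,
    predict_with_generate=True,
    fp16=True,                            # assumption: mixed precision enabled
)
```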

## Evaluation

### Testing Data

Official ASPEC Japanese-English test split (sentence level).

### Metrics

| Metric | Base | Fine-tuned | Improvement |
|--------|------|------------|-------------|
| BLEU   | 12.19 | 32.73 | +169% |
| METEOR | 0.42 | 0.66 | +58% |
| COMET  | 0.77 | 0.85 | +11% |
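
The metrics above can be computed with the Hugging Face evaluate library. The exact evaluation script behind the reported numbers is not included here, so the snippet below is a minimal sketch on a single illustrative example:

```python
import evaluate

sources = ["DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。"]
predictions = ['Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.']
references = ['Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.']

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
comet = evaluate.load("comet")  # downloads a large scoring model on first use

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(comet.compute(predictions=predictions, references=references, sources=sources)["mean_score"])
```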

### Example Outputs

Input:

DERSソフトウェアを用いれば「ふげん発電所」の線量率を詳細に計算できる。

Reference:

Details of dose rate of "Fugen Power Plant" can be calculated by using DERS software.

Model Output:

Using the DERS software, the dose rate of "Fugen power plant" can be calculated in detail.

## Environmental Impact

Hardware: 1× NVIDIA L1 GPU (40 GB RAM)

Training time: ~15 hours

Cloud provider: Google Cloud Vertex AI

## Citation

BibTeX:

```bibtex
@misc{xu2025marianmtacademic,
  title={Japanese-English Academic Translator [Sentence-Level]},
  author={Xu, Yifan and Zhang, Kenneth},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/susiexyf/marian-finetuned-ja-en}}
}
```