|
---
license: mit
license_name: deepseek
license_link: LICENSE
pipeline_tag: any-to-any
library_name: transformers
tags:
- multimodal
- text-to-image
- unified-model
---
|
|
|
## 1. Introduction |
|
|
|
Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. Its simplicity, high flexibility, and effectiveness make it a strong candidate for next-generation unified multimodal models.
|
|
|
[**Github Repository**](https://github.com/deepseek-ai/Janus) |
|
|
|
<div align="center"> |
|
<img alt="image" src="https://huggingface.co/deepseek-community/Janus-Pro-1B/resolve/main/janus_pro_teaser1.png" style="width:90%;"> |
|
</div> |
|
|
|
<div align="center"> |
|
<img alt="image" src="https://huggingface.co/deepseek-community/Janus-Pro-1B/resolve/main/janus_pro_teaser2.png" style="width:90%;"> |
|
</div> |
|
|
|
|
|
## 2. Model Summary
|
|
|
Janus-Pro is a unified understanding and generation MLLM that decouples visual encoding for multimodal understanding and generation. It is built on DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base.

For multimodal understanding, it uses [SigLIP-L](https://huggingface.co/timm/ViT-L-16-SigLIP-384) as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the [LlamaGen](https://github.com/FoundationVision/LlamaGen) tokenizer with a downsample rate of 16.
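
As a quick back-of-the-envelope check on those numbers (plain arithmetic, not part of the model API), the per-image token budget implied by a 384 x 384 input and a downsample rate of 16 works out as follows:

```python
# Illustrative arithmetic only: token budget implied by the figures quoted above.
image_size = 384        # input resolution supported by the vision encoder
downsample_rate = 16    # downsample rate of the generation tokenizer

grid_side = image_size // downsample_rate  # 24 tokens per side
num_image_tokens = grid_side ** 2          # 576 discrete tokens per image

print(f"{grid_side} x {grid_side} grid -> {num_image_tokens} image tokens")
```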
|
|
|
## 3. Usage Examples |
|
|
|
### Single Image Inference |
|
|
|
Here is an example of visual understanding with a single image. |
|
|
|
```python |
|
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-7B"

# Load processor and model
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prepare the conversation: one image (fetched from the URL by the processor) plus a text question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"}
        ]
    },
]

# Set generation mode to "text" to perform text generation
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
|
``` |
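
The example above pulls the image from a URL inside the chat template. For an image on disk, recent `transformers` chat templates also accept a local `path` entry; here is a minimal sketch, assuming a placeholder file `cats.jpg` and reusing the `processor` and `model` loaded above:

```python
# Variant of the understanding example with a local file; "cats.jpg" is a
# placeholder path, and the "path" key is resolved by the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "cats.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)

# Decode only the newly generated tokens, skipping the echoed prompt
response = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```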
|
|
|
### Text-to-Image Generation
|
Janus-Pro can also generate images from a text prompt by simply setting the generation mode to `image`, as shown below.
|
|
|
```python |
|
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-7B"

# Load processor and model
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."}
        ]
    }
]

# Apply the chat template to build the text prompt
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=prompt,
    generation_mode="image",
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Set the number of images to generate per prompt
model.generation_config.num_return_sequences = 2

outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True
)

# Decode the generated image tokens into pixel values, then convert them to PIL images
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()), return_tensors="PIL.Image.Image")

# Save each generated image
for i, image in enumerate(images["pixel_values"]):
    image.save(f"image{i}.png")
|
``` |
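
Since `do_sample=True` makes image generation stochastic, repeated runs produce different images. For reproducible outputs, one option is to fix the random seed before sampling; a minimal sketch using the generic `transformers` `set_seed` helper (not Janus-specific), reusing `model` and `inputs` from above:

```python
from transformers import set_seed

# Seed Python, NumPy, and PyTorch RNGs so the same prompt reproduces
# the same sampled image tokens across runs.
set_seed(42)

outputs = model.generate(
    **inputs,
    generation_mode="image",
    do_sample=True,
    use_cache=True
)
```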
|
|
|
## 4. License |
|
|
|
This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). The use of Janus-Pro models is subject to the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL).
|
## 5. Citation |
|
|
|
```bibtex
|
@article{chen2025janus, |
|
title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling}, |
|
author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong}, |
|
journal={arXiv preprint arXiv:2501.17811}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## 6. Contact |
|
|
|
If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]). |