|
|
--- |
|
|
library_name: diffusers |
|
|
license: apache-2.0 |
|
|
license_link: https://huggingface.co/BAAI/URSA-0.6B-FSQ320/blob/main/LICENSE |
|
|
pipeline_tag: text-to-video |
|
|
base_model: |
|
|
- Qwen/Qwen3-0.6B |
|
|
--- |
|
|
|
|
|
# URSA-0.6B-FSQ320 Model Card |
|
|
|
|
|
## Model Details |
|
|
- **Developed by:** BAAI |
|
|
- **Model type:** Text-to-Video Generation Model |
|
|
- **Model size:** 0.6B |
|
|
- **Model precision:** torch.float16 (FP16) |
|
|
- **Model resolution:** 512x320 |
|
|
- **Model paper:** [Uniform Discrete Diffusion with Metric Path for Video Generation](https://arxiv.org/abs/2510.24717) |
|
|
- **Model family:** [BAAI-Vision-URSA](https://github.com/baaivision/URSA) |
|
|
- **Model Tokenizer:** [Cosmos-Tokenize1-DV4x8x8-360p](https://huggingface.co/nvidia/Cosmos-Tokenize1-DV4x8x8-360p) (see the latent-grid note after this list)
|
|
- **Model Description:** This is a model that can be used to generate and modify videos based on text prompts. |
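
For orientation (this arithmetic is not stated in the card, and the causal 1 + 4n → 1 + n temporal mapping is an assumption), the tokenizer's 4x8x8 compression implies the following latent grid at the 512x320 resolution and the 49-frame clips used in the examples below:

```python
# Rough latent-grid math for URSA-0.6B-FSQ320; the causal temporal mapping is assumed.
width, height, num_frames = 512, 320, 49  # default resolution and clip length
t_comp, s_comp = 4, 8                     # Cosmos-Tokenize1-DV4x8x8: 4x temporal, 8x8 spatial

latent_frames = 1 + (num_frames - 1) // t_comp          # 1 + 48 / 4 = 13 latent frames
latent_h, latent_w = height // s_comp, width // s_comp  # 40 x 64 latent positions
print(latent_frames, latent_h, latent_w, latent_frames * latent_h * latent_w)  # 13 40 64 33280
```

The 1 + 48 frame structure is also reflected in the `ursa_1+48f.mp4` file name used in the image-to-video example below.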
|
|
|
|
|
## Examples |
|
|
|
|
|
Use the [🤗 Diffusers library](https://github.com/huggingface/diffusers) to run URSA simply and efficiently.
|
|
|
|
|
```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://[email protected]/baaivision/URSA.git
```
|
|
|
|
|
Running the pipeline: |
|
|
|
|
|
```python
import os, torch, numpy

from diffnext.pipelines import URSAPipeline
from diffnext.utils import export_to_video

# Reduce CUDA memory fragmentation when generating long frame sequences.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_id, height, width = "BAAI/URSA-0.6B-FSQ320", 320, 512
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = URSAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to(torch.device("cuda"))

text_prompt = "a lone grizzly bear walks through a misty forest at dawn, sunlight catching its fur."
negative_prompt = "worst quality, low quality, inconsistent motion, static, still, blurry, jittery, distorted, ugly"

# Text-to-Image
# "pipe(**locals())" forwards the local variables (prompt, negative_prompt, height, width,
# num_frames, num_inference_steps, ...) to the pipeline as keyword arguments.
prompt = text_prompt
num_frames, num_inference_steps = 1, 25
image = pipe(**locals()).frames[0]
image.save("ursa.jpg")

# Image-to-Video
# The "image" generated above is picked up from locals() as the conditioning frame;
# the "motion=<score>" prefix adds a motion score to the prompt.
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_1+48f.mp4", fps=12)

# Text-to-Video
image, video = None, None  # reset previous results so they are not passed as conditioning
prompt = f"motion=9.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
video = pipe(**locals()).frames[0]
export_to_video(video, "ursa_49f.mp4", fps=12)

# Video-to-Video
# Extend the clip iteratively: condition on the last "num_cond_frames" frames of the
# previous result, then append the newly generated continuation.
prompt = f"motion=5.0, {text_prompt}"
num_frames, num_inference_steps = 49, 50
num_cond_frames, cond_noise_scale = 13, 0.1
for i in range(12):
    video, start_video = video[-num_cond_frames:], video
    video = pipe(**locals()).frames[0]
    video = numpy.concatenate([start_video, video[num_cond_frames:]])
export_to_video(video, "ursa_{}f.mp4".format(video.shape[0]), fps=12)
```
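
As an optional sanity check (not part of the original example), you can inspect an exported clip with `imageio`, which the `imageio[ffmpeg]` install above already provides; the file name below assumes the text-to-video output `ursa_49f.mp4`:

```python
# Optional: verify frame count and fps of an exported clip (file name assumed from the example above).
import imageio

reader = imageio.get_reader("ursa_49f.mp4")
fps = reader.get_meta_data()["fps"]
num_frames = sum(1 for _ in reader)  # decode once to count frames
reader.close()
print(f"{num_frames} frames at {fps} fps")  # expected: 49 frames at 12.0 fps
```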
|
|
|
|
|
## Uses
|
|
|
|
|
### Direct Use
|
|
The model is intended for research purposes only. Possible research areas and tasks include:
|
|
|
|
|
- Research on generative models. |
|
|
- Applications in educational or creative tools. |
|
|
- Generation of artworks and use in design and other artistic processes. |
|
|
- Probing and understanding the limitations and biases of generative models. |
|
|
- Safe deployment of models which have the potential to generate harmful content. |
|
|
|
|
|
Excluded uses are described below. |
|
|
|
|
|
### Out-of-Scope Use
|
|
The model was not trained to produce factual or true representations of people or events; using it to generate such content is therefore out of scope for its abilities.
|
|
|
|
|
### Misuse and Malicious Use
|
|
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to: |
|
|
|
|
|
- Mis- and disinformation. |
|
|
- Representations of egregious violence and gore. |
|
|
- Impersonating individuals without their consent. |
|
|
- Sexual content without consent of the people who might see it. |
|
|
- Sharing of copyrighted or licensed material in violation of its terms of use. |
|
|
- Intentionally promoting or propagating discriminatory content or harmful stereotypes. |
|
|
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use. |
|
|
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc. |
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- The autoencoding part of the model is lossy. |
|
|
- The model cannot render complex legible text. |
|
|
- The model does not achieve perfect photorealism. |
|
|
- Fingers and other fine details may in general not be generated properly.
|
|
- The model was trained on a subset of the web datasets [LAION-5B](https://laion.ai/blog/laion-5b/) and [COYO-700M](https://github.com/kakaobrain/coyo-dataset), which contain adult, violent, and sexual content.
|
|
|
|
|
### Bias |
|
|
While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.
|
|
|