Improve model card: add sample usage and update library tag

#1
by nielsr (HF Staff) · opened
Files changed (1)
  1. README.md +39 -12
README.md CHANGED
@@ -1,21 +1,17 @@
  ---
- license: apache-2.0
  base_model:
  - Qwen/Qwen2.5-7B-Instruct
+ library_name: transformers
+ license: apache-2.0
  pipeline_tag: any-to-any
- library_name: bagel-mot
  ---

-
  <p align="left">
    <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
  </p>

-
  # 🥯 BAGEL • Unified Model for Multimodal Understanding and Generation

-
-
  <p align="left">
    <a href="https://bagel-ai.org/">
      <img
@@ -51,21 +47,55 @@ library_name: bagel-mot

  </p>

-
  > We present **BAGEL**, an open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data. BAGEL outperforms the current top‑tier open‑source VLMs like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text‑to‑image quality that is competitive with strong specialist generators such as SD3.
- Moreover, BAGEL demonstrates superior qualitative results in classical image‑editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.
+ Moreover, BAGEL demonstrates superior qualitative results in classical image‑editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, 3D manipulation, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.


  This repository hosts the model weights for **BAGEL**. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/bytedance-seed/BAGEL).

+ <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"></p>

+ ## Usage

- <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"></p>
+ Here's how to use BAGEL for multimodal inference:
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForCausalLM
+ from PIL import Image
+ import requests
+
+ # Load model and processor
+ model_id = "bytedance-seed/BAGEL"
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Example: Multimodal input (image and text)
+ # Load an image (e.g., from a URL or local path)
+ image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
+ image = Image.open(requests.get(image_url, stream=True).raw)

+ query = "What is in this image?"

+ messages = [
+     {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": query}]}
+ ]

+ # Apply chat template and prepare inputs
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

+ # Generate response
+ generated_ids = model.generate(**inputs, max_new_tokens=100)
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

+ print(response)
+ ```

  ## 🧠 Method
  BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
@@ -74,14 +104,11 @@ BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supe

  <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"></p>

-
  ## 🌱 Emerging Properties
  <p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"></p>

  As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages—multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and further supporting its role in the emergence of advanced capabilities.

-
-
  ## 📊 Benchmarks
  ### 1. Visual Understanding
  | Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
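The Method section quoted in the diff above describes a Mixture-of-Transformer-Experts (MoT) design in which attention is shared across the whole sequence while tokens of different modalities are served by separate expert parameters. Below is a minimal, self-contained PyTorch sketch of that routing idea only; it is not BAGEL's implementation, and the class and variable names (`MoTBlock`, `modality_ids`, the two-expert setup) are made up for illustration.

```python
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """Toy transformer block: shared attention, one feed-forward expert per modality."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_experts: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Attention is shared by every token, regardless of modality.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # One feed-forward expert per modality (e.g., 0 = text tokens, 1 = visual tokens).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) integer expert index per token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        ffn_out = torch.zeros_like(h)
        for idx, expert in enumerate(self.experts):
            mask = modality_ids == idx  # hard routing by a known modality mask, no learned gate
            if mask.any():
                ffn_out[mask] = expert(h[mask])
        return x + ffn_out


# Toy usage: one sequence of 6 "text" tokens followed by 4 "visual" tokens.
block = MoTBlock()
tokens = torch.randn(1, 10, 256)
modality_ids = torch.tensor([[0] * 6 + [1] * 4])
print(block(tokens, modality_ids).shape)  # torch.Size([1, 10, 256])
```

Hard routing by a known modality mask, rather than a learned gate, is one way to read the card's framing of fixed per-modality experts, while the shared attention keeps text and visual tokens in a single context.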