mbreuss committed · Commit e982614 · verified · 1 Parent(s): 7c243ea

Update README.md

Files changed (1): README.md (+59 -30)

README.md (as updated):
---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- robotics
- VLA
---

# FlowerVLA - Vision-Language-Action Flow Model for {dataset_name}

This is a pretrained FlowerVLA model for robotic manipulation, trained on the {dataset_name} dataset.
Flower is an efficient Vision-Language-Action Flow policy for robot learning that contains only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture (sketched below)
- Provides an efficient, versatile VLA policy with only ~1B parameters
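
To make the flow-matching bullet concrete, here is a minimal, illustrative sketch of decoding an action chunk by integrating a learned velocity field from Gaussian noise to actions with Euler steps. The names (`velocity_net`, `context`, `n_steps`) are hypothetical and not the actual FlowerVLA API:

```python
import torch

def decode_action_chunk(velocity_net, context, chunk_len=10, act_dim=7, n_steps=10):
    """Illustrative flow-matching sampler: integrate dx/dt = v(x, t, context)."""
    x = torch.randn(1, chunk_len, act_dim)  # start the chunk from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)              # flow time in [0, 1)
        x = x + dt * velocity_net(x, t, context)  # Euler step along the learned field
    return x  # decoded (1, chunk_len, act_dim) action chunk
```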

## Model Performance

This checkpoint contains weights for the CALVIN ABC challenge, where it currently ranks first with the following results (columns 1–5 give the success rate for completing that many tasks in a row; **Avg. Len.** is the average number of tasks completed per 5-task rollout):

| Train→Test | Method | 1 | 2 | 3 | 4 | 5 | **Avg. Len.** |
|------------|--------|---|---|---|---|---|---------------|
| {dataset_name} | FlowerVLA | 99.3% | 95.9% | 90.5% | 84.8% | 77.5% | 4.54 |

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta end-effector (EEF) actions
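
As a quick shape check, dummy inputs matching this specification can be built as follows (batch size, horizon, and image resolution here are illustrative, not values prescribed by this card); these also stand in for the `static_image` and `gripper_image` placeholders used in the Usage snippet below:

```python
import torch

B, T, H, W = 1, 1, 224, 224                 # illustrative batch, horizon, resolution
static_image = torch.randn(B, T, 3, H, W)   # RGB static camera frames
gripper_image = torch.randn(B, T, 3, H, W)  # RGB gripper camera frames
```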

## Usage

Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model on one of the environments.
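
Before running inference, the checkpoint has to be instantiated and loaded. For reference, here is a lightly cleaned version of the loading snippet from the previous revision of this card; treat it as a sketch, since the `{repo_id}` placeholder, the `config.json` layout, and the `flower.models.flower.FLOWERVLA` Hydra target are taken from that revision and may have changed:

```python
import json
import os

import hydra
import torch
from huggingface_hub import snapshot_download
from omegaconf import OmegaConf

# Download checkpoint and config from the Hub
model_path = snapshot_download(repo_id="{repo_id}")

with open(os.path.join(model_path, "config.json")) as f:
    config = json.load(f)

# Build the model from its Hydra config
model_cfg = OmegaConf.create(config["model_config"])
model_cfg["_target_"] = "flower.models.flower.FLOWERVLA"
model = hydra.utils.instantiate(model_cfg)

# Load the pretrained weights
state_dict = torch.load(os.path.join(model_path, "model.pt"))
model.load_state_dict(state_dict)
model.eval()
```

Once the model is loaded, inference takes an observation dict and a language goal: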
```python
# Observation and goal format; image tensors are (B, T, 3, H, W) as specified above
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # static camera frames
        "rgb_gripper": gripper_image,  # gripper camera frames
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)  # returns a (B, T, 7) delta EEF action chunk
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05
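
For completeness, a minimal sketch of the equivalent optimizer setup in PyTorch (assuming the hyperparameters above apply to all parameter groups, which this card does not specify):

```python
import torch

# Hypothetical setup mirroring the configuration above; the actual training
# code may use parameter groups or schedulers not listed on this card.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,            # Learning Rate
    weight_decay=0.05,  # Weight Decay
)
```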

## Citation

```bibtex
@inproceedings{reuss2025flower,
  # Add citation when available
}
```

## License

This model is released under the MIT license.