pepijn223 HF Staff committed
Commit ed5cf2c Β· verified Β· 1 Parent(s): 9cd682f

Update README.md

Files changed (1):
  1. README.md +36 -49

README.md CHANGED
@@ -1,21 +1,44 @@
- # Ο€β‚€ - Base
-
- This is a PyTorch version of the PI0 `pi0_base model`, converted from the original JAX/Flax implementation.
-
- ## Model Details
-
- - **Architecture**: PI0 (Vision-Language-Action model)
- - **Model Type**: PI0
- - **Domain**: Base model (general purpose)
- - **Precision**: 32-bit floating point (fp32)
- - **Vision Model**: PaliGemma (gemma_2b)
- - **Action Expert**: gemma_300m
-
- ## Key Features
-
- - **Vision-Language-Action**: Multimodal model combining vision, language, and action
- - **PaliGemma Backbone**: Leverages PaliGemma for vision-language understanding
- - **Continuous State Input**: Direct continuous state input processing

  ## Conversion Details

@@ -29,42 +52,6 @@ python examples/convert_jax_model_to_pytorch.py \
  --precision float32
  ```

- ## Usage
-
- ```python
- from openpi.models_pytorch.pi0_pytorch import PI0Pytorch
- import torch
-
- # Load the model
- model = PI0Pytorch.from_pretrained("pepijn223/pi0_base_fp32")
-
- # The model expects inputs in the format:
- # - images: torch.Tensor of shape [batch, height, width, channels]
- # - text: tokenized text prompts
- # - proprioceptive_state: robot state information (if applicable)
- ```
-
- ## Model Architecture
-
- The model consists of:
- 1. **Vision Encoder**: PaliGemma-based vision processing
- 2. **Language Encoder**: Text prompt understanding
- 3. **Action Expert**: Specialized network for action prediction
- 4. **Integration Layer**: Combines multimodal information for action output
-
- ## Training Data
-
- This model was trained on robotics datasets appropriate for its domain:
- - **DROID models**: Trained on diverse robot manipulation data
- - **LIBERO models**: Trained on diverse tabletop manipulation scenarios
- - **Base models**: Trained on general robotics datasets
-
- ## Limitations
-
- - Model performance depends on similarity between deployment and training environments
- - May require domain-specific fine-tuning for optimal performance
- - Action space must match the trained action dimension (32)
-
  ## Citation

  If you use this model, please cite the original OpenPI work:
 
1
+ # Ο€β‚€ (Pi0)
+
+ Ο€β‚€ is a **Vision-Language-Action model for general robot control** from Physical Intelligence. The LeRobot implementation is adapted from their open-source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
+
+ ## Model Overview
+
+ Ο€β‚€ is the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robots, which are narrow specialists programmed for repetitive motions, Ο€β‚€ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
+
+ ### The Vision for Physical Intelligence
+
+ As described by Physical Intelligence, AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, yet human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess is an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. Ο€β‚€ is a first step toward artificial physical intelligence that lets users simply ask robots to perform any task they want, just as they can with large language models.
+
+ ### Architecture and Approach
+
+ Ο€β‚€ combines several key innovations:
+
+ - **Flow Matching**: Augments the pre-trained VLM with continuous action outputs via flow matching, a variant of diffusion modeling (see the sketch after this list)
+ - **Cross-Embodiment Training**: Trained on data from 8 distinct robot platforms, including UR5e, Bimanual UR5e, Franka, Bimanual Trossen, Bimanual ARX, Mobile Trossen, and Mobile Fibocom
+ - **Internet-Scale Pre-training**: Inherits semantic knowledge from a pre-trained 3B-parameter Vision-Language Model
+ - **High-Frequency Control**: Outputs motor commands at up to 50 Hz for real-time dexterous manipulation
+
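+ The flow-matching head is what turns the VLM backbone into a controller, so a small sketch of the decoding loop may help. This is an illustrative outline of the general technique, not code from OpenPI or LeRobot: `velocity_model`, the 10 integration steps, and the 50-step action horizon are placeholder assumptions.
+
+ ```python
+ import torch
+
+ def sample_action_chunk(velocity_model, observation, horizon=50, action_dim=32, num_steps=10):
+     """Decode a continuous action chunk by integrating a learned velocity field.
+
+     `velocity_model(observation, actions, t)` stands in for the action expert:
+     given a noisy action chunk and a flow time t in [0, 1], it predicts the
+     velocity that moves the chunk toward the action distribution.
+     """
+     # Start from Gaussian noise over the whole chunk of future actions.
+     actions = torch.randn(1, horizon, action_dim)
+     dt = 1.0 / num_steps
+     # Euler integration of dA/dt = v_theta(A, t | observation) from t=0 to t=1.
+     for step in range(num_steps):
+         t = torch.full((1,), step * dt)
+         actions = actions + dt * velocity_model(observation, actions, t)
+     return actions  # [1, horizon, action_dim] continuous motor commands
+ ```
+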
+ ## Training
+
+ For training Ο€β‚€, you can use the standard LeRobot training script with the appropriate configuration:
+
+ ```bash
+ python src/lerobot/scripts/train.py \
+     --dataset.repo_id=your_dataset \
+     --policy.type=pi0 \
+     --output_dir=./outputs/pi0_training \
+     --job_name=pi0_training \
+     --policy.pretrained_path=pepijn223/pi0_base \
+     --policy.repo_id=your_repo_id \
+     --policy.compile_model=true \
+     --policy.gradient_checkpointing=true \
+     --policy.dtype=bfloat16 \
+     --steps=3000 \
+     --policy.scheduler_decay_steps=3000 \
+     --policy.device=cuda \
+     --batch_size=32
+ ```
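+
+ Once training finishes, the resulting checkpoint can be loaded back for inference. The snippet below is a sketch that assumes LeRobot's `PI0Policy` wrapper with its `from_pretrained` and `select_action` interface; import paths, observation keys, and tensor shapes differ between LeRobot versions and datasets, so adjust them to your setup.
+
+ ```python
+ import torch
+ # Import path assumed from recent LeRobot layouts; older releases expose the
+ # policy under `lerobot.common.policies.pi0.modeling_pi0` instead.
+ from lerobot.policies.pi0.modeling_pi0 import PI0Policy
+
+ # Load the fine-tuned policy; the path can be a local output directory or a
+ # Hub repo id such as the `your_repo_id` placeholder used above.
+ policy = PI0Policy.from_pretrained("your_repo_id")
+ policy.eval()
+
+ # One observation step: camera image(s), proprioceptive state, and a language task.
+ # Keys, camera names, and dimensions are placeholders; match them to your dataset.
+ batch = {
+     "observation.images.top": torch.zeros(1, 3, 224, 224),
+     "observation.state": torch.zeros(1, 32),
+     "task": ["pick up the cube and place it in the bin"],
+ }
+ with torch.no_grad():
+     action = policy.select_action(batch)  # next action to send to the robot
+ ```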

  ## Conversion Details


  --precision float32
  ```

  ## Citation

  If you use this model, please cite the original OpenPI work: