Update README.md

README.md (CHANGED)
@@ -1,21 +1,44 @@
-# π₀
-
-## Model
-
-- **Model Type**: PI0
-- **Domain**: Base model (general purpose)
-- **Precision**: 32-bit floating point (fp32)
-- **Vision Model**: PaliGemma (gemma_2b)
-- **Action Expert**: gemma_300m
-

## Conversion Details

@@ -29,42 +52,6 @@ python examples/convert_jax_model_to_pytorch.py \
    --precision float32
```

-## Usage
-
-```python
-from openpi.models_pytorch.pi0_pytorch import PI0Pytorch
-import torch
-
-# Load the model
-model = PI0Pytorch.from_pretrained("pepijn223/pi0_base_fp32")
-
-# The model expects inputs in the format:
-# - images: torch.Tensor of shape [batch, height, width, channels]
-# - text: tokenized text prompts
-# - proprioceptive_state: robot state information (if applicable)
-```
-
-## Model Architecture
-
-The model consists of:
-1. **Vision Encoder**: PaliGemma-based vision processing
-2. **Language Encoder**: Text prompt understanding
-3. **Action Expert**: Specialized network for action prediction
-4. **Integration Layer**: Combines multimodal information for action output
-
-## Training Data
-
-This model was trained on robotics datasets appropriate for its domain:
-- **DROID models**: Trained on diverse robot manipulation data
-- **LIBERO models**: Trained on diverse tabletop manipulation scenarios
-- **Base models**: Trained on general robotics datasets
-
-## Limitations
-
-- Model performance depends on similarity between deployment and training environments
-- May require domain-specific fine-tuning for optimal performance
-- Action space must match the trained action dimension (32)
-
## Citation

If you use this model, please cite the original OpenPI work:
+# π₀ (Pi0)
+
+π₀ is a **Vision-Language-Action model for general robot control** from Physical Intelligence. The LeRobot implementation is adapted from their open-source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
+
+## Model Overview
+
+π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robots, which are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of robots across diverse tasks.
+
+### The Vision for Physical Intelligence
+
+As Physical Intelligence describes it, AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, yet human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess is an "easy" problem for AI, but folding a shirt or clearing a table requires solving some of the most difficult engineering problems ever conceived. π₀ is a first step toward artificial physical intelligence: letting users simply ask robots to perform any task they want, just as they can with large language models.
+
+### Architecture and Approach
+
+π₀ combines several key innovations:
+
+- **Flow Matching**: Augments a pre-trained VLM with continuous action outputs via flow matching, a variant of diffusion modeling (see the sketch after this list)
+- **Cross-Embodiment Training**: Trained on data from 8 distinct robot platforms, including UR5e, Bimanual UR5e, Franka, Bimanual Trossen, Bimanual ARX, Mobile Trossen, and Mobile Fibocom
+- **Internet-Scale Pre-training**: Inherits semantic knowledge from a pre-trained 3B-parameter Vision-Language Model
+- **High-Frequency Control**: Outputs motor commands at up to 50 Hz for real-time dexterous manipulation
+
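As a rough illustration of the flow-matching idea (and not the actual π₀ or LeRobot implementation), action generation amounts to integrating a learned velocity field from Gaussian noise toward an action chunk, conditioned on the current observation. The network, feature sizes, horizon, and step count below are hypothetical stand-ins chosen only so the snippet runs:

```python
import torch
import torch.nn as nn

ACTION_DIM, HORIZON, OBS_DIM, NUM_STEPS = 32, 50, 64, 10  # hypothetical sizes


class VelocityNet(nn.Module):
    """Toy stand-in for the action expert: predicts d(actions)/dt at flow time t."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HORIZON * ACTION_DIM + OBS_DIM + 1, 256),
            nn.GELU(),
            nn.Linear(256, HORIZON * ACTION_DIM),
        )

    def forward(self, actions: torch.Tensor, t: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        inp = torch.cat([actions.flatten(1), obs, t], dim=-1)
        return self.net(inp).view(-1, HORIZON, ACTION_DIM)


@torch.no_grad()
def sample_action_chunk(model: nn.Module, obs: torch.Tensor, steps: int = NUM_STEPS) -> torch.Tensor:
    """Euler-integrate the learned velocity field from noise (t=0) to actions (t=1)."""
    batch = obs.shape[0]
    actions = torch.randn(batch, HORIZON, ACTION_DIM)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((batch, 1), k * dt)
        actions = actions + dt * model(actions, t, obs)  # a <- a + dt * v(a, t | obs)
    return actions


obs = torch.randn(1, OBS_DIM)                    # stand-in for fused vision-language features
chunk = sample_action_chunk(VelocityNet(), obs)  # one chunk of future actions
print(chunk.shape)                               # torch.Size([1, 50, 32])
```

In π₀ itself, the role of this velocity field is played by the action expert, conditioned on the PaliGemma vision-language backbone, and a small number of integration steps is enough to produce a whole chunk of future actions.
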
+## Training
+
+For training π₀, you can use the standard LeRobot training script with the appropriate configuration:
+
+```bash
+python src/lerobot/scripts/train.py \
+    --dataset.repo_id=your_dataset \
+    --policy.type=pi0 \
+    --output_dir=./outputs/pi0_training \
+    --job_name=pi0_training \
+    --policy.pretrained_path=pepijn223/pi0_base \
+    --policy.repo_id=your_repo_id \
+    --policy.compile_model=true \
+    --policy.gradient_checkpointing=true \
+    --policy.dtype=bfloat16 \
+    --steps=3000 \
+    --policy.scheduler_decay_steps=3000 \
+    --policy.device=cuda \
+    --batch_size=32
+```
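In this configuration, `--policy.gradient_checkpointing=true` trades some extra compute for a much smaller activation-memory footprint and `--policy.dtype=bfloat16` roughly halves memory use relative to float32, while `--policy.compile_model=true` presumably enables `torch.compile` for faster training steps; these are conveniences rather than requirements and can be dropped if they cause issues on your setup.
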

## Conversion Details

    --precision float32
```

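The full converter is `examples/convert_jax_model_to_pytorch.py` in OpenPI. Purely as an illustration of what the `--precision float32` choice implies (and not the actual script, which also handles things like remapping parameter names between the JAX and PyTorch module trees), conversion at a given precision boils down to casting every weight to the requested dtype while building the PyTorch state dict; the helper below is a hypothetical sketch:

```python
import numpy as np
import torch


def cast_param_tree(params: dict, precision: str = "float32") -> dict:
    """Flatten a nested dict of JAX/NumPy arrays into torch tensors of the given precision."""
    dtype = {"float32": torch.float32, "bfloat16": torch.bfloat16}[precision]
    flat: dict = {}

    def visit(prefix: str, node) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                visit(f"{prefix}.{key}" if prefix else key, value)
        else:
            # np.asarray also accepts JAX arrays, which implement __array__.
            flat[prefix] = torch.from_numpy(np.asarray(node)).to(dtype)

    visit("", params)
    return flat


# Toy usage with a NumPy stand-in for a JAX parameter tree.
state_dict = cast_param_tree({"vision": {"w": np.ones((2, 2), dtype=np.float16)}})
print(state_dict["vision.w"].dtype)  # torch.float32
```
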
## Citation

If you use this model, please cite the original OpenPI work: