pepijn223 HF Staff committed
Commit ed5cf2c Β· verified Β· 1 Parent(s): 9cd682f

Update README.md

Files changed (1):
  1. README.md +36 -49

README.md CHANGED
@@ -1,21 +1,44 @@
- # Ο€β‚€ - Base
-
- This is a PyTorch version of the PI0 `pi0_base model`, converted from the original JAX/Flax implementation.
-
- ## Model Details
-
- - **Architecture**: PI0 (Vision-Language-Action model)
- - **Model Type**: PI0
- - **Domain**: Base model (general purpose)
- - **Precision**: 32-bit floating point (fp32)
- - **Vision Model**: PaliGemma (gemma_2b)
- - **Action Expert**: gemma_300m
-
- ## Key Features
-
- - **Vision-Language-Action**: Multimodal model combining vision, language, and action
- - **PaliGemma Backbone**: Leverages PaliGemma for vision-language understanding
- - **Continuous State Input**: Direct continuous state input processing

  ## Conversion Details

@@ -29,42 +52,6 @@ python examples/convert_jax_model_to_pytorch.py \
  --precision float32
  ```

- ## Usage
-
- ```python
- from openpi.models_pytorch.pi0_pytorch import PI0Pytorch
- import torch
-
- # Load the model
- model = PI0Pytorch.from_pretrained("pepijn223/pi0_base_fp32")
-
- # The model expects inputs in the format:
- # - images: torch.Tensor of shape [batch, height, width, channels]
- # - text: tokenized text prompts
- # - proprioceptive_state: robot state information (if applicable)
- ```
-
- ## Model Architecture
-
- The model consists of:
- 1. **Vision Encoder**: PaliGemma-based vision processing
- 2. **Language Encoder**: Text prompt understanding
- 3. **Action Expert**: Specialized network for action prediction
- 4. **Integration Layer**: Combines multimodal information for action output
-
- ## Training Data
-
- This model was trained on robotics datasets appropriate for its domain:
- - **DROID models**: Trained on diverse robot manipulation data
- - **LIBERO models**: Trained on diverse tabletop manipulation scenarios
- - **Base models**: Trained on general robotics datasets
-
- ## Limitations
-
- - Model performance depends on similarity between deployment and training environments
- - May require domain-specific fine-tuning for optimal performance
- - Action space must match the trained action dimension (32)
-
  ## Citation

  If you use this model, please cite the original OpenPI work:
 
1
+ # Ο€β‚€ (Pi0)
+
+ Ο€β‚€ is a **Vision-Language-Action model for general robot control** from Physical Intelligence. The LeRobot implementation is adapted from their open-source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
+
+ ## Model Overview
+
+ Ο€β‚€ is the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robots, which are narrow specialists programmed for repetitive motions, Ο€β‚€ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
+
+ ### The Vision for Physical Intelligence
+
+ As described by Physical Intelligence, AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, yet human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess is an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. Ο€β‚€ is a first step toward artificial physical intelligence that lets users simply ask robots to perform any task they want, just as they can with large language models.
+
+ ### Architecture and Approach
+
+ Ο€β‚€ combines several key innovations:
+
+ - **Flow Matching**: Augments the pre-trained VLM with continuous action outputs via flow matching, a variant of diffusion modeling (see the sketch after this list)
+ - **Cross-Embodiment Training**: Trained on data from 8 distinct robot platforms, including UR5e, Bimanual UR5e, Franka, Bimanual Trossen, Bimanual ARX, Mobile Trossen, and Mobile Fibocom
+ - **Internet-Scale Pre-training**: Inherits semantic knowledge from a pre-trained 3B-parameter Vision-Language Model
+ - **High-Frequency Control**: Outputs motor commands at up to 50 Hz for real-time dexterous manipulation
+
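+ The flow-matching head is what turns the VLM backbone into a controller, so a small sketch of the decoding loop may help. This is an illustrative outline of the general technique, not code from OpenPI or LeRobot: `velocity_model`, the 10 integration steps, and the 50-step action horizon are placeholder assumptions.
+
+ ```python
+ import torch
+
+ def sample_action_chunk(velocity_model, observation, horizon=50, action_dim=32, num_steps=10):
+     """Decode a continuous action chunk by integrating a learned velocity field.
+
+     `velocity_model(observation, actions, t)` stands in for the action expert:
+     given a noisy action chunk and a flow time t in [0, 1], it predicts the
+     velocity that moves the chunk toward the action distribution.
+     """
+     # Start from Gaussian noise over the whole chunk of future actions.
+     actions = torch.randn(1, horizon, action_dim)
+     dt = 1.0 / num_steps
+     # Euler integration of dA/dt = v_theta(A, t | observation) from t=0 to t=1.
+     for step in range(num_steps):
+         t = torch.full((1,), step * dt)
+         actions = actions + dt * velocity_model(observation, actions, t)
+     return actions  # [1, horizon, action_dim] continuous motor commands
+ ```
+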
+ ## Training
+
+ For training Ο€β‚€, you can use the standard LeRobot training script with the appropriate configuration:
+
+ ```bash
+ python src/lerobot/scripts/train.py \
+     --dataset.repo_id=your_dataset \
+     --policy.type=pi0 \
+     --output_dir=./outputs/pi0_training \
+     --job_name=pi0_training \
+     --policy.pretrained_path=pepijn223/pi0_base \
+     --policy.repo_id=your_repo_id \
+     --policy.compile_model=true \
+     --policy.gradient_checkpointing=true \
+     --policy.dtype=bfloat16 \
+     --steps=3000 \
+     --policy.scheduler_decay_steps=3000 \
+     --policy.device=cuda \
+     --batch_size=32
+ ```
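+
+ Once training finishes, the resulting checkpoint can be loaded back for inference. The snippet below is a sketch that assumes LeRobot's `PI0Policy` wrapper with its `from_pretrained` and `select_action` interface; import paths, observation keys, and tensor shapes differ between LeRobot versions and datasets, so adjust them to your setup.
+
+ ```python
+ import torch
+ # Import path assumed from recent LeRobot layouts; older releases expose the
+ # policy under `lerobot.common.policies.pi0.modeling_pi0` instead.
+ from lerobot.policies.pi0.modeling_pi0 import PI0Policy
+
+ # Load the fine-tuned policy; the path can be a local output directory or a
+ # Hub repo id such as the `your_repo_id` placeholder used above.
+ policy = PI0Policy.from_pretrained("your_repo_id")
+ policy.eval()
+
+ # One observation step: camera image(s), proprioceptive state, and a language task.
+ # Keys, camera names, and dimensions are placeholders; match them to your dataset.
+ batch = {
+     "observation.images.top": torch.zeros(1, 3, 224, 224),
+     "observation.state": torch.zeros(1, 32),
+     "task": ["pick up the cube and place it in the bin"],
+ }
+ with torch.no_grad():
+     action = policy.select_action(batch)  # next action to send to the robot
+ ```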

  ## Conversion Details


  --precision float32
  ```

  ## Citation

  If you use this model, please cite the original OpenPI work: