# MonkeyOCR-MLX: Apple Silicon Optimized OCR
A high-performance OCR application optimized for Apple Silicon with MLX-VLM acceleration, featuring advanced document layout analysis and intelligent text extraction.
## Key Features
- MLX-VLM Optimization: Native Apple Silicon acceleration using the MLX framework
- 3x Faster Processing: Compared to standard PyTorch on M-series chips
- Advanced AI: Powered by the Qwen2.5-VL model with specialized layout analysis
- Multi-format Support: PDF, PNG, JPG, JPEG with intelligent structure detection
- Modern Web Interface: Clean Gradio interface for easy document processing
- Batch Processing: Efficient handling of multiple documents
- High Accuracy: Specialized for complex financial documents and tables
- 100% Private: All processing happens locally on your Mac
## Performance Benchmarks
Test: Complex Financial Document (Tax Form)
- MLX-VLM: ~15-18 seconds
- Standard PyTorch: ~25-30 seconds
- CPU Only: ~60-90 seconds
MacBook M4 Pro Performance:
- Model loading: ~1.7s
- Text extraction: ~15s
- Table structure: ~18s
- Memory usage: ~13GB peak
## Installation
### Prerequisites
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.11+
- 16GB+ RAM (32GB+ recommended for large documents)
### Quick Setup
Clone the repository:
```bash
git clone https://huggingface.co/Jimmi42/MonkeyOCR-Apple-Silicon
cd MonkeyOCR-Apple-Silicon
```
Run the automated setup script:
```bash
chmod +x setup.sh
./setup.sh
```
This script will automatically:
- Download MonkeyOCR from the official GitHub repository
- Apply MLX-VLM optimization patches for Apple Silicon
- Enable smart backend auto-selection (MLX/LMDeploy/transformers)
- Install UV package manager if needed
- Set up virtual environment with Python 3.11
- Install all dependencies including MLX-VLM
- Download required model weights
- Configure optimal backend for your hardware
Alternative manual installation:
```bash
# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Download MonkeyOCR
git clone https://github.com/Yuliang-Liu/MonkeyOCR.git MonkeyOCR

# Install dependencies (includes mlx-vlm)
uv sync

# Download models
cd MonkeyOCR && python tools/download_model.py && cd ..
```
## Usage
### Web Interface (Recommended)
```bash
# Activate virtual environment
source .venv/bin/activate  # or `uv shell`

# Start the web app
python app.py
```
Access the interface at http://localhost:7861
### Command Line
```bash
python main.py path/to/document.pdf
```
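For batch runs (see Batch Processing in the features list), a plain shell loop over the CLI works. This is a minimal sketch assuming `main.py` accepts a single input path as shown above; `documents/` is just an example folder name:

```bash
# Process every PDF in a folder, one document per invocation
for f in documents/*.pdf; do
  python main.py "$f"
done
```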
## Configuration
### Smart Backend Selection (Default)
The app automatically detects your hardware and selects the optimal backend:
```yaml
# model_configs_mps.yaml
device: mps
chat_config:
  backend: auto        # Smart auto-selection
  batch_size: 1
  max_new_tokens: 256
  temperature: 0.0
```
Auto-Selection Logic:
- Apple Silicon (MPS) → MLX-VLM (3x faster)
- CUDA GPU → LMDeploy (optimized for NVIDIA)
- CPU/Fallback → Transformers (universal compatibility)
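The same order can be sketched in a few lines of Python. This mirrors the selection logic described above but is illustrative rather than the actual MonkeyOCR implementation (the function name is made up):

```python
import importlib.util

import torch


def pick_backend() -> str:
    """Mirror the auto-selection order: MPS -> MLX, CUDA -> LMDeploy, else Transformers."""
    if torch.backends.mps.is_available() and importlib.util.find_spec("mlx_vlm"):
        return "mlx"          # Apple Silicon with mlx-vlm installed
    if torch.cuda.is_available():
        return "lmdeploy"     # NVIDIA GPU
    return "transformers"     # universal CPU fallback


print(f"Auto-selected backend: {pick_backend()}")
```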
### Performance Backends
| Backend | Speed | Memory | Best For | Auto-Selected |
|---|---|---|---|---|
| `auto` | Matches selected backend | Varies | All systems (recommended) | Default |
| `mlx` | Fastest (~3x) | Low | Apple Silicon | Auto for MPS |
| `lmdeploy` | Fast | Moderate | CUDA systems | Auto for CUDA |
| `transformers` | Baseline | Low | Universal fallback | Auto for CPU |
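To pin a backend instead of relying on auto-selection (for example, to force MLX on Apple Silicon), set it explicitly in `model_configs_mps.yaml`. Everything except the `backend` value mirrors the default config shown above:

```yaml
# model_configs_mps.yaml -- force the MLX-VLM backend
device: mps
chat_config:
  backend: mlx         # skip auto-detection and use MLX-VLM directly
  batch_size: 1
  max_new_tokens: 256
  temperature: 0.0
```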
## Model Architecture
### Core Components
- Layout Detection: DocLayout-YOLO for document structure analysis
- Vision-Language Model: Qwen2.5-VL with MLX optimization
- Layout Reading: LayoutReader for reading order optimization
- MLX Framework: Native Apple Silicon acceleration
### Apple Silicon Optimizations
- Metal Performance Shaders: Direct GPU acceleration
- Unified Memory: Optimized memory access patterns
- Neural Engine: Utilizes Apple's dedicated AI hardware
- Float16 Precision: Optimal speed/accuracy balance
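A quick, self-contained way to confirm that the MPS device and the float16 path are usable on your machine (plain PyTorch, independent of MonkeyOCR):

```python
import torch

# Use the Metal (MPS) backend when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
# float16 is the precision the MLX path targets; stay in float32 on CPU.
dtype = torch.float16 if device == "mps" else torch.float32

a = torch.randn(1024, 1024, dtype=dtype, device=device)
b = torch.randn(1024, 1024, dtype=dtype, device=device)
print(f"matmul on {device} with {dtype}: {(a @ b).shape}")
```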
## Perfect For
Document Types:
- Financial Documents: Tax forms, invoices, statements
- Legal Documents: Contracts, forms, certificates
- Academic Papers: Research papers, articles
- Business Documents: Reports, presentations, spreadsheets
Advanced Features:
- Complex table extraction with highlighted cells
- Multi-column layouts and mixed content
- Mathematical formulas and equations
- Structured data output (Markdown, JSON)
- Batch processing for multiple files
## Troubleshooting
### MLX-VLM Issues
```bash
# Test MLX-VLM availability
python -c "import mlx_vlm; print('MLX-VLM available')"

# Check if auto backend selection is working
python -c "
from MonkeyOCR.magic_pdf.model.custom_model import MonkeyOCR
model = MonkeyOCR('model_configs_mps.yaml')
print(f'Selected backend: {type(model.chat_model).__name__}')
"
```
### Performance Issues
```bash
# Check MPS availability
python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"

# Monitor memory usage during processing
top -pid $(pgrep -f "python app.py")
```
### Common Solutions
Patches Not Applied:
- Re-run `./setup.sh` to reapply patches
- Check that the `MonkeyOCR` directory exists and has our modifications
- Verify the `MonkeyChat_MLX` class exists in `MonkeyOCR/magic_pdf/model/custom_model.py`
Wrong Backend Selected:
- Check hardware detection with `python -c "import torch; print(torch.backends.mps.is_available())"`
- Verify MLX-VLM is installed: `pip install mlx-vlm`
- Use `backend: mlx` in the config to force the MLX backend
Slow Performance:
- Ensure auto-selection chose MLX backend on Apple Silicon
- Check Activity Monitor for MPS GPU usage
- Verify `backend: auto` in `model_configs_mps.yaml`
Memory Issues:
- Reduce image resolution before processing
- Close other memory-intensive applications
- Reduce batch_size to 1 in config
Port Already in Use:
```bash
GRADIO_SERVER_PORT=7862 python app.py
```
## Project Structure
```text
MonkeyOCR-MLX/
├── app.py                    # Gradio web interface
├── main.py                   # CLI interface
├── model_configs_mps.yaml    # MLX-optimized config
├── requirements.txt          # Dependencies (includes mlx-vlm)
├── torch_patch.py            # Compatibility patches
├── MonkeyOCR/                # Core AI models
│   └── magic_pdf/            # Processing engine
├── .gitignore                # Git ignore rules
└── README.md                 # This file
```
## What's New in the MLX Version
- Smart Patching System: Automatically applies MLX-VLM optimizations to the official MonkeyOCR
- Intelligent Backend Selection: Auto-detects hardware and selects the optimal backend
- 3x Faster Processing: MLX-VLM acceleration on Apple Silicon
- Better Memory Efficiency: Optimized for the unified memory architecture
- Improved Accuracy: Enhanced table and structure detection
- Zero Configuration: Works out of the box with smart defaults
- Performance Monitoring: Built-in timing and metrics
- Latest Fix (June 2025): Resolved MLX-VLM prompt formatting for optimal OCR output
- Always Up to Date: Uses the official MonkeyOCR repository with our patches applied
## Technical Implementation
### Smart Patching System
- Dynamic Code Injection: Automatically adds MLX-VLM class to official MonkeyOCR
- Backend Selection Logic: Patches smart hardware detection into initialization
- Zero Maintenance: Always uses latest official MonkeyOCR with our optimizations
- Seamless Integration: Patches are applied transparently during setup
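A minimal sketch of what idempotent patching looks like, assuming the MLX backend code lives in a local file (the `patches/mlx_backend.py` path below is hypothetical; `setup.sh` handles the real locations):

```python
from pathlib import Path

TARGET = Path("MonkeyOCR/magic_pdf/model/custom_model.py")
PATCH = Path("patches/mlx_backend.py")  # hypothetical file containing MonkeyChat_MLX


def apply_patch() -> None:
    source = TARGET.read_text()
    # Idempotency check: only patch if the MLX class is not already present.
    if "MonkeyChat_MLX" in source:
        print("Patch already applied, skipping.")
        return
    # Append the MLX backend so the official upstream code stays untouched above it.
    TARGET.write_text(source + "\n\n" + PATCH.read_text())
    print(f"Patched {TARGET}")


if __name__ == "__main__":
    apply_patch()
```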
### MLX-VLM Backend (`MonkeyChat_MLX`)
- Direct MLX framework integration
- Optimized for Apple's Metal Performance Shaders
- Native unified memory management
- Specialized prompt processing for OCR tasks
- Fixed prompt formatting for optimal output quality
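For reference, the upstream mlx-vlm Python API that a backend like this builds on follows the pattern below. The model id is an illustrative mlx-community conversion and argument details can differ between mlx-vlm releases, so treat this as a sketch rather than the project's actual backend code:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Illustrative 4-bit MLX conversion of a Qwen2.5-VL instruct model.
model_path = "mlx-community/Qwen2.5-VL-7B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Format an OCR-style prompt with the model's chat template.
prompt = "Extract all text from this document as Markdown."
formatted = apply_chat_template(processor, config, prompt, num_images=1)

# Run the model on a local page image and print the decoded output.
output = generate(model, processor, formatted, ["page_1.png"], verbose=False)
print(output)
```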
### Intelligent Fallback System
- Hardware Detection: MPS → MLX, CUDA → LMDeploy, CPU → Transformers
- Graceful Degradation: Falls back to compatible backends if preferred unavailable
- Cross-Platform: Maintains compatibility across all systems
- Error Recovery: Automatic fallback on initialization failures
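The graceful-degradation behaviour boils down to trying backends in preference order and moving on when one cannot start. A minimal sketch (the factory names are placeholders, not MonkeyOCR APIs):

```python
from typing import Callable


def init_chat_backend(candidates: list[tuple[str, Callable[[], object]]]) -> object:
    """Try each (name, factory) pair in order and return the first backend that starts."""
    for name, factory in candidates:
        try:
            backend = factory()
            print(f"Using backend: {name}")
            return backend
        except Exception as exc:  # ImportError, missing hardware, init failure, ...
            print(f"{name} unavailable ({exc}); falling back")
    raise RuntimeError("No chat backend could be initialized")
```

A caller would pass something like `[("mlx", make_mlx), ("lmdeploy", make_lmdeploy), ("transformers", make_hf)]`, where the constructors are whatever the host application defines.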
## Contributing
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Apple MLX Team: For the incredible MLX framework
- MonkeyOCR Team: For the foundational OCR model
- Qwen Team: For the excellent Qwen2.5-VL model
- Gradio Team: For the beautiful web interface
- MLX-VLM Contributors: For the MLX vision-language integration
## Support
- Bug Reports: Create an issue
- Discussions: Hugging Face Discussions
- Documentation: Check the troubleshooting section above
- Star the repository if you find it useful!
Supercharged for Apple Silicon • Made with ❤️ for the MLX Community
Experience the future of OCR with native Apple Silicon optimization