---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
datasets:
- presightai/arabic_doc_to_markdown
---
# Arabic Image-to-Markdown LoRA Adapter for Qwen2.5-VL-7B-Instruct
---
### Model Card for Model ID
`presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora`
---
## Model Details
- **Base Model**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Library**: PEFT (LoRA adapters)
- **Developed by**: Presight AI Technologies LLC
- **Finetuned from**: Qwen/Qwen2.5-VL-7B-Instruct
- **Languages**: Arabic (primary), English (secondary instructions)
- **License**: Apache 2.0 (inherits from base model)
- **Task**: Image-to-Markdown conversion on Arabic document images
---
## Model Sources
- **Repository**: https://huggingface.co/presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora
- **Demo**: Coming soon (contact us for early testing)
---
## Model Description
This LoRA adapter fine-tunes the Qwen2.5-VL-7B-Instruct vision-language model to:
- Take **Arabic document images** as input.
- Generate **Markdown-formatted text** as output, preserving tables, sections, and layout.
It was trained on ~40,000 crawled Arabic government and institutional PDFs, preprocessed into image–markdown pairs using an in-house LLM pipeline.
---
## Uses
### Direct Use
- Converting scanned Arabic documents (PDF pages, images) to structured Markdown (a page-rendering sketch follows this list).
- Automating document digitization and layout reconstruction for Arabic corpora.
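The adapter consumes page images rather than raw PDFs. Below is a minimal sketch of rendering PDF pages to images before inference, assuming `pdf2image` (with Poppler) is installed; it is not part of this repo's pinned requirements:

```python
from pdf2image import convert_from_path  # assumption: pdf2image + Poppler installed

# Render each page of a scanned PDF to an RGB image suitable for the model.
pages = convert_from_path("document.pdf", dpi=200)  # hypothetical input file
for i, page in enumerate(pages):
    page.convert("RGB").save(f"page_{i:03d}.jpg", "JPEG")
```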
### Downstream Use
- Fine-tuning on other multilingual document types.
- Integrating into OCR + LLM pipelines for Arabic-language archives.
### Out-of-Scope Use
- Any medical, legal, or critical decision-making without human verification.
- Generating content in languages or formats not covered by the training set.
---
## ⚠ Bias, Risks, and Limitations
- Markdown outputs are LLM-generated and **may not perfectly match original image layouts**.
- OCR/LLM hallucinations may occur on poor-quality scans or unusual document types.
- The model was trained on crawled public data, which may contain inherent biases.
We **recommend** human review in all production deployments.
---
## Training Details
- **Dataset**: 40,000+ Arabic document pages, filtered for tabular and structured layouts.
- **Preprocessing**: PDFs split into individual images, paired with LLM-generated markdown annotations.
- **Training framework**: PEFT LoRA + TRL SFTTrainer.
- **Hyperparameters** (mapped to a configuration sketch after this list):
  - LoRA rank: 8
  - LoRA alpha: 16
  - Dropout: 0.05
  - Optimizer: fused AdamW
  - Batch size: 2 per device (gradient accumulation: 8)
  - Epochs: 2
  - Learning rate: 2e-4
  - Max training sequence length: 1024 tokens
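For reference, a minimal sketch of how these hyperparameters map onto PEFT/TRL configuration objects. The `target_modules` list, `output_dir`, and `bf16` flag are illustrative assumptions (the adapter's actual target modules are recorded in its `adapter_config.json`); dataset loading and the vision collator are omitted:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings from the hyperparameter list above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption for illustration
    task_type="CAUSAL_LM",
)

# TRL SFT settings from the hyperparameter list above.
sft_config = SFTConfig(
    output_dir="qwen2.5vl-arabic-md-lora",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    max_seq_length=1024,  # 1024-token training length (argument name may vary across TRL versions)
    bf16=True,            # assumption: mixed precision on A100
)
```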
---
## Evaluation
- **Testing data**: Held-out 10% of the Arabic document images
- **Metrics**: Eval loss (best checkpoint selected)
- **Results**: Consistent markdown structure generation; minor hallucinations in extreme cases
---
## Environmental Impact
- **Hardware**: A100 80GB GPU
- **Hours used**: ~5–6 hours
- **Cloud provider**: Presight internal cloud
- **Region**: UAE
- **Carbon estimate**: [Pending calculation]
---
## Technical Specifications
- **Model type**: Vision-Language Causal LM with LoRA adapter
- **Framework**: PEFT 0.14.0, Transformers, TRL SFTTrainer
- **Compute infra**: Multi-GPU A100 cluster (single node used for this run)
---
## How to Get Started
### Install requirements

```bash
pip install torch==2.6.0 torchvision==0.21.0
```

Remaining pinned dependencies:

```text
accelerate==1.7.0
av==14.4.0
certifi==2025.4.26
charset-normalizer==3.4.2
einops==0.8.1
filelock==3.18.0
flash_attn==2.7.4.post1
fsspec==2025.5.1
hf-xet==1.1.2
huggingface-hub==0.32.2
idna==3.10
Jinja2==3.1.6
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
ninja==1.11.1.4
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
optimum==1.25.3
packaging==25.0
peft==0.14.0
pillow==11.2.1
psutil==7.0.0
PyYAML==6.0.2
qwen-vl-utils==0.0.8
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
sympy==1.13.1
tokenizers==0.21.1
tqdm==4.67.1
transformers==4.49.0
triton==3.2.0
typing_extensions==4.13.2
urllib3==2.4.0
```
### Inference Example
```python
import argparse
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # optional: pin inference to a single GPU


def load_model_with_lora(model_repo):
    """Load the base model with the LoRA adapter attached (requires `peft`)."""
    print(f"Loading model from: {model_repo}")
    # Passing the adapter repo lets transformers resolve the base model and
    # attach the LoRA weights via its PEFT integration.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_repo,
        attn_implementation="flash_attention_2",
        device_map="auto",
        torch_dtype=torch.float16,
        use_cache=True,
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)
    print("Model and processor loaded.")
    return model, processor


@torch.no_grad()
def inference(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Build the chat-formatted input sample.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]

    # Process image and prompt separately.
    image_input = process_vision_info(messages)[0]
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=text_input,
        images=image_input,
        return_tensors="pt",
        padding=True,
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
    )

    # Decode only the newly generated tokens, skipping the prompt.
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(generated[0], skip_special_tokens=True)
    return decoded.strip()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Qwen2.5-VL Inference Script")
    parser.add_argument(
        "--model_repo",
        type=str,
        required=True,
        help="Hugging Face repo, e.g., presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora",
    )
    parser.add_argument("--image_path", type=str, required=True, help="Path to input image (e.g., input.jpg)")
    parser.add_argument("--question", type=str, default="Extract in markdown <image>", help="Question/prompt")
    args = parser.parse_args()

    model, processor = load_model_with_lora(args.model_repo)
    result = inference(model, processor, args.image_path, args.question)
    print("\n=== Model Output ===")
    print(result)
```
### Command
```bash
python ocr.py --model_repo presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora --image_path input.jpg
```
## Inference API and Smart Chunking for Large Token Sizes
Because of the model's token-size limits, we built a **FastAPI service** (sketched below) that:
- Automatically resizes large input images to safe dimensions.
- Chunks images horizontally to fit maximum token/image limits.
- Runs batch inference when possible or falls back to per-chunk inference.
- Cleans and merges chunk outputs into a final full-page Markdown.
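The FastAPI service itself is not published with this adapter. As an illustration only, here is a minimal sketch of the resize-and-chunk step; `MAX_WIDTH` and `MAX_CHUNK_HEIGHT` are assumed limits, not the service's actual values:

```python
from PIL import Image

MAX_WIDTH = 1536         # assumption: "safe" page width before resizing
MAX_CHUNK_HEIGHT = 1024  # assumption: per-chunk height that fits the image-token budget


def chunk_page(image_path):
    """Resize an oversized page and split it into horizontal strips."""
    img = Image.open(image_path).convert("RGB")

    # Downscale very wide pages while preserving aspect ratio.
    if img.width > MAX_WIDTH:
        new_h = int(img.height * MAX_WIDTH / img.width)
        img = img.resize((MAX_WIDTH, new_h), Image.Resampling.LANCZOS)

    # Crop the page into horizontal strips no taller than MAX_CHUNK_HEIGHT.
    chunks = []
    for top in range(0, img.height, MAX_CHUNK_HEIGHT):
        bottom = min(top + MAX_CHUNK_HEIGHT, img.height)
        chunks.append(img.crop((0, top, img.width, bottom)))
    return chunks
```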
---
### Behind the Scenes
- Pads chunks to a uniform size for batching (see the sketch below).
- Uses Flash Attention and fp16 for fast decoding.
- Applies post-cleanup to strip redundant tokens, system prompts, and format artifacts.
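A minimal sketch of the chunk-padding step, assuming white fill and top-left alignment (both are illustrative choices, not necessarily the service's):

```python
from PIL import Image


def pad_to_uniform(chunks, fill=(255, 255, 255)):
    """Pad image chunks to a common width/height so they can be batched together."""
    max_w = max(c.width for c in chunks)
    max_h = max(c.height for c in chunks)
    padded = []
    for c in chunks:
        canvas = Image.new("RGB", (max_w, max_h), fill)
        canvas.paste(c, (0, 0))  # top-left aligned; the remainder stays white
        padded.append(canvas)
    return padded
```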