---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
datasets:
- presightai/arabic_doc_to_markdown
---

# Arabic Image-to-Markdown LoRA Adapter for Qwen2.5-VL-7B-Instruct

---

### Model ID

`presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora`

---

## Model Details

- **Base Model**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Library**: PEFT (LoRA adapters)
- **Developed by**: Presight AI Technologies LLC
- **Finetuned from**: Qwen/Qwen2.5-VL-7B-Instruct
- **Languages**: Arabic (primary), English (secondary, used for instructions)
- **License**: Apache 2.0 (inherited from the base model)
- **Task**: Image-to-Markdown conversion of Arabic document images

---

## Model Sources

- **Repository**: https://huggingface.co/presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora
- **Demo**: Coming soon (contact us for early testing)

---

## Model Description

This LoRA adapter fine-tunes the Qwen2.5-VL-7B-Instruct vision-language model to:

- Take **Arabic document images** as input.
- Generate **Markdown-formatted text** as output, preserving tables, sections, and layout.

It was trained on ~40,000 crawled Arabic government and institutional PDFs, preprocessed into image–markdown pairs using an in-house LLM pipeline.

---

## Uses

### Direct Use

- Converting scanned Arabic documents (PDF pages, images) into structured Markdown.
- Automating document digitization and layout reconstruction for Arabic corpora.

### Downstream Use

- Fine-tuning on other multilingual document types.
- Integration into OCR + LLM pipelines for Arabic-language archives.

### Out-of-Scope Use

- Medical, legal, or other critical decision-making without human verification.
- Generating content in languages or formats not covered by the training set.

---

## ⚠ Bias, Risks, and Limitations

- Markdown outputs are LLM-generated and **may not perfectly match original image layouts**.
- OCR/LLM hallucinations may occur on poor-quality scans or unusual document types.
- The model was trained on crawled public data, which may contain inherent biases.

We **recommend** human review in all production deployments.

---

## Training Details

- **Dataset**: 40,000+ Arabic document pages, filtered for tabular and structured layouts.
- **Preprocessing**: PDFs split into per-page images, each paired with an LLM-generated Markdown annotation.
- **Training framework**: PEFT (LoRA) + TRL `SFTTrainer`.
- **Hyperparameters** (a minimal configuration sketch follows this list):
  - LoRA rank: 8
  - LoRA alpha: 16
  - LoRA dropout: 0.05
  - Optimizer: fused AdamW
  - Per-device batch size: 2 (gradient accumulation: 8)
  - Epochs: 2
  - Learning rate: 2e-4
  - Max training sequence length: 1,024 tokens

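The full training script is not published in this card; the snippet below is only a sketch of how the hyperparameters above map onto `peft` and `trl` (assuming a recent `trl` release). The `target_modules` and `output_dir` values are illustrative assumptions, not the settings actually used for this adapter.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter settings from the hyperparameter list above.
# NOTE: target_modules is an assumption; the modules actually adapted
# for this release are not documented in this card.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# SFTTrainer arguments mirroring the listed values.
training_args = SFTConfig(
    output_dir="qwen25vl-arabic-markdown-lora",  # hypothetical path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    max_seq_length=1024,
)
```
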
---

## Evaluation

- **Testing data**: Held-out 10% split of the Arabic document images.
- **Metrics**: Evaluation loss (best checkpoint selected).
- **Results**: Consistent Markdown structure generation; minor hallucinations in extreme cases.

---

## Environmental Impact

- **Hardware**: A100 80GB GPU
- **Hours used**: ~5–6 hours
- **Cloud provider**: Presight internal cloud
- **Region**: UAE
- **Carbon estimate**: Pending calculation

---

## Technical Specifications

- **Model type**: Vision-language causal LM with LoRA adapter
- **Frameworks**: PEFT 0.14.0, Transformers, TRL `SFTTrainer`
- **Compute infrastructure**: Multi-GPU A100 cluster (single node used for this run)

---

## How to Get Started

### Install requirements

```bash
pip install torch==2.6.0 torchvision==0.21.0
```

Pinned dependencies used for this release:

```text
accelerate==1.7.0
av==14.4.0
certifi==2025.4.26
charset-normalizer==3.4.2
einops==0.8.1
filelock==3.18.0
flash_attn==2.7.4.post1
fsspec==2025.5.1
hf-xet==1.1.2
huggingface-hub==0.32.2
idna==3.10
Jinja2==3.1.6
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
ninja==1.11.1.4
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
optimum==1.25.3
packaging==25.0
peft==0.14.0
pillow==11.2.1
psutil==7.0.0
PyYAML==6.0.2
qwen-vl-utils==0.0.8
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
sympy==1.13.1
tokenizers==0.21.1
tqdm==4.67.1
transformers==4.49.0
triton==3.2.0
typing_extensions==4.13.2
urllib3==2.4.0
```

### Inference Example

```python
import argparse
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # select the GPU to run on

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info


def load_model_with_lora(model_repo):
    print(f"Loading model from: {model_repo}")
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_repo,
        attn_implementation="flash_attention_2",
        device_map="auto",
        torch_dtype=torch.float16,
        use_cache=True,
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)
    print("Model and processor loaded.")
    return model, processor


@torch.no_grad()
def inference(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Build a single-turn chat message containing the image and the prompt
    example = {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }

    # Process the image and the chat-formatted prompt separately
    image_input = process_vision_info([example])[0]
    text_input = processor.apply_chat_template(
        [example], tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=text_input,
        images=image_input,
        return_tensors="pt",
        padding=True,
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
    )

    # Drop the prompt tokens so only the newly generated Markdown is decoded
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    decoded = processor.decode(generated[0], skip_special_tokens=True)
    return decoded.strip()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Qwen2.5-VL Inference Script")
    parser.add_argument("--model_repo", type=str, required=True,
                        help="Hugging Face repo, e.g., presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora")
    parser.add_argument("--image_path", type=str, required=True,
                        help="Path to input image (e.g., input.jpg)")
    parser.add_argument("--question", type=str, default="Extract in markdown <image>",
                        help="Question/prompt")
    args = parser.parse_args()

    model, processor = load_model_with_lora(args.model_repo)
    result = inference(model, processor, args.image_path, args.question)

    print("\n=== Model Output ===")
    print(result)
```

## Command

Save the inference example above as `ocr.py`, then run:

```bash
python ocr.py --model_repo presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora --image_path input.jpg
```

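The script above relies on the built-in PEFT integration in `transformers`, which resolves the base model from the adapter's `adapter_config.json`. If you prefer to attach the adapter explicitly, here is a minimal sketch using `peft` directly:

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the frozen base model first.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(
    base,
    "presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)

# Optionally merge the adapter into the base weights for slightly faster decoding:
# model = model.merge_and_unload()
```
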
## Inference API and Smart Chunking for Large Token Counts

Because this relatively small model can only handle a limited number of image tokens per request, we built a **FastAPI service** that:

- Automatically resizes large input images to safe dimensions.
- Chunks images horizontally to fit the maximum token/image limits (a rough sketch of this step follows the list).
- Runs batch inference when possible, falling back to per-chunk inference otherwise.
- Cleans and merges chunk outputs into a final full-page Markdown.

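The chunking service itself is not published with this adapter. The sketch below shows one way the horizontal-chunking step could be implemented with Pillow; the chunk height and overlap are illustrative assumptions, not the service's actual settings.

```python
from PIL import Image


def chunk_image_horizontally(image, max_chunk_height=1280, overlap=64):
    """Split a tall page image into horizontal strips that fit a per-image token budget.

    A small vertical overlap between strips reduces the chance of cutting a
    text line or table row exactly at a chunk boundary.
    """
    width, height = image.size
    if height <= max_chunk_height:
        return [image]

    chunks = []
    top = 0
    while top < height:
        bottom = min(top + max_chunk_height, height)
        chunks.append(image.crop((0, top, width, bottom)))
        if bottom == height:
            break
        top = bottom - overlap  # step back slightly so strips overlap
    return chunks


# Example: split a page scan before sending each strip through the model.
page = Image.open("input.jpg").convert("RGB")
strips = chunk_image_horizontally(page)
print(f"{len(strips)} chunk(s): {[s.size for s in strips]}")
```
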
---

### Behind the Scenes

- Pads chunks to a uniform size for batching (see the sketch after this list).
- Uses Flash Attention and fp16 for fast decoding.
- Applies post-cleanup to strip redundant tokens, system prompts, and formatting artifacts.

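As a rough illustration of the padding step, a minimal sketch that pads every chunk to the largest width and height in the batch on a white background (the service's exact padding strategy is not documented here):

```python
from PIL import Image


def pad_chunks_to_uniform_size(chunks, fill=(255, 255, 255)):
    """Pad each chunk so the processor produces tensors of identical shape
    for batched inference."""
    max_w = max(c.width for c in chunks)
    max_h = max(c.height for c in chunks)
    padded = []
    for chunk in chunks:
        canvas = Image.new("RGB", (max_w, max_h), fill)
        canvas.paste(chunk, (0, 0))  # keep content anchored at the top-left
        padded.append(canvas)
    return padded
```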