Arabic Image-to-Markdown LoRA Adapter for Qwen2.5-VL-7B-Instruct
Model Card for presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora
Model Details
- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
- Library: PEFT (LoRA adapters)
- Developed by: Presight AI Technologies LLC
- Finetuned from: Qwen/Qwen2.5-VL-7B-Instruct
- Languages: Arabic (primary), English (secondary instructions)
- License: Apache 2.0 (inherits from base model)
- Task: Image-to-Markdown conversion on Arabic document images
Model Sources
- Repository: https://huggingface.co/presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora
- Demo: Coming soon (contact us for early testing)
Model Description
This LoRA adapter fine-tunes the Qwen2.5-VL-7B-Instruct vision-language model to:
- Take Arabic document images as input.
- Generate Markdown-formatted text as output, preserving tables, sections, and layout.
It was trained on ~40,000 crawled Arabic government and institutional PDFs, preprocessed into image–markdown pairs using an in-house LLM pipeline.
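The inference script under How to Get Started loads the adapter repository directly through the Transformers–PEFT integration. If you prefer to attach the adapter to the base model explicitly, a minimal sketch looks like the following; it assumes nothing beyond the base model and adapter IDs listed above.
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the frozen base model, then attach this repository's LoRA adapter.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base, "presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)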
Uses
Direct Use
- Converting scanned Arabic documents (PDF pages, images) to structured Markdown (a page-rendering sketch follows this list).
- Automating document digitization and layout reconstruction for Arabic corpora.
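A hypothetical preprocessing sketch for the PDF case: render each page to an image before passing it to the model. pdf2image is not among the pinned requirements later in this card and is assumed here purely for illustration (it also needs the poppler system package).
from pdf2image import convert_from_path  # assumed dependency, not pinned in this card

# Render each PDF page to a PIL image and save it as JPEG for the inference script.
pages = convert_from_path("document.pdf", dpi=200)
for i, page in enumerate(pages):
    page.convert("RGB").save(f"page_{i}.jpg", "JPEG")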
Downstream Use
- Fine-tuning on other multilingual document types.
- Integrating into OCR + LLM pipelines for Arabic-language archives.
Out-of-Scope Use
- Any medical, legal, or critical decision-making without human verification.
- Generating content in languages or formats not covered by the training data.
⚠ Bias, Risks, and Limitations
- Markdown outputs are LLM-generated and may not perfectly match original image layouts.
- OCR/LLM hallucinations may occur on poor-quality scans or unusual document types.
- The model was trained on crawled public data, which may contain inherent biases.
We recommend human review in all production deployments.
Training Details
- Dataset: 40,000+ Arabic document pages, filtered for tabular and structured layouts.
- Preprocessing: PDFs split into individual page images, each paired with LLM-generated Markdown annotations.
- Training framework: PEFT LoRA with the TRL SFTTrainer (a configuration sketch follows the hyperparameter list below).
- Hyperparameters:
- LoRA rank: 8
- LoRA alpha: 16
- Dropout: 0.05
- Optimizer: AdamW fused
- Batch size: 2 (with gradient accumulation 8)
- Epochs: 2
- Learning rate: 2e-4
- Max training sequence length: 1024 tokens
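A hedged sketch of the training configuration implied by the hyperparameters above. The LoRA target modules, dataset, and collator are not documented in this card and are assumptions; trl is not pinned in the requirements, so field names may differ slightly across versions.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=8,                # LoRA rank
    lora_alpha=16,      # LoRA alpha
    lora_dropout=0.05,  # dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not documented
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen2.5vl-arabic-markdown-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    # The card reports a 1024-token training length; set the corresponding
    # sequence-length field (max_seq_length / max_length) for your trl version.
)

# trainer = SFTTrainer(model=model, args=training_args, train_dataset=train_dataset,
#                      peft_config=peft_config)  # dataset and collator omitted here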
Evaluation
- Testing data: a held-out 10% split of the Arabic document images
- Metrics: Eval loss (best checkpoint selected)
- Results: Consistent markdown structure generation; minor hallucinations in extreme cases
Environmental Impact
- Hardware: A100 80GB GPU
- Hours used: ~5–6 hours
- Cloud provider: Presight internal cloud
- Region: UAE
- Carbon estimate: [Pending calculation]
Technical Specifications
- Model type: Vision-Language Causal LM with LoRA adapter
- Framework: PEFT 0.14.0, Transformers, TRL SFTTrainer
- Compute infra: Multi-GPU A100 cluster (single node used for this run)
How to Get Started
Install PyTorch first:
pip install torch==2.6.0 torchvision==0.21.0
Then install the remaining pinned requirements:
pip install accelerate==1.7.0 av==14.4.0 certifi==2025.4.26 charset-normalizer==3.4.2 einops==0.8.1 filelock==3.18.0 flash_attn==2.7.4.post1 fsspec==2025.5.1 hf-xet==1.1.2 huggingface-hub==0.32.2 idna==3.10 Jinja2==3.1.6 MarkupSafe==3.0.2 mpmath==1.3.0 networkx==3.4.2 ninja==1.11.1.4 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-cusparselt-cu12==0.6.2 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 optimum==1.25.3 packaging==25.0 peft==0.14.0 pillow==11.2.1 psutil==7.0.0 PyYAML==6.0.2 qwen-vl-utils==0.0.8 regex==2024.11.6 requests==2.32.3 safetensors==0.5.3 sympy==1.13.1 tokenizers==0.21.1 tqdm==4.67.1 transformers==4.49.0 triton==3.2.0 typing_extensions==4.13.2 urllib3==2.4.0
Inference Example
import argparse
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Optional: restrict the process to a specific GPU; adjust or remove as needed.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # e.g. "0", "1", or "3"


def load_model_with_qlora(model_repo):
    """Load the adapter repo; the Transformers–PEFT integration attaches it to the base model."""
    print(f"Loading model from: {model_repo}")
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_repo,
        attn_implementation="flash_attention_2",
        device_map="auto",
        torch_dtype=torch.float16,
        use_cache=True,
    )
    # The processor (tokenizer + image processor) comes from the base model.
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)
    print("Model and processor loaded.")
    return model, processor


@torch.no_grad()
def inference(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Build a chat-style message containing the image and the prompt.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        }
    ]

    # Process the image and the prompt separately.
    image_inputs, _ = process_vision_info(messages)
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text_input],
        images=image_inputs,
        return_tensors="pt",
        padding=True,
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
    )

    # Drop the prompt tokens so only the generated Markdown is decoded.
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    decoded = processor.decode(generated, skip_special_tokens=True)
    return decoded.strip()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Qwen2.5-VL Inference Script")
    parser.add_argument(
        "--model_repo",
        type=str,
        required=True,
        help="Hugging Face repo, e.g., presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora",
    )
    parser.add_argument("--image_path", type=str, required=True, help="Path to input image (e.g., input.jpg)")
    parser.add_argument("--question", type=str, default="Extract in markdown <image>", help="Question/prompt")
    args = parser.parse_args()

    model, processor = load_model_with_qlora(args.model_repo)
    result = inference(model, processor, args.image_path, args.question)
    print("\n=== Model Output ===")
    print(result)
Command
python ocr.py --model_repo presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora --image_path input.jpg
Inference API and Smart Chunking for Large Pages
Because the 7B model has a limited per-image token budget, we built a FastAPI service that:
- Automatically resizes large input images to safe dimensions.
- Chunks images horizontally into strips that fit the per-image token limit (see the sketch at the end of this section).
- Runs batch inference when possible or falls back to per-chunk inference.
- Cleans and merges chunk outputs into a final full-page Markdown.
Behind the Scenes
- Pads chunks to uniform size for batching.
- Uses Flash Attention and fp16 for fast decoding.
- Applies post-cleanup to strip redundant tokens, system prompts, and format artifacts.
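The FastAPI service itself is not part of this repository; the following is a hypothetical sketch of the horizontal-chunking and merge steps described above, reusing the inference() function from the script in How to Get Started. The strip height and temporary file paths are illustrative assumptions.
from PIL import Image

def chunk_image_horizontally(image: Image.Image, strip_height: int = 1024):
    """Split a tall page image into horizontal strips that fit the per-image token budget."""
    width, height = image.size
    return [
        image.crop((0, top, width, min(top + strip_height, height)))
        for top in range(0, height, strip_height)
    ]

def markdown_for_page(model, processor, image_path: str, prompt: str) -> str:
    """Run per-strip inference and merge the cleaned outputs into one Markdown page."""
    image = Image.open(image_path).convert("RGB")
    parts = []
    for i, strip in enumerate(chunk_image_horizontally(image)):
        strip_path = f"/tmp/_strip_{i}.jpg"  # reuse the file-based inference() API
        strip.save(strip_path, "JPEG")
        parts.append(inference(model, processor, strip_path, prompt).strip())
    return "\n\n".join(p for p in parts if p)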