
Arabic Image-to-Markdown LoRA Adapter for Qwen2.5-VL-7B-Instruct


Model Card for presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora


Model Details

  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Library: PEFT (LoRA adapters)
  • Developed by: Presight AI Technologies LLC
  • Finetuned from: Qwen/Qwen2.5-VL-7B-Instruct
  • Languages: Arabic (primary), English (secondary instructions)
  • License: Apache 2.0 (inherits from base model)
  • Task: Image-to-Markdown conversion on Arabic document images

Model Description

This LoRA adapter fine-tunes the Qwen2.5-VL-7B-Instruct vision-language model to:

  • Take Arabic document images as input.
  • Generate Markdown-formatted text as output, preserving tables, sections, and layout.

It was trained on ~40,000 crawled Arabic government and institutional PDFs, preprocessed into image–markdown pairs using an in-house LLM pipeline.
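
The in-house LLM annotation pipeline is not published. The snippet below is only a minimal sketch of the PDF-to-image preprocessing step, assuming pdf2image (Poppler) for rasterization; the resulting page images would then be paired with LLM-generated Markdown.

import os
from pdf2image import convert_from_path  # requires Poppler to be installed

def pdf_to_page_images(pdf_path, out_dir, dpi=200):
    """Rasterize each PDF page to a PNG, ready to be paired with a Markdown annotation."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        path = os.path.join(out_dir, f"page_{i:04d}.png")
        page.save(path)
        paths.append(path)
    return paths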


Uses

Direct Use

  • Converting scanned Arabic documents (PDF pages, images) to structured Markdown.
  • Automating document digitization and layout reconstruction for Arabic corpora.

Downstream Use

  • Fine-tuning on other multilingual document types.
  • Integrating into OCR + LLM pipelines for Arabic-language archives.

Out-of-Scope Use

  • Any medical, legal, or critical decision-making without human verification.
  • Generating content in languages or formats not covered by the training data.

⚠ Bias, Risks, and Limitations

  • Markdown outputs are LLM-generated and may not perfectly match original image layouts.
  • OCR/LLM hallucinations may occur on poor-quality scans or unusual document types.
  • The model was trained on crawled public data, which may contain inherent biases.

We recommend human review in all production deployments.


Training Details

  • Dataset: 40,000+ Arabic document pages, filtered for tabular and structured layouts.
  • Preprocessing: PDFs split into individual images, paired with LLM-generated markdown annotations.
  • Training framework: PEFT LoRA + TRL SFTTrainer (a configuration sketch follows the hyperparameter list).
  • Hyperparameters:
    • LoRA rank: 8
    • LoRA alpha: 16
    • Dropout: 0.05
    • Optimizer: AdamW fused
    • Batch size: 2 (with gradient accumulation 8)
    • Epochs: 2
    • Learning rate: 2e-4
    • Max training sequence length: 1024 tokens
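
The training script itself is not included in this card. The following is a minimal configuration sketch implied by the hyperparameters above, assuming PEFT's LoraConfig and TRL's SFTConfig/SFTTrainer; the target modules, dataset, and collator are placeholders, not the actual setup.

from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=8,                               # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules=[...]             # not documented in this card
)

training_args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,     # effective batch size of 16
    num_train_epochs=2,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    max_seq_length=1024,               # training sequence length above
    output_dir="qwen2.5vl-arabic-markdown-lora",   # placeholder
)

# trainer = SFTTrainer(
#     model=model,                     # Qwen2.5-VL-7B-Instruct
#     args=training_args,
#     train_dataset=train_dataset,     # image–Markdown pairs (not published)
#     data_collator=collate_fn,        # vision-language collator (placeholder)
#     peft_config=peft_config,
# )
# trainer.train()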

Evaluation

  • Testing data: a held-out 10% split of the Arabic document images
  • Metrics: Eval loss (best checkpoint selected)
  • Results: Consistent markdown structure generation; minor hallucinations in extreme cases

Environmental Impact

  • Hardware: A100 80GB GPU
  • Hours used: ~5–6 hours
  • Cloud provider: Presight internal cloud
  • Region: UAE
  • Carbon estimate: [Pending calculation]

Technical Specifications

  • Model type: Vision-Language Causal LM with LoRA adapter
  • Framework: PEFT 0.14.0, Transformers, TRL SFTTrainer
  • Compute infra: Multi-GPU A100 cluster (single node used for this run)

How to Get Started

pip install torch==2.6.0 torchvision==0.21.0

Then install the remaining pinned requirements (contents of requirements.txt):

accelerate==1.7.0 av==14.4.0 certifi==2025.4.26 charset-normalizer==3.4.2 einops==0.8.1 filelock==3.18.0 flash_attn==2.7.4.post1 fsspec==2025.5.1 hf-xet==1.1.2 huggingface-hub==0.32.2 idna==3.10 Jinja2==3.1.6 MarkupSafe==3.0.2 mpmath==1.3.0 networkx==3.4.2 ninja==1.11.1.4 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-cusparselt-cu12==0.6.2 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 optimum==1.25.3 packaging==25.0 peft==0.14.0 pillow==11.2.1 psutil==7.0.0 PyYAML==6.0.2 qwen-vl-utils==0.0.8 regex==2024.11.6 requests==2.32.3 safetensors==0.5.3 sympy==1.13.1 tokenizers==0.21.1 tqdm==4.67.1 transformers==4.49.0 triton==3.2.0 typing_extensions==4.13.2 urllib3==2.4.0
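
Because this repository is a PEFT LoRA adapter, it can be loaded directly through the transformers PEFT integration (as in the script below) or attached to the base model explicitly. A minimal sketch of the explicit route, assuming a standard adapter_config.json in the adapter repo:

import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base,
    "presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)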

Inference Example

import argparse
import os

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Restrict inference to a single GPU; change the index to match your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # or "3"

def load_model_with_qlora(model_repo):
    print(f"Loading model from: {model_repo}")
    # With peft installed, transformers resolves the adapter repo to its base
    # model and attaches the LoRA weights automatically.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_repo,
        attn_implementation="flash_attention_2",
        device_map="auto",
        torch_dtype=torch.float16,
        use_cache=True
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", use_fast=True)
    print("Model and processor loaded.")
    return model, processor

@torch.no_grad()
def inference(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Build the chat message (a role is required by apply_chat_template)
    messages = [{
        'role': 'user',
        'content': [
            {'type': 'image', 'image': image},
            {'type': 'text', 'text': question}
        ]
    }]

    # Process image and prompt separately
    image_input = process_vision_info(messages)[0]
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=text_input,
        images=image_input,
        return_tensors="pt",
        padding=True
    ).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1
    )

    # Drop the prompt tokens so only the generated Markdown is returned
    generated = outputs[0][inputs.input_ids.shape[1]:]
    decoded = processor.decode(generated, skip_special_tokens=True)
    return decoded.strip()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Qwen2.5-VL Inference Script")
    parser.add_argument("--model_repo", type=str, required=True, help="Hugging Face repo, e.g., presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora")
    parser.add_argument("--image_path", type=str, required=True, help="Path to input image (e.g., input.jpg)")
    parser.add_argument("--question", type=str, default="Extract in markdown <image>", help="Question/prompt")

    args = parser.parse_args()

    model, processor = load_model_with_qlora(args.model_repo)
    result = inference(model, processor, args.image_path, args.question)

    print("\n=== Model Output ===")
    print(result)

Command

   python ocr.py --model_repo presightai/arabic-image-to-markdown-qwen2.5vl-7b-instruct-lora --image_path input.jpg

Inference API and Smart Chunking to Handle Large Token Counts

To work within the model's per-request token and image limits, we built a FastAPI service that (see the chunking sketch after this list):

  • Automatically resizes large input images to safe dimensions.
  • Chunks images horizontally to fit maximum token/image limits.
  • Runs batch inference when possible or falls back to per-chunk inference.
  • Cleans and merges chunk outputs into a final full-page Markdown.
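
The FastAPI service itself is not published. The snippet below is only an illustrative sketch of the chunking step, interpreting "horizontal chunking" as splitting a tall page into horizontal strips from top to bottom; MAX_HEIGHT and OVERLAP are assumed values, not the service's actual settings.

from PIL import Image

MAX_HEIGHT = 1024   # assumed per-chunk pixel budget
OVERLAP = 64        # small overlap so text lines and table rows are not cut in half

def chunk_image_horizontally(image: Image.Image):
    """Split a tall page image into strips that fit the per-image token budget."""
    width, height = image.size
    if height <= MAX_HEIGHT:
        return [image]
    chunks, top = [], 0
    while top < height:
        bottom = min(top + MAX_HEIGHT, height)
        chunks.append(image.crop((0, top, width, bottom)))
        if bottom == height:
            break
        top = bottom - OVERLAP
    return chunks

# Each strip is then passed through inference() above, and the per-chunk Markdown
# outputs are cleaned and concatenated into the final full-page Markdown.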

Behind the Scenes

  • Pads chunks to a uniform size for batching (see the sketch below).
  • Uses Flash Attention and fp16 for fast decoding.
  • Applies post-cleanup to strip redundant tokens, system prompts, and format artifacts.
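
As with the chunking step, the padding logic below is only a hedged sketch of how chunks could be padded to a uniform size for batching; white padding anchored at the top-left is an assumption.

from PIL import Image

def pad_to_uniform(chunks):
    """Pad all chunks to the same width and height so they can be batched together."""
    max_w = max(c.width for c in chunks)
    max_h = max(c.height for c in chunks)
    padded = []
    for c in chunks:
        canvas = Image.new("RGB", (max_w, max_h), "white")
        canvas.paste(c, (0, 0))   # original content stays in the top-left corner
        padded.append(canvas)
    return padded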