This model is part of the llava-v1.5-7b-gpt4OCR collection (5 items): checkpoints for llava-v1.5-7b fine-tuned for OCR tasks.
AWQ quant for nnethercott/llava-v1.5-7b-gpt4OCR-hf. The autoawq quantization config is included in the repo files.
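For reference, producing an AWQ checkpoint like this one with autoawq follows the library's standard recipe, sketched below. This is only a sketch under assumptions: the quant_config values shown are common autoawq defaults, not the confirmed settings for this checkpoint (see the quantization config in the repo files for the actual values), and quantizing a multimodal LLaVA checkpoint may need extra handling beyond this.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "nnethercott/llava-v1.5-7b-gpt4OCR-hf"  # base model to quantize
quant_path = "llava-v1.5-7b-gpt4OCR-hf-AWQ"          # output directory

# Typical 4-bit AWQ settings; the actual values for this repo live in its config
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate and quantize, then save the quantized weights plus config
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path, safetensors=True)
tokenizer.save_pretrained(quant_path)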
The two datasets used for fine-tuning are:
We use 10k samples from GRIT where each sample has an image-caption CLIP similarity larger than 0.35 and where the caption does not contain any proper nouns (filtered using spaCy); a sketch of this filter is shown below.
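As an illustration, the GRIT filtering described above might look roughly like the following. This is a hedged sketch, not the actual preprocessing script: it assumes each GRIT record already carries a precomputed CLIP similarity score (the field names clip_similarity and caption are illustrative) and uses spaCy part-of-speech tags to reject captions containing proper nouns.

import spacy

nlp = spacy.load("en_core_web_sm")

def keep_sample(record, min_clip_sim=0.35):
    """Return True if a GRIT record passes both filters.

    Assumes `record` is a dict with a precomputed image-caption CLIP
    similarity under 'clip_similarity' and the caption under 'caption'.
    """
    if record["clip_similarity"] <= min_clip_sim:
        return False
    doc = nlp(record["caption"])
    # Reject captions containing any proper noun (PROPN tag)
    return not any(token.pos_ == "PROPN" for token in doc)

grit_records = [
    {"caption": "a red car parked near a billboard", "clip_similarity": 0.41},
    {"caption": "Elon Musk standing next to a Tesla", "clip_similarity": 0.52},
]
filtered = [r for r in grit_records if keep_sample(r)]  # second record dropped (proper nouns)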
Use the code below to get started with the model:
import time

import requests
import torch
from PIL import Image
from transformers import AutoProcessor
from awq import AutoAWQForCausalLM

awq_model_id = "nnethercott/llava-v1.5-7b-gpt4OCR-hf-AWQ"

# Load the processor and the AWQ-quantized model onto GPU 0
processor = AutoProcessor.from_pretrained(awq_model_id)
model = AutoAWQForCausalLM.from_quantized(
    awq_model_id,
    safetensors=True,
    device_map={"": 0},
    fuse_layers=False,
)

image_url = "https://adquick-public.imgix.net/landing+images/media_formats/billboard-carvana.png?auto=format"
prompt = "USER: <image>\ngenerate a descriptive caption for this image. ASSISTANT: "
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Example generation settings (not specified in the original snippet; adjust to taste)
generation_kwargs = {"max_new_tokens": 256, "do_sample": False}

with torch.no_grad():
    inputs = processor(prompt, image, return_tensors="pt").to(0, torch.float16)
    start = time.perf_counter()
    out = model.generate(
        **inputs,
        **generation_kwargs,
    )
    stop = time.perf_counter()

# Strip the prompt tokens from the output before decoding
prompt_len = len(processor.tokenizer.encode(prompt))
print(processor.tokenizer.batch_decode(out[:, prompt_len:], skip_special_tokens=True)[0])
print(f"generation speed: {round(len(out[0]) / (stop - start), 1)} [t/s]")
Output for nnethercott/llava-v1.5-7b-gpt4OCR-hf-AWQ:
The image captures a Carvana billboard under a clear blue sky, showcasing a red sports car being towed by a white Carvana truck. The billboard prominently features the Carvana logo and the slogan "Buy your next car from your couch."