
eagle0504/llava-video-text-model

A LLaVA model fine-tuned on video-text data using DeepSpeed.

Model Details

  • Base model: llava-hf/llava-interleave-qwen-7b-hf
  • Architecture: LLaVA (Large Language and Vision Assistant)
  • Training samples: 4 videos
  • Training: Multi-GPU with DeepSpeed ZeRO Stage 2
  • Task: Video-text conversation generation
  • Video frames: 5 frames per video

Usage

from PIL import Image  # frames are passed to the processor as PIL images
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("eagle0504/llava-video-text-model")
model = LlavaForConditionalGeneration.from_pretrained(
    "eagle0504/llava-video-text-model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

# Define conversation with multiple images for video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process video frames (5 per video; see "Video Processing" below for extraction)
video_frames = [...]  # List of 5 PIL Images from the video
inputs = processor(images=video_frames, text=prompt, return_tensors='pt').to(0, torch.float16)

# Generate response
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
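Note that decoding output[0] returns the prompt text followed by the answer. With recent transformers releases, where the processor expands the image placeholder tokens directly in input_ids, the prompt can be sliced off so only the newly generated tokens remain (a small sketch, assuming that token layout):

# Keep only the tokens generated after the prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))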

Training Configuration

  • DeepSpeed ZeRO Stage 2 (config sketched after this list)
  • Mixed precision (BF16)
  • AdamW optimizer
  • Learning rate: 5e-5
  • Video frames per sample: 5
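
The settings above map onto a DeepSpeed configuration along these lines. This is a minimal sketch, not the exact config used for training: only the stage, precision, optimizer, and learning rate come from the list above; the "auto" values and the Trainer hand-off are assumptions.

ds_config = {
    "zero_optimization": {"stage": 2},          # ZeRO Stage 2
    "bf16": {"enabled": True},                  # mixed precision (BF16)
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5},                 # learning rate from the list above
    },
    "train_micro_batch_size_per_gpu": "auto",   # assumption: let the HF Trainer fill these in
    "gradient_accumulation_steps": "auto",
}

# With the Hugging Face Trainer, a dict like this can be passed as
# TrainingArguments(deepspeed=ds_config, ...)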

Video Processing

This model expects 5 frames extracted from each video. For best results (see the sketch after this list):

  1. Extract evenly spaced frames from your video
  2. Resize frames to the model's expected input size
  3. Pass frames as a list to the processor
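
A minimal frame-extraction sketch using OpenCV (the cv2-based approach and the file path are assumptions; any decoder that yields 5 evenly spaced RGB PIL images works):

import cv2
import numpy as np
from PIL import Image

def extract_frames(video_path, num_frames=5):
    """Extract num_frames evenly spaced frames from a video as PIL Images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)  # evenly spaced
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before wrapping in PIL
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

video_frames = extract_frames("my_video.mp4")  # hypothetical path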