
Model Card for ICT (Image-Contained-Text) Model

This model is a specialized text-image alignment evaluation function that quantifies the extent to which an image contains textual information, without penalizing high-quality images with rich visual details. See our paper Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment for more details.

Model Details

Model Description

The ICT (Image-Contained-Text) model represents a shift in text-image alignment evaluation by addressing a fundamental flaw in existing reward models: traditional metrics such as CLIP Score often assign low scores to images with rich details and high aesthetic value. ICT instead evaluates alignment based on how well an image contains the textual content of the prompt, so visually rich, high-quality images are not penalized.

Key Features

  • Threshold-Based Evaluation: Uses adaptive threshold mechanism instead of direct similarity scoring
  • Human Preference Aligned: Trained on hierarchical preference triplets from Pick-High dataset and Pick-a-pic dataset
  • Complementary Design: Works optimally when combined with the HP model for comprehensive evaluation (see the sketch below)
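
The card does not specify how the ICT and HP scores are combined. The sketch below illustrates one plausible option, multiplying the ICT alignment score by the preference score from a separately loaded HP model; the function name combined_reward and the multiplicative combination are illustrative assumptions, not the paper's confirmed formulation.

def combined_reward(ict_score: float, hp_score: float) -> float:
    # Hypothetical combination (assumption): ICT measures whether the image
    # contains the prompt's content, HP measures overall human preference.
    # Multiplying the two rewards images that satisfy both criteria.
    return ict_score * hp_score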

Model Sources

  • Paper: Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment (https://arxiv.org/abs/2507.19002)

How to Get Started with the Model

Installation

pip install torch transformers pillow numpy open-clip-torch hpsv2

Quick Start

# imports
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
from huggingface_hub import hf_hub_download
from hpsv2.src.open_clip import get_tokenizer

# load model
device = "cuda"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/ICT"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
preprocess_val = lambda img: processor(images=img, return_tensors="pt")["pixel_values"]

# Load ICT model: start from the base CLIP architecture, then load the ICT checkpoint from the Hub
ict_model = CLIPModel.from_pretrained(processor_name_or_path)
checkpoint_path = hf_hub_download(repo_id=model_pretrained_name_or_path, filename="pytorch_model.bin")
state_dict = torch.load(checkpoint_path, map_location="cpu")
ict_model.load_state_dict(state_dict, strict=False)
ict_model = ict_model.to(device)
ict_model.eval()

# Get tokenizer
tokenizer = get_tokenizer('ViT-H-14')

def calc_ict_scores(images, texts):
    # preprocess images into pixel-value tensors
    image_inputs = [preprocess_val(image).to(device) for image in images]

    ict_scores = []

    with torch.no_grad():
        for pixel_values, text in zip(image_inputs, texts):
            # extract and normalize image features
            image_features = ict_model.get_image_features(pixel_values=pixel_values)
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)

            # tokenize the prompt and extract normalized text features
            text_input_ids = tokenizer(text).to(device)
            text_features = ict_model.get_text_features(text_input_ids)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

            # ICT score: cosine similarity between text and image features
            ict_score = text_features @ image_features.T
            ict_scores.append(ict_score.squeeze().cpu().item())

    return ict_scores

pil_images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
texts = ["prompt for image1", "prompt for image2"]
scores = calc_ict_scores(pil_images, texts)
print(f"ICT Scores: {scores}")

Training Details

Training Objective

ICT Scoring Framework: Instead of direct CLIP similarity, ICT employs a threshold-based mechanism:

C(I,P) = min(CLIP(I,P) / θ, 1)
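
A minimal sketch of this thresholding in code (theta here is a placeholder value; the paper determines the threshold adaptively):

import torch

def contain_score(clip_similarity: torch.Tensor, theta: float) -> torch.Tensor:
    # C(I, P) = min(CLIP(I, P) / theta, 1): similarities at or above the
    # threshold saturate at 1 ("the image fully contains the text");
    # lower similarities are scaled linearly.
    return torch.clamp(clip_similarity / theta, max=1.0)

# with a placeholder threshold of 0.35, a raw CLIP similarity of 0.28 maps to 0.8
# and 0.40 saturates at 1.0
print(contain_score(torch.tensor([0.28, 0.40]), theta=0.35))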

ICT models are trained using hierarchical scoring with triplet rankings:

  • E₃ = 1: High-quality images with refined prompts
  • E₂ = C(I₂, P_easy): Medium-quality images
  • E₁ = min(C(I₁, P_easy), E₂): Low-quality images
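
For illustration, the three targets of one triplet could be computed as below, assuming clip_sim_medium and clip_sim_low are the raw CLIP similarities of the medium- and low-quality images with the simple prompt, and contain_score is the thresholding sketch above (the concrete numbers are placeholders):

# placeholder raw CLIP similarities against the simple ("easy") prompt
clip_sim_medium = torch.tensor(0.30)
clip_sim_low = torch.tensor(0.33)
theta = 0.35  # placeholder threshold

e3 = torch.tensor(1.0)                                      # E3: high-quality image, refined prompt
e2 = contain_score(clip_sim_medium, theta)                  # E2 = C(I2, P_easy)
e1 = torch.minimum(contain_score(clip_sim_low, theta), e2)  # E1 = min(C(I1, P_easy), E2)

# the min() preserves the ranking E1 <= E2 <= E3 even when the low-quality image
# happens to get a higher raw similarity than the medium-quality one
print(e1.item(), e2.item(), e3.item())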

Training Data

This model was trained on 360,000 preference triplets from Pick-High dataset and Pick-a-pic dataset.

Citation

@misc{ba2025enhancingrewardmodelshighquality,
      title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment}, 
      author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
      year={2025},
      eprint={2507.19002},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.19002}, 
}