行莫
Published 2025-12-04

ICT & HP: Enhancing Reward Models for High-Quality Image Generation

Overview

In high-quality image generation, existing reward models share a key flaw: they improperly assign low scores to images with rich detail and high aesthetic value, which diverges markedly from actual human aesthetic preferences. To address this, the ICTHP project proposes a two-component framework that evaluates generated images with two complementary reward models:

  • ICT (Image-Contained-Text) reward model: measures how fully an image contains the information in its text prompt, without penalizing visual richness
  • HP (High-Preference) reward model: an image-only model of aesthetic quality and human preference

Together, the two models form a complete image-quality evaluation system, providing an assessment standard for high-quality image generation that is more accurate and better aligned with human preferences.

ICT Model: the Image-Contained-Text Reward Model

Core Idea

The ICT (Image-Contained-Text) model adopts a novel contrastive learning objective that evaluates text-image alignment through a hierarchical prompt structure. Its key innovation: it learns to extract features from base-prompt and refined-prompt image pairs, mitigating the bias that existing alignment metrics hold against visually rich content.
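The paper's exact objective is not reproduced here; the following is a minimal sketch of one plausible margin-based form such a hierarchical contrastive objective could take, assuming L2-normalized CLIP features and Pick-High-style base/refined pairs (the function name, tensor layout, and margin value are all illustrative):

import torch.nn.functional as F


def hierarchical_ict_loss(img_easy, img_refine, txt_refine, margin=0.1):
    # Illustrative assumption, not the paper's exact loss: the refined image
    # should "contain" the refined prompt better than the base image does.
    # All tensors are assumed L2-normalized with shape (batch, dim), so
    # row-wise dot products are cosine similarities.
    s_easy = (img_easy * txt_refine).sum(dim=-1)      # base image vs. refined prompt
    s_refine = (img_refine * txt_refine).sum(dim=-1)  # refined image vs. refined prompt
    return F.relu(margin - (s_refine - s_easy)).mean()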

Technical Architecture

The ICT model is built on the CLIP (Contrastive Language-Image Pre-training) architecture. Specifically:

  • Base model: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
  • Input: an image and its corresponding text description
  • Output: a text-image similarity score (the ICT score)

How It Works

  1. Feature extraction

    • The image is encoded by CLIP's image encoder
    • The text is encoded by CLIP's text encoder
    • Both feature vectors are L2-normalized
  2. Similarity computation

    • Text and image features are compared via a dot product
    • The higher the similarity, the more fully the image contains the information described by the text

Code Example

import os

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
from hpsv2.src.open_clip import get_tokenizer

# load model
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Using device: {device}")

processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = './models/8y/ICT'

processor = CLIPProcessor.from_pretrained(processor_name_or_path, cache_dir="./models/")
preprocess_val = lambda img: processor(images=img, return_tensors="pt")["pixel_values"]

# Load ICT model
ict_model = CLIPModel.from_pretrained(processor_name_or_path, cache_dir="./models/")
checkpoint_path = f"{model_pretrained_name_or_path}/pytorch_model.bin"
state_dict = torch.load(checkpoint_path, map_location="cpu")
ict_model.load_state_dict(state_dict, strict=False)
ict_model = ict_model.to(device)
ict_model.eval()

# Get tokenizer
tokenizer = get_tokenizer('ViT-H-14')


def calc_ict_scores(images, texts):
    # preprocess images into pixel tensors
    pixel_tensors = []
    for image in images:
        pixel_values = preprocess_val(image).to(device)
        pixel_tensors.append(pixel_values)

    ict_scores = []

    with torch.no_grad():
        for pixel_values, text in zip(pixel_tensors, texts):
            # extract and L2-normalize image features
            image_ict_features = ict_model.get_image_features(pixel_values=pixel_values)
            image_ict_features = image_ict_features / image_ict_features.norm(dim=-1, keepdim=True)

            # tokenize the prompt and extract L2-normalized text features
            text_input_ids = tokenizer(text).to(device)
            text_features_ict = ict_model.get_text_features(text_input_ids)
            text_features_ict = text_features_ict / text_features_ict.norm(dim=-1, keepdim=True)

            # ICT score = cosine similarity between text and image features
            ict_score = text_features_ict @ image_ict_features.T
            ict_scores.append(ict_score.squeeze().cpu().item())

    return ict_scores


pil_images = [Image.open("./images/image1.png"), Image.open("./images/image2.png"), Image.open("./images/image3.jpg")]
texts = [
    'Nativity scene in a vibrant stained glass art style, featuring Mary, Joseph, and Baby Jesus nestled in a radiant manger, beneath a luminous star of Bethlehem. The holy family is adorned in flowing robes of deep sapphire blue and pristine ivory, accented with shimmering gold details that catch the light. In the foreground, a gentle lamb gazes up, symbolizing peace and purity. The night sky above is a dynamic gradient of rich indigo and twilight purple, intricately scattered with twinkling silver stars, enhancing the celestial ambiance. The text “COME LET US” elegantly arches in the upper left, crafted in ornate golden script, surrounded by lush evergreen pine branches delicately frosted with soft white snowflakes. Vibrant red cardinal birds perch amongst the foliage, adding splashes of color. The entire composition is enriched with intricate glass patterns, where geometric fragments interact with radiant beams of light emanating from the star, casting a heavenly glow upon the scene. The traditional biblical setting of the stable is transformed into a warm, inviting sanctuary, each element harmoniously blended to evoke a profound sense of wonder and reverence in this Christian Christmas theme. The rounded 3D style brings depth and dimension, making the figures and elements appear lifelike and engaging in this striking stained glass masterpiece.',
    'Nativity scene in a rounded 3D stained glass art style, featuring Mary and Joseph tenderly gazing at Baby Jesus nestled in the manger, with an iridescent bright star of Bethlehem above casting luminous beams. A fluffy lamb stands playfully in the foreground, accentuating their gentle surroundings. Mary is adorned in flowing robes of deep royal blue and soft silver, while Joseph complements her in rich emerald greens with golden tones that catch the light. The night sky is a deep twilight violet speckled with shimmering silver stars, creating a serene backdrop. In the upper left corner, the words "COME LET US" are elegantly etched in gold, surrounded by lush, emerald pine branches and vibrant red cardinal birds that add a splash of color. Delicate snowflakes drift gracefully through the scene, enhancing the wintry feel. The stained glass is adorned with intricate patterns of vibrant colors, with blue dominating harmoniously alongside deep reds and greens, forming a captivating mosaic. Radiant light beams burst forth from the star, illuminating the holy family in their stable setting, surrounded by soft golden accents. This traditional biblical scene comes alive with colorful geometric glass fragments, all harmoniously arranged to create a divine and immersive visual experience.',
    '温馨 This is a vibrant photograph capturing two young white girls standing side by side outdoors on a sunny day. The girl on the left has light skin and is wearing large, round glasses, a light long-sleeve shirt, and jeans. She has her hair tied back and is making a peace sign with her left hand. The girl on the right has slightly darker skin and is dressed in a cream-colored long-sleeve shirt and jeans. She has her hair pulled back into a ponytail and is smiling broadly, also making a peace sign with her right hand. Both girls are standing in front of a white table with a glass jar containing a small fish in clear water. The background shows a lush, green grassy field with trees and a colorful, tassel-like decoration hanging above. The overall atmosphere is cheerful and lively.'
]

scores = calc_ict_scores(pil_images, texts)
print(f"ICT Scores: {scores}")

Strengths

  • Unbiased evaluation: does not lower a score just because an image is visually rich
  • Hierarchical learning: handles prompts ranging from simple to complex
  • Semantic alignment: accurately assesses whether an image truly contains the content described by the text

HP Model: the High-Preference Reward Model

Core Idea

The HP (High-Preference) model represents a paradigm shift: it evaluates images purely in the image modality, with no text input. Once text-image alignment saturates (ICT score approaching 1), the HP model can continue to discriminate image quality based on the aesthetic and perceptual factors that matter most to human observers.
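As a concrete illustration of this saturation behavior, here is a minimal sketch (the function name and threshold are illustrative assumptions) that falls back to HP-based ranking among images whose ICT scores are effectively saturated:

def rank_when_ict_saturates(ict_scores, hp_scores, ict_threshold=0.9):
    # Among images whose text-image alignment is effectively saturated,
    # the ordering is decided purely by the image-only HP preference score.
    saturated = [i for i, s in enumerate(ict_scores) if s >= ict_threshold]
    return sorted(saturated, key=lambda i: hp_scores[i], reverse=True)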

Technical Architecture

The HP model consists of two main components:

  1. CLIP backbone: extracts image features

    • Based on laion/CLIP-ViT-H-14-laion2B-s32B-b79K
    • Outputs a 1024-dimensional image feature vector
  2. Multi-layer perceptron (MLP) scorer

    • Input: the 1024-dimensional image features
    • Structure: 1024 → 1024 → 128 → 64 → 16 → 1
    • Uses Dropout to prevent overfitting
    • The output is mapped into [0, 1] by a Sigmoid activation

How It Works

  1. Feature extraction

    • The image passes through the CLIP backbone, yielding a 1024-dimensional feature vector
  2. Preference prediction

    • The feature vector is fed through the MLP scorer
    • The output is passed through a Sigmoid, giving a preference score between 0 and 1

Code Example

# import
import os

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)


# load model
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Using device: {device}")

processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "./models/8y/HP"

processor = CLIPProcessor.from_pretrained(processor_name_or_path, cache_dir="./models/")
backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone",
                                     cache_dir="./models/8y/HP/hp_backbone").eval().to(device)
scorer = MLP()
scorer.load_state_dict(
    torch.load(f"{model_pretrained_name_or_path}/hp_scorer/mlp_pytorch_model.bin", map_location='cpu'))
scorer = scorer.eval().to(device)


def calc_hp_scores(images):
    # preprocess
    image_inputs = processor(
        images=images,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        # extract features
        image_features = backbone.get_image_features(**image_inputs)

        # calculate hp scores
        hp_scores = torch.sigmoid(scorer(image_features))

    # squeeze only the score dim so a single image still yields a list
    return hp_scores.squeeze(-1).cpu().tolist()


pil_images = [Image.open("./images/image1.png"), Image.open("./images/image2.png"), Image.open("./images/image3.jpg"),
              Image.open("./images/Mona-Lisa-Smile.jpg"), Image.open('./images/image4.jpg'),
              Image.open('./images/image5.jpg'),
              Image.open('./images/image6.jpg'), Image.open('./images/image7.jpg'), Image.open('./images/image8.jpg'),
              Image.open('./images/image9.jpg'), Image.open('./images/image10.jpg'), Image.open('./images/image11.jpg'),
              Image.open('./images/image12.jpg'), Image.open('./images/image13.jpg'),
              Image.open('./images/image14.jpg'), Image.open('./images/image15.jpg'),
              Image.open('./images/image16.jpg')]
scores = calc_hp_scores(pil_images)

print("HP Scores:")
for i in range(0, len(scores), 4):
    row = scores[i:i+4]
    print("  " + "  ".join(f"{score:.4f}" for score in row))

Strengths

  • Pure visual evaluation: requires no text input and focuses on the aesthetic quality of the image itself
  • Human-preference alignment: trained on the Pick-High and Pick-a-Pic datasets to learn real human preferences
  • Orthogonal evaluation: provides a quality dimension complementary to the ICT model

How the Two Models Work Together

Complementary Roles

The ICT and HP models form a complementary pair:

  1. ICT model: asks "does the image accurately reflect the text description?"

    • Focuses on semantic alignment
    • Suited to quality control in text-to-image generation
  2. HP model: asks "does the image have high aesthetic quality?"

    • Focuses on visual aesthetics
    • Suited to image-only quality assessment

Use Cases

  • Image generation optimization: combine the two scores to optimize semantic accuracy and aesthetic quality simultaneously
  • Image filtering: use ICT to ensure semantic correctness and HP to ensure aesthetic quality
  • Quality assessment: a multi-dimensional evaluation system for image quality

Training Data: the Pick-High Dataset

Both models are trained on Pick-High, a high-quality dataset:

  • Scale: 360,000 images
  • Source: generated with SD3.5-Large, with prompts refined via chain-of-thought reasoning in Claude-3.5-Sonnet
  • Annotation: combined with Pick-a-Pic to form image triplets with comprehensive preference labels
  • Layout (a loading sketch follows this list)
    • pick_easy_img/: base-quality images
    • pick_refine_img/: high-quality refined images
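As referenced above, here is a minimal sketch of consuming this layout; the convention that easy/refined counterparts share a filename across the two directories is an assumption, not documented here:

import os

from PIL import Image


def iter_easy_refine_pairs(root):
    # Yield (base-quality, refined) image pairs from the Pick-High layout.
    easy_dir = os.path.join(root, "pick_easy_img")
    refine_dir = os.path.join(root, "pick_refine_img")
    for name in sorted(os.listdir(easy_dir)):
        refine_path = os.path.join(refine_dir, name)
        if os.path.exists(refine_path):
            yield Image.open(os.path.join(easy_dir, name)), Image.open(refine_path)


# Sanity check with the HP scorer defined above: the refined image of a pair
# should usually receive the higher preference score.
# for easy_img, refine_img in iter_easy_refine_pairs("./pick_high"):
#     easy_score, refine_score = calc_hp_scores([easy_img, refine_img])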

Obtaining and Using the Models

Model Download

Both models are available from Hugging Face (the 8y/ICT and 8y/HP repositories).

Requirements

  • Python 3.8+
  • PyTorch 1.12+
  • A CUDA-capable GPU (8 GB+ VRAM recommended)
  • 16 GB+ RAM (for training)

Quick Start

  1. Install dependencies (hpsv2 is needed for the ICT tokenizer import)
pip install torch transformers pillow hpsv2
  2. Download the models
# using huggingface-cli
huggingface-cli download 8y/ICT
huggingface-cli download 8y/HP
  3. Run the example code
    • See ict_example.py for the ICT model
    • See hp_example.py for the HP model

Example Applications

Case 1: Quality Control for Image Generation

In a text-to-image pipeline, both models can be applied together:

# after the image is generated
ict_scores = calc_ict_scores([generated_image], [prompt])
hp_scores = calc_hp_scores([generated_image])

# combined evaluation
if ict_scores[0] > 0.7 and hp_scores[0] > 0.6:
    print("High-quality image: semantically accurate and aesthetically strong")

Case 2: Ranking Images

Ranking multiple candidate images:

images = [img1, img2, img3, ...]
prompts = [prompt] * len(images)

ict_scores = calc_ict_scores(images, prompts)
hp_scores = calc_hp_scores(images)

# combined score (adjust the weights as needed)
combined_scores = [0.5 * ict + 0.5 * hp for ict, hp in zip(ict_scores, hp_scores)]
sorted_indices = sorted(range(len(combined_scores)), key=lambda i: combined_scores[i], reverse=True)

Technical Details

ICT Model Notes

  • Tokenizer: the ViT-H-14 tokenizer from hpsv2.src.open_clip
  • Feature normalization: L2 normalization places features on the unit sphere
  • Similarity: computed as a dot product, so scores typically fall in [-1, 1]
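A quick self-contained numeric check of the last two points:

import torch
import torch.nn.functional as F

# After L2 normalization, the dot product of two feature vectors equals their
# cosine similarity and is therefore bounded in [-1, 1].
a = F.normalize(torch.randn(1, 1024), dim=-1)
b = F.normalize(torch.randn(1, 1024), dim=-1)
print((a @ b.T).item(), F.cosine_similarity(a, b).item())  # equal up to float error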

HP Model Notes

  • MLP architecture: a deep stack that reduces dimensionality step by step
  • Dropout strategy: different rates per layer (0.2, 0.2, 0.1)
  • Activation: a final Sigmoid keeps the output in [0, 1]

Paper and Citation

This work was published at ICCV 2025 under the title "Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment".

If you use this work, please cite:

@misc{ba2025enhancingrewardmodelshighquality,
      title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment}, 
      author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
      year={2025},
      eprint={2507.19002},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.19002}, 
}

Summary

The ICT and HP models together provide a comprehensive evaluation framework for high-quality image generation: ICT ensures that a generated image accurately reflects its text description, while HP ensures it has high aesthetic value. Used together, they can substantially improve the quality of image generation and user satisfaction.

With these two complementary reward models, researchers and developers can:

  • Evaluate generated images more accurately
  • Improve the training of image generation models
  • Build image generation systems better aligned with human preference
