Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Yuiga Wada , Kanta Kaneda , Daichi Saito , Komei Sugiura

CVPR 2024
Highlight (top 3.6%)



Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments collected from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.


> Usage (How to evaluate your model?)

1. Install Polos.

pip install polos

2. Add the following code:

from polos.models import download_model, load_checkpoint
from PIL import Image

model_path = download_model("polos")
model = load_checkpoint(model_path)

# Data must be a list of dicts in the following format:
data = [
    {
        "img": Image.open("test.png").convert("RGB"),
        "mt": "a dog with a person",
        "refs": [
            "there is a dog sitting on a couch with a person reaching out",
            "a dog laying on a couch with a person",
            "a dog is laying on a couch with a person",
        ],
    }
]

_, scores = model.predict(data, batch_size=8, cuda=True)

Fig 2. Overview of the proposed metric. In alignment with the principles of $\mathrm{M^2LHF}$, Polos computes the evaluation $\hat{y}$ based on multimodal inputs and regresses the human evaluation. The proposed metric extracts effective features for caption evaluation using the difference and Hadamard product of features derived from both CLIP and RoBERTa.
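The difference and Hadamard product mentioned in the caption can be illustrated with a minimal sketch. Here, random NumPy vectors stand in for the CLIP and RoBERTa embeddings, and `combine_features` is a hypothetical helper (not the actual Polos implementation) showing how candidate and reference features might be combined before regression:

```python
import numpy as np

def combine_features(cand, ref):
    """Combine candidate and reference embeddings via the absolute
    difference and the Hadamard (element-wise) product, then
    concatenate everything into one feature vector (illustrative)."""
    diff = np.abs(cand - ref)   # element-wise absolute difference
    hadamard = cand * ref       # Hadamard product
    return np.concatenate([cand, ref, diff, hadamard])

rng = np.random.default_rng(0)
d = 8  # toy embedding size; real CLIP/RoBERTa embeddings are much larger
cand = rng.standard_normal(d)  # stands in for a candidate-caption embedding
ref = rng.standard_normal(d)   # stands in for a reference-caption embedding

features = combine_features(cand, ref)
print(features.shape)  # (32,) -- four d-dimensional parts concatenated
```

In the actual metric, a regression head maps such combined features to the predicted human score $\hat{y}$.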

Polaris Dataset

The Polaris dataset offers a large-scale, diverse benchmark for evaluating image captioning metrics, surpassing existing datasets in terms of size, caption diversity, number of human judgments, and granularity of the evaluations. It includes 131,020 generated captions and 262,040 reference captions. The generated captions have a vocabulary of 3,154 unique words, and the reference captions have a vocabulary of 22,275 unique words.


Bibtex

@inproceedings{wada2024polos,
  title     = {{Polos: Multimodal Metric Learning from Human Feedback for Image Captioning}},
  author    = {Wada, Yuiga and Kaneda, Kanta and Saito, Daichi and Sugiura, Komei},
  year      = 2024,
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}
}