JaSPICE: Automatic Evaluation Metric Using
Predicate-Argument Structures for Image Captioning Models

Yuiga Wada, Kanta Kaneda, Komei Sugiura
Keio University

Image captioning has applications in various fields, including robotics, and its evaluation relies heavily on automatic metrics such as BLEU and METEOR. However, such n-gram-based metrics have been shown to correlate poorly with human evaluation. This has led to alternative metrics such as SPICE for English, but no equivalent metrics have been established for other languages. In this study, we therefore propose JaSPICE, an automatic evaluation metric that evaluates Japanese captions based on scene graphs. The proposed method generates a scene graph from dependencies and the predicate-argument structure, and extends the graph using synonyms. In experiments on STAIR Captions and PFN-PIC, our metric achieved higher correlation coefficients with human evaluation than the baseline metrics (BLEU, ROUGE, METEOR, CIDEr, and SPICE).
The proposed method consists of two main modules: PAS-Based Scene Graph Parser (PAS-SGP) and Graph Analyzer (GA). (i) PAS-SGP generates scene graphs from captions using the predicate-argument structure (PAS) and dependencies. (ii) GA performs a graph extension using synonym relationships and then computes the F1 score by matching tuples extracted from the candidate and the reference scene graphs. JaSPICE is easily interpretable because it outputs the score in the range of [0, 1].
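The scoring step performed by GA can be sketched as follows. This is a minimal illustration and not the JaSPICE implementation: the synonym table is hypothetical, and the synonym-based graph extension is approximated here by mapping every word to one representative of its synonym group before matching. The names `SYNONYMS`, `canonical`, `normalize`, and `tuple_f1` are assumptions made for this sketch.

```python
from typing import Dict, Iterable, Set, Tuple

SceneTuple = Tuple[str, ...]

# Hypothetical synonym table (romanized Japanese); NOT the dictionary JaSPICE uses.
# Each group must list all of its members symmetrically.
SYNONYMS: Dict[str, Set[str]] = {
    "otokonoko": {"shonen"},
    "shonen": {"otokonoko"},
}


def canonical(word: str) -> str:
    """Pick one representative (alphabetically first) of the word's synonym group."""
    return min({word} | SYNONYMS.get(word, set()))


def normalize(tuples: Iterable[SceneTuple]) -> Set[SceneTuple]:
    """Replace every word in every tuple by its canonical representative."""
    return {tuple(canonical(w) for w in t) for t in tuples}


def tuple_f1(candidate: Iterable[SceneTuple], reference: Iterable[SceneTuple]) -> float:
    """F1 score over exact matches between candidate and reference tuples."""
    cand = normalize(candidate)
    ref = normalize(reference)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy tuples: objects are 1-tuples, relationships are 3-tuples.
candidate = [("shonen",), ("suketobodo",), ("shonen", "noru", "suketobodo")]
reference = [
    ("otokonoko",),
    ("suketobodo",),
    ("herumetto",),
    ("otokonoko", "noru", "suketobodo"),
]
print(round(tuple_f1(candidate, reference), 3))  # 0.857 (precision 3/3, recall 3/4)
```

In this toy case "shonen" and "otokonoko" are treated as synonyms, so all three candidate tuples match and only the missing helmet lowers recall. The score stays within [0, 1] by construction.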

Fig. 1. Example of an image and corresponding scene graph. The pink, green, and light blue nodes represent objects, attributes, and relationships, respectively, and the arrows represent dependencies.

The caption is “hitodōri no sukunaku natta dōro de, aoi zubon o kita otokonoko ga orenji-iro no herumetto o kaburi, sukētobōdo ni notte iru.”
(“On a deserted street, a boy in blue pants and an orange helmet rides a skateboard.”)
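A scene graph with objects, attributes, and relationships, like the one in Fig. 1, can be held in a small data structure and flattened into tuples for matching. The sketch below is an assumption-laden illustration: the `SceneGraph` class and `to_tuples` function are hypothetical names, the graph is written by hand from the English gloss of the caption (it is not output of the parser), and the 1-, 2-, and 3-tuple layout follows the convention used by SPICE.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container for a caption's scene graph."""
    objects: List[str] = field(default_factory=list)
    attributes: List[Tuple[str, str]] = field(default_factory=list)      # (object, attribute)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)


def to_tuples(graph: SceneGraph) -> Set[Tuple[str, ...]]:
    """Flatten a scene graph into object, attribute, and relationship tuples."""
    tuples: Set[Tuple[str, ...]] = {(obj,) for obj in graph.objects}
    tuples.update(graph.attributes)
    tuples.update(graph.relations)
    return tuples


# Hand-written graph for the example caption (English glosses, for readability only).
fig1_graph = SceneGraph(
    objects=["boy", "pants", "helmet", "skateboard", "street"],
    attributes=[("pants", "blue"), ("helmet", "orange"), ("street", "deserted")],
    relations=[
        ("boy", "wear", "pants"),
        ("boy", "wear", "helmet"),
        ("boy", "ride", "skateboard"),
    ],
)

for t in sorted(to_tuples(fig1_graph)):
    print(t)
```

Keeping the tuples in a set makes the later comparison between a candidate and a reference graph a plain set intersection.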

Results Overview

Table Ⅰ. Correlation coefficients between each automatic evaluation metric and the human evaluation for STAIR Captions [Yoshikawa+, ACL17].

Metric    Pearson  Spearman  Kendall
BLEU      0.296    0.343     0.260
ROUGE     0.366    0.340     0.258
METEOR    0.345    0.366     0.279
CIDEr     0.312    0.355     0.269
JaSPICE   0.501    0.529     0.413

Table Ⅱ. Correlation coefficients between each automatic evaluation metric and the human evaluation for PFN-PIC [Hatori+, ICRA18].

Metric    Pearson  Spearman  Kendall
BLEU      0.484    0.466     0.352
ROUGE     0.500    0.474     0.365
METEOR    0.423    0.457     0.352
CIDEr     0.416    0.462     0.353
JaSPICE   0.572    0.587     0.452
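For readers who want to reproduce this kind of comparison on their own data, the three coefficients reported in Tables Ⅰ and Ⅱ can be computed as sketched below. This is a dependency-free illustration under stated assumptions: the function names are hypothetical, the Kendall coefficient is taken to be the tau-b variant (the source does not say which variant was used), and the scores in the usage lines are made up for illustration and are not data from the experiments.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b, computed over all pairs in O(n^2) time."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (concordant - discordant) / denom


# Made-up metric scores and human ratings for five captions.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.60]
human_ratings = [1, 3, 2, 5, 4]
print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
print(kendall_tau_b(metric_scores, human_ratings))
```

Pearson measures linear agreement on the raw scores, while Spearman and Kendall depend only on how the two score lists order the captions, which is why all three are usually reported together.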

Table Ⅲ. Results of the ablation study

Condition Parser Graph
Pearson Spearman Kendall M
(i) UD 0.398 0.390 0.309 1465
(ii) UD 0.399 0.390 0.309 1430
(iii) JaSGP 0.493 0.524 0.410 1417
Our Metric JaSGP 0.501 0.529 0.413 1346
The code will be available upon acceptance of this paper.

1. Download the repository and build the Docker image.

git clone git@github.com:keio-smilab23/JaSPICE.git
cd JaSPICE
pip install -e .
docker build -t jaspice .
docker run -d -p 2115:2115 jaspice

2. Add the following code (the interface is similar to pycocoevalcap).

from jaspice.api import JaSPICE

batch_size = 16
jaspice = JaSPICE(batch_size, server_mode=True)
# references and candidates are the caption collections to compare, prepared by the caller
_, score = jaspice.compute_score(references, candidates)

[Magassouba+, CoRL19] A. Magassouba et al., “Multimodal Attention Branch Network for Perspective-Free Sentence Generation,” in CoRL, 2019, pp. 76–85.
[Ogura+, RAL20] T. Ogura et al., “Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder Network,” IEEE RAL, vol. 5, no. 4, pp. 5945–5952, 2020.
[Kambara+, IROS21] M. Kambara and K. Sugiura, “Case Relation Transformer: A Cross-modal Language Generation Model for Fetching Instructions,” in IROS, 2021.
[Yoshikawa+, ACL17] Y. Yoshikawa et al., “STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset,” in ACL, 2017, pp. 417–421.
[Hatori+, ICRA18] J. Hatori et al., “Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions,” in ICRA, 2018, pp. 3774–3781.