Zina: Multimodal Fine-grained Hallucination Detection and Editing

Under Review

The dataset and code are currently provided as supplementary materials.

Abstract

Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting them at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. To tackle this task, we introduce ZINA, a method that identifies hallucinated spans, classifies their error types into six categories, and suggests appropriate refinements. We trained ZINA on 20k synthetic samples generated via a new graph-based approach that captures the dependencies among errors. Moreover, we constructed the VisionHall dataset, which contains approximately 6.9k outputs generated by twelve MLLMs, with hallucinated spans manually annotated. We demonstrated that ZINA outperforms existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.


Overview of the proposed task. In contrast to conventional tasks, the model is expected to detect hallucinated spans, classify their types based on a taxonomy, and suggest appropriate refinements.
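
For illustration, the minimal sketch below shows one plausible per-span output record for this task (character offsets, an error-type label from the six-way taxonomy, and a suggested refinement), plus a small helper that applies the refinements. The field names, the example error label, and the helper are assumptions for illustration, not ZINA's actual schema.

    # Minimal sketch of a per-span hallucination record; field names and the
    # example "object" label are illustrative assumptions, not ZINA's schema.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class HallucinatedSpan:
        start: int        # character offset where the hallucinated span begins
        end: int          # exclusive end offset
        error_type: str   # one of the taxonomy's six categories (label name assumed)
        refinement: str   # suggested replacement text for the span


    def apply_refinements(text: str, spans: List[HallucinatedSpan]) -> str:
        """Return the edited text with each hallucinated span replaced by its refinement."""
        # Apply edits from right to left so earlier offsets stay valid.
        for span in sorted(spans, key=lambda s: s.start, reverse=True):
            text = text[:span.start] + span.refinement + text[span.end:]
        return text


    # Toy usage: one detected span with a hypothetical "object" error label.
    caption = "A man in a red jacket walks two dogs."
    spans = [HallucinatedSpan(start=11, end=21, error_type="object", refinement="blue jacket")]
    print(apply_refinements(caption, spans))  # A man in a blue jacket walks two dogs.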


Synthetic Training Data Curation


Overview of the graph-based synthetic data generation process. We first obtain seed descriptions by leveraging various MLLMs. The Error Insertion module then injects errors while accounting for dependencies among error spans, and the Graph-based Augmentation module constructs a DAG and prunes it to generate diverse training samples.
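
As a rough illustration of the augmentation step, the sketch below treats each inserted error as a node in a DAG whose edges encode dependencies among error spans, and derives multiple training samples by randomly pruning the graph while keeping it dependency-closed (an error is kept only if every error it depends on is kept). The pruning rule and data layout are assumptions for illustration, not ZINA's exact procedure.

    # Illustrative sketch: sample dependency-closed subsets of inserted errors from
    # a DAG; each subset yields one augmented training sample. The pruning rule
    # below is an assumption, not ZINA's exact procedure.
    import random
    from typing import Dict, List, Set


    def sample_pruned_errors(parents: Dict[str, List[str]], keep_prob: float, rng: random.Random) -> Set[str]:
        """Propose keeping each error independently, then drop any error whose dependencies were dropped."""
        kept = {node for node in parents if rng.random() < keep_prob}
        changed = True
        while changed:  # the kept set only shrinks, so this fixpoint loop terminates
            changed = False
            for node in list(kept):
                if any(parent not in kept for parent in parents[node]):
                    kept.discard(node)
                    changed = True
        return kept


    # Toy dependency graph: error e2 only makes sense if error e1 is also present.
    parents = {"e1": [], "e2": ["e1"], "e3": []}
    rng = random.Random(0)
    for _ in range(3):
        print(sample_pruned_errors(parents, keep_prob=0.6, rng=rng))  # one training variant per draw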

Quantitative Comparison

Quantitative comparison with baseline methods. Detection is evaluated with F1, BERT-F1, and CLIP-F1, and editing with CLIP-S and PAC-S; Zina (Ours) achieves the best score in every column.

Model                           |         Detection            |      Editing
                                |   F1       BERT-F1  CLIP-F1  |   CLIP-S   PAC-S
--------------------------------+------------------------------+------------------
LLaVA-1.5-7B                    |    0.82      0.66     0.93   |   64.01    72.72
Qwen2-VL-7B                     |    3.36      3.62     4.98   |   64.79    73.01
LLaVA-OV-Qwen2-7B               |    3.39      3.39     3.39   |   64.06    72.40
LLaVA-v1.5-13B                  |    4.73      5.08     6.71   |   64.74    73.02
LLaVA-NeXT-Qwen-32B             |   19.09     24.29    31.06   |   65.34    73.47
Llama-3.2-90B-Vision-Instruct   |   16.92     14.56    17.62   |   65.28    73.54
Qwen2.5-VL-72B-Instruct         |   21.31     18.85    23.67   |   64.38    72.99
LLaVA-OV-Qwen2-72B              |   25.70     20.81    26.81   |   65.74    73.91
GPT-4o (w/o images)             |   27.02     23.34    27.99   |   65.66    73.99
GPT-4o                          |   29.37     24.89    30.19   |   65.58    73.86
Zina (Ours)                     |   45.15     44.02    50.39   |   66.08    74.36
                                |  (+15.8)   (+19.1)  (+20.2)  |  (+0.34)  (+0.37)
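
For context on the detection columns, the sketch below computes a span-level F1 under an exact-match assumption; the matching criterion actually used (e.g., partial overlap or type-aware matching) may differ.

    # Minimal sketch of exact-match span F1 for the detection setting; the paper's
    # matching criterion may differ from this simple exact-match rule.
    from typing import List, Tuple

    Span = Tuple[int, int]  # (start, end) character offsets of a hallucinated span


    def span_f1(predicted: List[Span], gold: List[Span]) -> float:
        """F1 over predicted vs. gold spans, counting only exact matches."""
        if not predicted or not gold:
            return 0.0
        matched = len(set(predicted) & set(gold))
        if matched == 0:
            return 0.0
        precision = matched / len(predicted)
        recall = matched / len(gold)
        return 2 * precision * recall / (precision + recall)


    print(span_f1(predicted=[(11, 21), (30, 36)], gold=[(11, 21)]))  # 0.666...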

BibTeX

Coming soon...