Currently, we are providing the dataset and code as supplementary materials.
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. To tackle this task, we introduce ZINA, a method that identifies hallucinated spans, classifies their error types into six categories, and suggests appropriate refinements. We trained ZINA on 20k synthetic samples generated via a new graph-based approach that captures the dependencies among errors. Moreover, we constructed the VisionHall dataset, which contains approximately 6.9k outputs generated by twelve MLLMs, with hallucinated spans manually annotated. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
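For illustration, the sketch below shows one possible record format for this task: a hallucinated span, an error-type label, and a suggested refinement. The field names, the example sentence, and the category label are hypothetical placeholders, not the actual schema or the six-category taxonomy used by ZINA and VisionHall.

```python
from dataclasses import dataclass

# Hypothetical annotation record for fine-grained hallucination detection
# and editing. Field names and the category label are illustrative only;
# see the paper for the actual six error types and annotation schema.
@dataclass
class HallucinationSpan:
    start: int        # character offset where the hallucinated span begins
    end: int          # character offset where it ends (exclusive)
    error_type: str   # one of six error categories (placeholder label here)
    refinement: str   # suggested replacement text for the span

response = "A man in a red jacket is walking two dogs in the park."
annotation = HallucinationSpan(
    start=11,
    end=21,
    error_type="attribute_error",  # placeholder category name
    refinement="blue jacket",
)

# Applying the suggested refinement yields the edited response.
edited = response[:annotation.start] + annotation.refinement + response[annotation.end:]
print(edited)  # "A man in a blue jacket is walking two dogs in the park."
```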
| Model | Detection F1 | Detection BERT-F1 | Detection CLIP-F1 | Editing CLIP-S | Editing PAC-S |
| --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 0.82 | 0.66 | 0.93 | 64.01 | 72.72 |
| Qwen2-VL-7B | 3.36 | 3.62 | 4.98 | 64.79 | 73.01 |
| LLaVA-OV-Qwen2-7B | 3.39 | 3.39 | 3.39 | 64.06 | 72.40 |
| LLaVA-v1.5-13B | 4.73 | 5.08 | 6.71 | 64.74 | 73.02 |
| LLaVA-NeXT-Qwen-32B | 19.09 | 24.29 | 31.06 | 65.34 | 73.47 |
| Llama-3.2-90B-Vision-Instruct | 16.92 | 14.56 | 17.62 | 65.28 | 73.54 |
| Qwen2.5-VL-72B-Instruct | 21.31 | 18.85 | 23.67 | 64.38 | 72.99 |
| LLaVA-OV-Qwen2-72B | 25.70 | 20.81 | 26.81 | 65.74 | 73.91 |
| GPT-4o (w/o images) | 27.02 | 23.34 | 27.99 | 65.66 | 73.99 |
| GPT-4o | 29.37 | 24.89 | 30.19 | 65.58 | 73.86 |
| ZINA (Ours) | 45.15 (+15.8) | 44.02 (+19.1) | 50.39 (+20.2) | 66.08 (+0.34) | 74.36 (+0.37) |

Values in parentheses indicate ZINA's gain over the best-performing baseline for each metric.
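As a rough illustration of the detection metric, the sketch below computes a span-level F1 under the assumption of exact matching between predicted and gold hallucinated spans; the paper's actual matching criterion and the BERT-F1/CLIP-F1 variants may differ.

```python
def span_f1(predicted, gold):
    """Span-level F1 with exact-match counting.

    `predicted` and `gold` are sets of (start, end) character spans.
    This is an illustrative approximation; the evaluation in the paper
    may use a different matching criterion (e.g., partial overlap).
    """
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two predicted spans, two gold spans, one exact match -> F1 = 0.5
print(span_f1({(11, 21), (30, 38)}, {(11, 21), (45, 52)}))
```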
Coming soon...