Currently, we are providing the dataset and code as supplementary materials.
Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. To tackle this task, we introduce ZINA, a method that identifies hallucinated spans, classifies their error types into six categories, and suggests appropriate refinements. We trained ZINA on 20k synthetic samples generated via a new graph-based approach that captures the dependencies among errors. Moreover, we constructed the VisionHall dataset, which contains approximately 6.9k outputs generated by twelve MLLMs, with hallucinated spans manually annotated. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
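For illustration, the sketch below shows one possible record format for this task: a hallucinated span, an error-type label, and a suggested refinement. The field names, the example sentence, and the category label are hypothetical placeholders, not the actual schema or the six-category taxonomy used by ZINA and VisionHall.

```python
from dataclasses import dataclass

# Hypothetical annotation record for fine-grained hallucination detection
# and editing. Field names and the category label are illustrative only;
# see the paper for the actual six error types and annotation schema.
@dataclass
class HallucinationSpan:
    start: int        # character offset where the hallucinated span begins
    end: int          # character offset where it ends (exclusive)
    error_type: str   # one of six error categories (placeholder label here)
    refinement: str   # suggested replacement text for the span

response = "A man in a red jacket is walking two dogs in the park."
annotation = HallucinationSpan(
    start=11,
    end=21,
    error_type="attribute_error",  # placeholder category name
    refinement="blue jacket",
)

# Applying the suggested refinement yields the edited response.
edited = response[:annotation.start] + annotation.refinement + response[annotation.end:]
print(edited)  # "A man in a blue jacket is walking two dogs in the park."
```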
| Model | Detection F1 | Detection BERT-F1 | Detection CLIP-F1 | Editing CLIP-S | Editing PAC-S |
| --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 0.82 | 0.66 | 0.93 | 64.01 | 72.72 |
| Qwen2-VL-7B | 3.36 | 3.62 | 4.98 | 64.79 | 73.01 |
| LLaVA-OV-Qwen2-7B | 3.39 | 3.39 | 3.39 | 64.06 | 72.40 |
| LLaVA-v1.5-13B | 4.73 | 5.08 | 6.71 | 64.74 | 73.02 |
| LLaVA-NeXT-Qwen-32B | 19.09 | 24.29 | 31.06 | 65.34 | 73.47 |
| Llama-3.2-90B-Vision-Instruct | 16.92 | 14.56 | 17.62 | 65.28 | 73.54 |
| Qwen2.5-VL-72B-Instruct | 21.31 | 18.85 | 23.67 | 64.38 | 72.99 |
| LLaVA-OV-Qwen2-72B | 25.70 | 20.81 | 26.81 | 65.74 | 73.91 |
| GPT-4o (w/o images) | 27.02 | 23.34 | 27.99 | 65.66 | 73.99 |
| GPT-4o | 29.37 | 24.89 | 30.19 | 65.58 | 73.86 |
| ZINA (Ours) | 45.15 (+15.8) | 44.02 (+19.1) | 50.39 (+20.2) | 66.08 (+0.34) | 74.36 (+0.37) |

Values in parentheses indicate ZINA's gain over the best-performing baseline for each metric.
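As a rough illustration of the detection metric, the sketch below computes a span-level F1 under the assumption of exact matching between predicted and gold hallucinated spans; the paper's actual matching criterion and the BERT-F1/CLIP-F1 variants may differ.

```python
def span_f1(predicted, gold):
    """Span-level F1 with exact-match counting.

    `predicted` and `gold` are sets of (start, end) character spans.
    This is an illustrative approximation; the evaluation in the paper
    may use a different matching criterion (e.g., partial overlap).
    """
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two predicted spans, two gold spans, one exact match -> F1 = 0.5
print(span_f1({(11, 21), (30, 38)}, {(11, 21), (45, 52)}))
```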
Coming soon...