MVTamperBench: Evaluating Robustness of Vision-Language Models
Abstract
Recent advancements in Vision-Language Models (VLMs) have enabled significant progress in complex video understanding tasks. However, their robustness to real-world manipulations remains underexplored, limiting their reliability in critical applications. To address this gap, we introduce MVTamperBench, a comprehensive benchmark designed to evaluate VLMs' resilience to video tampering effects, including rotation, dropping, masking, substitution, and repetition. By systematically assessing state-of-the-art models, MVTamperBench reveals substantial variability in robustness, with models like InternVL2-8B achieving high performance, while others, such as Llama-VILA1.5-8B, exhibit severe vulnerabilities. To foster broader adoption and reproducibility, MVTamperBench is integrated into VLMEvalKit, a modular evaluation toolkit, enabling streamlined testing and facilitating advancements in model robustness. Our benchmark represents a critical step towards developing tamper-resilient VLMs, ensuring their dependability in real-world scenarios. Project Page: https://amitbcp.github.io/MVTamperBench/
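The five tampering effects named in the abstract can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function names, the representation of a video as a list of grayscale frames, the choice of a 180-degree rotation, the use of black frames for masking, and the `donor` clip used for substitution are all assumptions made for illustration. Each function tampers with the frame range `[start, end)` and returns a new video.

```python
from typing import List

Frame = List[List[int]]  # hypothetical: one grayscale frame as rows of pixel values
Video = List[Frame]      # hypothetical: a video as an ordered list of frames


def rotate(video: Video, start: int, end: int) -> Video:
    """Rotate frames in [start, end) by 180 degrees (assumed angle)."""
    rotated = [[row[::-1] for row in frame[::-1]] for frame in video[start:end]]
    return video[:start] + rotated + video[end:]


def drop(video: Video, start: int, end: int) -> Video:
    """Remove frames in [start, end), shortening the video."""
    return video[:start] + video[end:]


def mask(video: Video, start: int, end: int) -> Video:
    """Replace frames in [start, end) with all-zero (black) frames."""
    black = [[[0] * len(row) for row in frame] for frame in video[start:end]]
    return video[:start] + black + video[end:]


def substitute(video: Video, start: int, end: int, donor: Video) -> Video:
    """Replace frames in [start, end) with the first frames of another clip."""
    return video[:start] + donor[: end - start] + video[end:]


def repeat(video: Video, start: int, end: int) -> Video:
    """Play frames in [start, end) twice in a row, lengthening the video."""
    return video[:end] + video[start:end] + video[end:]


if __name__ == "__main__":
    # Six tiny 2x2 frames whose pixel values encode the frame index.
    video = [[[t, t + 1], [t + 2, t + 3]] for t in range(0, 60, 10)]
    donor = [[[9, 9], [9, 9]] for _ in range(2)]
    for name, tampered in [
        ("rotate", rotate(video, 2, 4)),
        ("drop", drop(video, 2, 4)),
        ("mask", mask(video, 2, 4)),
        ("substitute", substitute(video, 2, 4, donor)),
        ("repeat", repeat(video, 2, 4)),
    ]:
        print(name, len(tampered))
```

Dropping and repetition change the video length, while rotation, masking, and substitution keep the frame count and alter only the content of the selected segment.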
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.19794
- Bibcode: 2024arXiv241219794A
- Keywords: Computer Science - Computer Vision and Pattern Recognition; 68T37; 68T05; 68Q32; 68T45; 94A08; 68T40; 68Q85; I.2.10; I.2.7; I.5.4; I.4.9; I.4.8; H.5.1