Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Spider2-V is a benchmark for multimodal agents that aim to automate data science and engineering workflows. It comprises 494 real-world tasks set in authentic computer environments and spanning 20 professional applications. Existing state-of-the-art large language models cannot yet reliably automate full data workflows, reaching a success rate of only around 14%. Agents struggle in particular with tasks that require fine-grained GUI actions or that involve remote cloud-hosted workspaces. Spider2-V provides a realistic simulation environment for evaluating how well multimodal agents execute data-related tasks. By integrating code generation with GUI control, the benchmark aims to help close the gap to automating entire data workflows.