Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks
Abstract
A benchmark built on a synthetic, annotated scatterplot dataset shows that proprietary AI models can reliably count clusters (and, in one case, outliers) but perform poorly on localization tasks.
AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into model performance on them. To address this gap for one of the most common chart types, we introduce a synthetic, annotated dataset of over 18,000 scatterplots produced by six data generators and rendered in 17 chart designs, and a benchmark built on it. We evaluate proprietary models from OpenAI and Google using N-shot prompting on five distinct tasks derived from annotations of cluster bounding boxes, their center coordinates, and outlier coordinates. OpenAI models and Gemini 2.5 Flash, especially when prompted with examples, are viable options for counting clusters and, in Flash's case, outliers (90%+ Accuracy). However, the results for localization-related tasks are unsatisfactory: Precision and Recall are near or below 50%, except for Flash in outlier identification (65.01%). Furthermore, chart design appears to have only a secondary impact on performance, although it is advisable to avoid scatterplots with wide aspect ratios (16:9 and 21:9) or randomly colored points. Supplementary materials are available at https://github.com/feedzai/biy-paper.
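As a rough illustration of what such a generator-plus-annotation setup involves, the sketch below builds one annotated scatterplot: Gaussian blobs as clusters, uniform noise as outliers, and per-cluster bounding boxes and centers as ground truth. This is not the paper's pipeline (which is available in the repository linked above); every function and parameter choice here is an illustrative assumption.

```python
# A minimal, illustrative sketch of producing one annotated scatterplot:
# Gaussian blobs as clusters, uniform noise as outliers, and per-cluster
# bounding boxes and centers as ground-truth annotations. The real pipeline
# lives in the paper's repository; names and parameters here are assumptions.
import json

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(42)

# Clustered points: three Gaussian blobs.
points, labels = make_blobs(n_samples=300, centers=3, cluster_std=0.8,
                            random_state=42)

# Outliers: uniform noise slightly beyond the data range.
outliers = rng.uniform(points.min(axis=0) - 2, points.max(axis=0) + 2,
                       size=(4, 2))

annotations = {
    # Axis-aligned bounding box [x_min, y_min, x_max, y_max] per cluster.
    "clusters": [
        np.concatenate([points[labels == k].min(axis=0),
                        points[labels == k].max(axis=0)]).round(2).tolist()
        for k in np.unique(labels)
    ],
    "cluster_centers": [points[labels == k].mean(axis=0).round(2).tolist()
                        for k in np.unique(labels)],
    "outliers": outliers.round(2).tolist(),
}

fig, ax = plt.subplots(figsize=(4, 4))  # 1:1 aspect ratio; the dataset varies this
ax.scatter(points[:, 0], points[:, 1], s=10)
ax.scatter(outliers[:, 0], outliers[:, 1], s=10)
fig.savefig("scatterplot.png", dpi=150)

with open("scatterplot.json", "w") as f:
    json.dump(annotations, f, indent=2)
```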
Community
Contributions
- A synthetic, annotated dataset (and its generation pipeline) for scatterplot-related tasks.
- A comprehensive evaluation of ten proprietary models on these tasks.
- A list of considerations when designing charts and providing them as input to AI models.
Quickstart
- Paper: https://arxiv.org/abs/2510.06071
- Dataset and benchmark: https://github.com/feedzai/biy-paper
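The preview figure at the bottom of this page shows raw model responses ranging from bare counts like `(5)` to JSON objects like `{"cluster_centers": [[278, 747], [768, 250]]}`. A minimal, hypothetical scoring sketch for the counting tasks could look like the one below; the paper's actual parsing and scoring code is in the repository, and the helper names here are assumptions.

```python
# Hypothetical scoring sketch for the counting tasks: a response counts as
# correct when the predicted number of clusters/outliers equals the annotated
# one. Response formats mirror the examples in the preview figure, e.g. "(5)"
# or '{"outliers": [[50, 50], [950, 50]]}'.
import json
import re

def parse_count(response: str) -> int | None:
    """Extract a cluster/outlier count from a raw model response."""
    # JSON responses: count the listed boxes or coordinates.
    try:
        payload = json.loads(response)
        if isinstance(payload, dict):
            for key in ("clusters", "cluster_centers", "outliers"):
                if key in payload:
                    return len(payload[key])
    except json.JSONDecodeError:
        pass
    # Bare numeric responses such as "(5)" or "5".
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None

def accuracy(responses: list[str], ground_truth: list[int]) -> float:
    """Fraction of responses whose parsed count matches the annotation."""
    hits = sum(parse_count(r) == gt for r, gt in zip(responses, ground_truth))
    return hits / len(ground_truth)

print(accuracy(["(5)", '{"outliers": [[50, 50], [950, 50]]}'], [5, 2]))  # 1.0
```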
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information (2025)
- Are LLMs ready to help non-expert users to make charts of official statistics data? (2025)
- Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions (2025)
- Meet Your New Client: Writing Reports for AI - Benchmarking Information Loss in Market Research Deliverables (2025)
- Can we Evaluate RAGs with Synthetic Data? (2025)
- Is this chart lying to me? Automating the detection of misleading visualizations (2025)
- Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs (2025)
![Preview of the new scatterplot dataset and a list of the main features of the dataset and benchmark. The top section shows six of the scatterplots paired with their annotated versions. These versions are layered with cluster bounding boxes, cluster centers, and outliers. From left to right, the first scatterplot is composed of six clusters spread diagonally from the top-left to the bottom-right corner; the second is composed of four scattered, elongated clusters and background noise; the third has one cluster in the top-left corner and four outliers near the top-right and bottom-right corners; the fourth has three clusters that form a triangle; the fifth has a set of exponentially related points; the sixth has two spread out clusters, one on top and one on the bottom, and background noise. The bottom section, on the left, notes that the dataset is a new synthetic dataset, composed of over 18,000 annotated scatterplots and 17 chart designs. In the middle, it lists that the benchmark is composed of 1725 scatterplots, 10 models from OpenAI and Google, 5 tasks, and 3 prompting strategies. On the right, there are several examples of actual model responses next to the question: "What performance?". The examples are: {"clusters": [], "outliers": []}, (5), (38), (2), (100), {"cluster_centers": [[278, 747], [768, 250]]}, (0), (1), {"outliers": [[50, 50], [950, 50], [950, 950], [50, 950]]}, and {"clusters": [[66, 27, 438, 224], [466, 370, 927, 570]]}.](https://cdn-uploads.huggingface.co/production/uploads/67ec4013d674ac4ba71dd264/lYYXsJ24aqoP9dv1jssZQ.png)
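For the localization tasks, Precision and Recall require matching predicted boxes against annotated ones. A common convention, and purely an assumption here rather than the paper's exact criterion, is greedy matching at an IoU threshold of 0.5 over `[x_min, y_min, x_max, y_max]` boxes like those shown in the figure above:

```python
# Sketch of how Precision/Recall might be computed for the cluster
# bounding-box task. Greedy matching at IoU >= 0.5 is a common convention
# and an assumption here, not necessarily the paper's criterion.
def iou(a: list[float], b: list[float]) -> float:
    """IoU of two axis-aligned boxes, assumed [x_min, y_min, x_max, y_max]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def precision_recall(pred, truth, threshold=0.5):
    """Greedily match each prediction to the best unused ground-truth box."""
    matched: set[int] = set()
    tp = 0
    for p in pred:
        best = max(
            (i for i in range(len(truth)) if i not in matched),
            key=lambda i: iou(p, truth[i]),
            default=None,
        )
        if best is not None and iou(p, truth[best]) >= threshold:
            matched.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

pred = [[66, 27, 438, 224], [466, 370, 927, 570]]  # box format from the figure
print(precision_recall(pred, pred))  # (1.0, 1.0) when predictions are perfect
```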