🛠 ML-Agents Tips & Lessons Learned (AutoMind + MLE-Bench)
Over the past few months, we’ve been experimenting with AutoMind on MLE-Bench, and in this post we’ll share some of the lessons we’ve learned along the way — from practical workflow tips and model-specific tuning tricks, to the pitfalls we hit and how we solved them. We’ll also take a step back to reflect on what these experiments reveal about where intelligent agents might be heading next.
For those unfamiliar, MLE-Bench (Machine Learning Engineer Benchmark) is an open benchmarking platform built to evaluate how well large language model (LLM) agents handle real engineering tasks — things like coding, debugging, and data analysis. It’s designed to go beyond synthetic prompts and test a model’s actual ability to execute and reason in complex, realistic workflows.
0. The Gauntlet: Why We Pitched an Agent Against Kaggle Grandmasters 🎭
The era of AI as a mere chatbot is over. We’ve entered the innovation age, where the grand challenge is to automate complex, end-to-end workflows. At the forefront of this push is data science: a messy, intricate craft that has long resisted simple automation.
Most recent progress in this area comes from context engineering, retrieval-augmented reasoning, and search-based planning, which temporarily patch the weakness of the underlying models by injecting external structure and feedback signals. But these are stopgaps, not solutions—the deeper issue lies in the lack of dense, grounded reward signals and limited self-consistency within base models. In the long run, true breakthroughs may require improving the base model’s intrinsic reasoning and reinforcement dynamics, not just layering smarter scaffolding on top.
To build an AI that can truly master data science, we need a ruthless, realistic proving ground.
Challenge: Handling Diverse, Real-World Workflows
The task we set for data-science agents is deceptively simple: given a raw dataset as input, automatically produce a high-quality model and submission file as output. In practice, this means not just training a model, but also handling the messy realities of preprocessing, feature engineering, evaluation, and submission formatting under strict resource and time constraints.
Open-Source Agent Frameworks
- MLAB: an agent framework for ML benchmarks with multi-step planning and tool use for end-to-end pipelines.
- AIDE: an LLM-based data-science agent that decomposes tasks and iteratively executes code with validation.
- OpenHands: an open-source computer-use agent platform (browser/terminal tools) for long-horizon tasks.
- RD-agent: an agent that integrates retrieval and reflection to guide decisions during iterative code generation.
AutoMind: An Adaptive Agent for Automated Data Science
Here we report a series of practical lessons learned over the past few months of using AutoMind on the MLE-Bench benchmark. AutoMind is a knowledge-augmented data science agent system that enhances large language model agents’ data science capabilities by integrating an expert knowledge base, tree search, and adaptive coding strategies.
(The following results were validated on the AutoMind Lite subset, which includes 15 tasks. Experiments were conducted on Tesla V100 32 GB GPUs. Note that the runs are computationally and token intensive, with relatively high variance. You can check the detailed logs on Google Drive.)
1. Task Context 📝
The following MLE-Bench tasks were targeted:
Competition | Difficulty | Category | Dataset Size (GB) |
---|---|---|---|
aptos2019-blindness-detection | Easy | Image Classification | 10.22 |
random-acts-of-pizza | Easy | Text Classification | 0.003 |
spooky-author-identification | Easy | Text Classification | 0.0019 |
google-quest-challenge | Easy | Training LLMs | 0.015 |
stanford-covid-vaccine | Easy | Tabular | 2.68 |
predict-volcanic-eruptions-ingv-oe | Easy | Signal Processing | 31.25 |
lmsys-chatbot-arena | Medium | Text Classification | 0.18 |
us-patent-phrase-to-phrase-matching | Medium | Text Regression | 0.00214 |
mlsp-2013-birds | Medium | Audio Classification | 0.5851 |
statoil-iceberg-classifier-challenge | Medium | Image Classification | 0.3021 |
tensorflow-speech-recognition-challenge | Medium | Audio Classification | 3.76 |
denoising-dirty-documents | Hard | Image to Image | 0.06 |
new-york-city-taxi-fare-prediction | Hard | Tabular | 5.7 |
tgs-salt-identification-challenge | Hard | Image Segmentation | 0.5 |
ventilator-pressure-prediction | Hard | Forecasting | 0.7 |
This subset covers three difficulty levels from 'Easy' to 'Hard' and encompasses multiple data modalities, including image, text, tabular, signal processing, and audio. More importantly, these tasks vary significantly in their computational resource demands, dataset sizes, and potential pitfalls. Ranging from text classification tasks of a few megabytes to a signal processing challenge exceeding 30 gigabytes, this diverse set provides an excellent proving ground for validating the agent's self-adaptive strategies.
2. Experiment Result 🔬
(What you need to know before reading the table.)
- Base model: We used DeepSeek V3 as the default base LLM for all AutoMind runs.
- Beat Ratio: Instead of the coarse “medal” metric, we report Beat Ratio = (number of competitors your submission outranks) / (total number of competitors on the private LB).
It is a fine-grained, task-agnostic gauge of how much we closed the human gap.
- Best-of-3 reporting: MLE-Bench agents are notoriously high-variance; e.g. the same prompt can yield 0.29 → 1.00 Beat Ratio on stanford-covid-vaccine. We therefore ran 3 seeds per task and present the best score (the metric that matters in practice).
Competition | Metric (Best of 3) | Beat Ratio |
---|---|---|
aptos2019-blindness-detection | 0.8893 | 0.50 |
random-acts-of-pizza | 0.63993 | 0.61 |
spooky-author-identification | 0.40524 | 0.54 |
google-quest-challenge | 0.39775 | 0.95 |
stanford-covid-vaccine | 0.24297 | 1.0 |
predict-volcanic-eruptions-ingv-oe | 3558531.0 | 1.0 |
lmsys-chatbot-arena | 1.03812 | 0.56 |
us-patent-phrase-to-phrase-matching | 0.8403 | 0.34 |
mlsp-2013-birds | 0.90998 | 0.81 |
statoil-iceberg-classifier-challenge | 0.20295 | 0.50 |
tensorflow-speech-recognition-challenge | 0.35331 | 0.21 |
denoising-dirty-documents | 0.00791 | 1.0 |
new-york-city-taxi-fare-prediction | 5.29154 | 0.20 |
tgs-salt-identification-challenge | 0.7774 | 0.35 |
ventilator-pressure-prediction | 0.83103 | 0.22 |
📂 Download Experiment Results (logs & submissions)
3. Runtime 📦
You can find the full datasets (data), experiment logs, solutions, and intermediate results in the complete runtime package, available from the header link or directly here: https://drive.google.com/drive/folders/1pyZXWPYR262NIXCrzD2NWpJHbdgiLRFR?usp=drive_link
Automind_runtime contains (approximate sizes):
data (~33G): Full datasets required to run the tasks, equivalent to the output of running mlebench prepare --automind.
runs (~179G): Complete runtimes for AutoMind and variants/baselines:
- Automind_v3 (the original AutoMind with DeepSeek-V3 as the base model)
- Automind_o3_mini (AutoMind with OpenAI's o3-mini as the base model)
- Automind_wo_knowledge (AutoMind without the knowledge module, with DeepSeek-V3 as the base model)
- Aide_v3 (AIDE baseline with DeepSeek-V3 as the base model)
- Aide_o3_mini (AIDE baseline with OpenAI's o3-mini as the base model)
Each includes logs, submissions, solution code, and time-stamped intermediate results.
traj (~197M): For each runtime, the extracted trajectory of iterations from the root node to the best-performing node.
Overall size of the complete runtime package: ~212G.
4. Confirmed Findings ✅
Below are the battle-tested insights we repeatedly observed and verified during 48 GPU-days of AutoMind runs on the AutoMind Lite split. Each bullet is framed for immediate plug-and-play adoption in any similar agentic workflow.
4.1 General Workflow Tricks
- Never hard-code evaluation metrics
  LLMs may hallucinate performance metrics without running validation. For instance, we observed a "perfect" metric being fabricated for the `tgs-salt-identification-challenge`, which caused the tree-search algorithm to lock onto a suboptimal, hallucinated solution and stall further exploration. It is therefore crucial to enforce that metrics are always computed from actual validation results.
- Disable progress bars
  Utilities like `tqdm` can flood `stdout` with thousands of lines of refresh characters, consuming valuable context window space. This verbose output can degrade the LLM's performance in subsequent steps. To mitigate this, keep terminal output clean by disabling such progress bars (see the sketch after this list).
- Avoid printing entire model architectures
  By default, some implementations print the full model architecture, creating redundant context. This behavior should be disabled, with architectural details surfaced only on demand during specific debugging phases.
- Debug data issues at the source
  LLMs can overlook critical details in long contexts. When encountering data-related errors, we observed that the model might ignore existing data analysis information and instead hallucinate a cause. To prevent this, debugging should always begin by referencing the original data analysis to understand data formats, thereby avoiding erroneous assumptions.
- Enumerate data paths explicitly
  To prevent the agent from hallucinating non-existent file paths, explicitly enumerate all available data files, for instance by using `os.walk` to scan the input directory.
- Leverage `sample_submission.csv`
  Generating submission files from scratch risks introducing formatting errors by the LLM. A more robust approach is to use the provided `sample_submission.csv` as a template, ensuring the final output adheres to the required structure.
- Merge expensive steps
  AutoMind's adaptive coding strategy decomposes tasks into sequential steps, and model training is typically the most resource-intensive one. If further operations follow training, the entire training process must be re-executed during each iteration, leading to significant time costs. To optimize this, merge model training and any downstream operations into a single, final step so that training is not re-run during incremental code development.
- Merge long step chains
  We found that excessively long chains of decomposed steps can degrade the performance of AutoMind's adaptive coding, increasing the likelihood of generating buggy code. As a practical guideline, if a task decomposition results in more than six steps, consolidate them to maintain stability and performance.
- Proactively implement anti-overfitting measures
  An agent focused solely on maximizing validation metrics can easily produce a model that is perfectly tuned to the training data but fails on the unseen test set. To ensure robust generalization, explicitly instruct the agent to incorporate a standard toolkit of anti-overfitting techniques, such as data augmentation, early stopping, regularization, and dropout, as part of its core modeling strategy (a minimal early-stopping helper is sketched after this list).
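To make a few of these tips concrete, here is a minimal Python sketch, not AutoMind's actual implementation, of how an agent-generated script can silence progress bars, enumerate the available data files, and build the submission from the provided template. The `./input` and `./working` paths are illustrative placeholders.

```python
import os

import pandas as pd
from tqdm import tqdm

INPUT_DIR = "./input"     # illustrative input directory
WORK_DIR = "./working"    # illustrative output directory
os.makedirs(WORK_DIR, exist_ok=True)

# Disable progress bars: pass disable=True to any tqdm call so stdout stays
# small enough to fit into the agent's context window (recent tqdm versions
# also honor the TQDM_DISABLE environment variable).
for _ in tqdm(range(10), disable=True):
    pass

# Enumerate data paths explicitly so the agent never has to guess file names.
available_files = [
    os.path.join(root, name)
    for root, _, files in os.walk(INPUT_DIR)
    for name in files
]
print(f"Found {len(available_files)} files under {INPUT_DIR}")

# Use sample_submission.csv as a template so the output format is always valid.
sample_path = os.path.join(INPUT_DIR, "sample_submission.csv")
if os.path.exists(sample_path):
    submission = pd.read_csv(sample_path)
    # submission[submission.columns[1]] = predictions  # fill in real predictions here
    submission.to_csv(os.path.join(WORK_DIR, "submission.csv"), index=False)
```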
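For the anti-overfitting point, below is a small, framework-agnostic early-stopping helper of the kind we prompt the agent to include; the class name, patience value, and placeholder validation loss are illustrative rather than part of AutoMind.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        """Record a new validation loss and return True when training should stop (lower is better)."""
        if self.best is None or val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage inside a generic training loop:
stopper = EarlyStopping(patience=3)
for epoch in range(100):
    val_loss = max(0.2, 1.0 / (epoch + 1))  # placeholder for a real validation pass
    if stopper.should_stop(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
```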
4.2 Model-Specific Tricks (DeepSeek-V3)
While the general principles in the previous subsection apply broadly, successful implementation of AutoMind also requires tailoring interactions to the specific base model used. The following findings are specific to optimizing prompts and workflows for the DeepSeek-V3 agent.
- Do not wrap generated code
  In one version, we observed a severe degradation in the agent's coding performance. Our analysis revealed that the issue was caused by requiring the LLM to generate code in a `<think> … </think>` + `<code> … </code>` format during coding. We found that directly outputting raw code yields better results.
- Prefer GPU execution
  By default, the LLM often generates code for CPU execution, leading to significant performance bottlenecks and underutilization of available hardware. To prevent this, prompts must explicitly instruct the agent to generate GPU-accelerated algorithms whenever applicable, ensuring computational resources are used efficiently.
- Prefer precise debugging over full rewrites
  Rather than regenerating whole files, have the agent emit targeted search/replace diffs (a sketch of how such diffs can be applied follows this list):

  ```
  # Exact search/replace diff format
  <<<<<<< SEARCH
  # exact code to replace (must match exactly)
  =======
  # new code
  >>>>>>> REPLACE
  ```

  - Match code exactly, including whitespace.
  - Explain the reasoning for each change.
  - Multiple diff blocks are allowed per fix.
- Prioritize modern architectures and techniques
  Left to its own devices, an LLM might propose well-known but outdated methods, such as using traditional CNNs or LSTMs for tasks where transformers now dominate. To ensure a competitive edge, explicitly instruct the agent to leverage state-of-the-art models and feature engineering techniques. This pushes the agent beyond "textbook" solutions and towards methods that reflect current best practices.
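To make the diff-based debugging loop concrete, here is a simplified sketch of how such SEARCH/REPLACE blocks can be applied to a source string with plain string operations; `apply_search_replace` is an illustrative helper, not AutoMind's actual diff engine.

```python
import re

DIFF_BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_search_replace(source: str, diff: str) -> str:
    """Apply every SEARCH/REPLACE block in `diff` to `source`.

    A non-matching SEARCH section raises ValueError on purpose:
    it means the agent hallucinated the current file contents.
    """
    for search, replace in DIFF_BLOCK.findall(diff):
        if search not in source:
            raise ValueError(f"SEARCH block not found:\n{search}")
        source = source.replace(search, replace, 1)
    return source


# Example: fix a buggy learning-rate line in generated code.
code = "lr = 1\nmodel.fit(X, y)\n"
diff = """<<<<<<< SEARCH
lr = 1
=======
lr = 1e-3
>>>>>>> REPLACE"""
print(apply_search_replace(code, diff))
```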
4.3 Pitfalls
Beyond outright code failures, several strategic traps can silently undermine an agent's performance.
- Hyperparameter search can hurt
  Simply instructing the LLM to perform hyperparameter optimization (HPO) typically leads to severe overfitting to the validation data, which can degrade test-set performance.
- Handle training timeouts
  Some runs hit Kaggle’s 9-hour limit, especially during heavy training stages (see the time-budget sketch after this list).
  - Training-stage timeouts are common in: `aptos2019-blindness-detection`, `ventilator-pressure-prediction`
  - Feature-engineering-stage timeouts are common in: `predict-volcanic-eruptions-ingv-oe`
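One way to make the timeout risk manageable, sketched below with illustrative numbers rather than AutoMind's exact logic, is to give the training loop an explicit wall-clock budget well under the 9-hour limit so inference and submission writing always get to run:

```python
import time

TOTAL_BUDGET_SEC = 9 * 3600       # Kaggle-style hard limit
SAFETY_MARGIN_SEC = 45 * 60       # reserve time for inference + submission (illustrative)
start = time.time()

def out_of_budget() -> bool:
    """True once only the reserved safety margin remains."""
    return time.time() - start > TOTAL_BUDGET_SEC - SAFETY_MARGIN_SEC

for epoch in range(1000):
    # train_one_epoch(model, loader)  # placeholder for the real training step
    if out_of_budget():
        print(f"Stopping training at epoch {epoch} to leave time for inference.")
        break

# run_inference_and_write_submission(model)  # always reached before the hard limit
```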
5. Hypothesized Findings 🔍
The items below are educated guesses based on observations but not yet fully proven.
- Randomness may stem from search dynamics
  In `stanford-covid-vaccine`, we observed `beat_ratio` values ranging from 0.29 to 1.0 with near-identical setups.
  Hypothesis:
  - The LLM knows a reasonable range of correct approaches but not the exact best algorithm/settings.
  - Greedy search keeps improving whichever node already has results, regardless of absolute quality.
  - This yields imbalanced exploration vs. exploitation (drilling down one “good enough” path without broader search), sometimes converging to sub-optimal solutions.
- Condense excessive CLI output
  Algorithms often flood the console with redundant logs (progress bars, full model summaries). One possible solution is to summarize these long traces with an LLM to save context space and surface only the essentials.
- Save submissions every few epochs
  If training times out (the code is bug-free and training starts, but the run never reaches inference), save a checkpoint and run inference for a submission every few epochs (see the sketch after this list).
- Use a curriculum approach for code generation
  We have observed that directly generating code for a complex task is challenging, often failing to produce bug-free results on the first attempt, with subsequent debugging efforts also proving unsuccessful. A more promising strategy is to adopt a curriculum-based approach: first generate code for a simpler version of the task, then iteratively refine this initial code to handle the requirements of the more complex task. This incremental approach appears to increase the overall success rate of code generation, much like how a curriculum of progressively more challenging environments enables gradual learning.
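Here is a hedged sketch of the "save submissions every few epochs" idea; `train_one_epoch`, `predict_test`, `sample_submission`, and the paths are hypothetical placeholders, not AutoMind internals.

```python
import os

import pandas as pd

WORK_DIR = "./working"   # illustrative output directory
SAVE_EVERY = 5           # write an intermediate submission every 5 epochs (illustrative)

def save_intermediate_submission(test_preds, sample_submission: pd.DataFrame) -> None:
    """Overwrite submission.csv with the latest predictions so a timeout still leaves a valid file."""
    submission = sample_submission.copy()
    submission[submission.columns[1]] = test_preds
    submission.to_csv(os.path.join(WORK_DIR, "submission.csv"), index=False)

# Inside the training loop (placeholders: train_one_epoch, predict_test, sample_submission):
# for epoch in range(1, n_epochs + 1):
#     train_one_epoch(model)
#     if epoch % SAVE_EVERY == 0:
#         save_intermediate_submission(predict_test(model), sample_submission)
```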
6. Potential Future Improvements 🚀
- Checkpointed search
  Serialize/deserialize the solution tree so searches can resume without losing progress (see the sketch below).
- Async architecture
  Decouple reasoning from code execution to enable parallel workflows.
- Ensemble methods
  Combine multiple solutions for more robust final results.
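For checkpointed search, one simple option, sketched here under the assumption that solution-tree nodes are plain picklable Python objects (AutoMind's actual node classes may differ), is to serialize the whole tree between iterations:

```python
import os
import pickle

TREE_CHECKPOINT = "./working/solution_tree.pkl"   # illustrative path

def save_tree(root) -> None:
    """Persist the whole solution tree so an interrupted search can resume later."""
    with open(TREE_CHECKPOINT, "wb") as f:
        pickle.dump(root, f)

def load_tree():
    """Return the previously saved tree, or None if no checkpoint exists yet."""
    if not os.path.exists(TREE_CHECKPOINT):
        return None
    with open(TREE_CHECKPOINT, "rb") as f:
        return pickle.load(f)

# Sketch of the resume logic around the search loop:
# root = load_tree() or build_initial_tree(task)
# for step in range(max_steps):
#     expand_best_node(root)   # placeholder for the actual tree-search step
#     save_tree(root)          # checkpoint after every iteration
```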