arxiv:2507.16795

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Published on Jul 22

Submitted by kh4dien on Jul 23


Authors:

Helena Casademunt, Caden Juang, Adam Karvonen

Abstract

Concept Ablation Fine-Tuning (CAFT) uses interpretability tools to steer LLM generalization away from unintended concepts without altering training data.

AI-generated summary

Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
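The abstract describes the core operation only briefly: concepts are ablated "with linear projections" along given latent directions. The following is a minimal sketch of one plausible reading, in which each activation vector is projected onto the orthogonal complement of the span of the undesired directions. The function names (`orthonormalize`, `ablate`), the toy 3-dimensional vectors, and the use of Gram-Schmidt are illustrative assumptions, not the authors' implementation. The abstract does not say at which layers the projection is applied or how the directions are obtained, so the sketch leaves both out.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `directions`.

    Assumption: concept directions may be non-orthogonal or redundant,
    so they are orthonormalized before being projected out.
    """
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # drop directions already covered by the basis
            basis.append([ri / norm for ri in r])
    return basis


def ablate(activation, basis):
    """Remove the component of `activation` lying in the span of `basis`.

    `basis` must be orthonormal (see `orthonormalize`).
    """
    out = list(activation)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Toy example: two hypothetical "undesired concept" directions in R^3.
concept_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(concept_directions)

h = [3.0, -2.0, 5.0]         # stand-in for a hidden activation
h_ablated = ablate(h, basis)
print(h_ablated)             # only the third coordinate survives
```

After ablation the activation has zero inner product with every concept direction. If this projection is applied in the forward pass during fine-tuning, as the abstract's wording suggests, the training signal cannot be fit through those directions.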


Community

kh4dien

Paper author Paper submitter 6 days ago

Project page: https://cadentj.github.io/caft/

