Activation-Level Preference Unlearning (AG-Masked-LoRA)

Removing Latent Concepts While Preserving Global LLM Reasoning


Abstract

Large Language Models (LLMs) increasingly power recommender systems, yet they often exhibit unstable or biased preference formation. Minor variations in prompt phrasing can activate different internal representations, leading to inconsistent or policy-violating outputs.

This project introduces Activation-Guided Masked LoRA (AG-Masked-LoRA), a targeted unlearning method that identifies and suppresses the activation subspace responsible for an undesired concept—demonstrated here with movie-title generation (“Inception”).

Our pipeline integrates:

  • Activation probing
  • Prompt perturbation stability analysis
  • Gradient and saliency mapping
  • Fisher information profiling
  • Subspace-masked LoRA training
  • Incremental concept-level unlearning

Results show that the model cleanly forgets the targeted concept while preserving reasoning, fluency, and instruction fidelity.


Motivation

LLM-based recommendation and generation systems embed user intent, item associations, and implicit priors in high-dimensional activation pathways. While powerful, this encoding creates challenges:

  1. Over-specific or incorrect recommendations due to activation drift.
  2. Entrenched behaviors from prior fine-tuning.
  3. Difficulty suppressing copyrighted, unsafe, or policy-restricted content.
  4. Entanglement of desirable and undesirable behaviors within shared neuron groups.

Understanding how specific prompts activate internal representations is critical both for trustworthy recommenders and for enterprise-grade safety alignment.

Activation-guided unlearning specifically addresses this: by identifying which neurons encode an unwanted concept and restricting LoRA updates to that region of latent space, we can remove a capability rather than merely filtering tokens.


Phase 1 — Prompt Perturbation & Instability Analysis

Prompt variations intended to be semantically identical yield inconsistent movie-title recommendations, revealing instability in how Qwen2.5-3B processes preference queries.

Figure 1. Semantically equivalent prompts produce different responses, indicating latent-space sensitivity and inconsistent preference encoding.

Figure 2. Direct prompt perturbation: phrasing changes alter the generated movie title, confirming activation-level instability.
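
A minimal sketch of this perturbation check follows. The model name matches the report; the paraphrase list and greedy decoding settings are illustrative assumptions, not the exact prompts behind the figures.

```python
# Minimal sketch of the Phase 1 perturbation check (paraphrases are assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Semantically equivalent phrasings of the same preference query.
paraphrases = [
    "Recommend one movie about dreams within dreams.",
    "Name a single film where dreams are nested inside dreams.",
    "What is a good movie featuring dreams inside of dreams?",
]

for prompt in paraphrases:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
    print(f"{prompt!r} -> {reply!r}")  # divergent titles reveal the instability
```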


Phase 2 — Activation Probing, Saliency, and Gradient Sensitivity

We analyze how each transformer layer responds when the model attempts to generate a movie title.

Layerwise Gradient Sensitivity

Figure 3. Gradient sensitivity map showing which layers’ activations shift most strongly in response to movie-title prompting.

Saliency (Gradient × Activation)

Figure 4. Saliency heatmap identifying layers whose neurons strongly encode the movie-title concept.
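
The gradient × activation saliency can be approximated with a short pass like the sketch below, reusing tok and model from the Phase 1 sketch. The probe target (the first-token logit of " Inception") and the per-layer mean-absolute aggregation are assumptions about the setup.

```python
# Sketch of layerwise gradient x activation saliency (aggregation assumed).
import torch

prompt = "Recommend one movie about dreams within dreams."
target = " Inception"  # concept whose first-token logit we probe

inputs = tok(prompt, return_tensors="pt").to(model.device)
outputs = model(**inputs, output_hidden_states=True)
hidden = outputs.hidden_states[1:]            # one tensor per decoder layer

tgt_id = tok(target, add_special_tokens=False).input_ids[0]
score = outputs.logits[0, -1, tgt_id]         # concept logit at final position
grads = torch.autograd.grad(score, hidden)    # d(score)/d(activation) per layer

# Mean |gradient * activation| per layer approximates one row of Figure 4.
for i, (g, h) in enumerate(zip(grads, hidden)):
    print(f"layer {i:2d}  saliency {(g * h).abs().mean().item():.4e}")
```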

Combined Sensitivity Analysis

Figure 5. Layerwise correlation of saliency, Fisher information, and activation similarity identifies a consistent high-impact region in mid-model layers.


Phase 3 — Semantic Similarity vs Activation Structure

We measure whether semantic similarity across prompts matches activation-level similarity.

Figure 6. Semantic similarity (top) vs activation overlap (bottom).
Prompts that mean the same thing do not necessarily activate the same neurons—revealing a root cause of preference drift.
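
One way to reproduce this comparison, as a sketch: score semantic similarity with an off-the-shelf sentence embedder and activation similarity with mean-pooled hidden states at a mid layer. The embedder choice and the layer index (18 of 36) are assumptions; tok and model are reused from the Phase 1 sketch.

```python
# Sketch contrasting semantic similarity with activation overlap.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def mid_layer_activation(prompt: str, layer: int = 18) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hs.mean(dim=1).squeeze(0)  # mean-pool over token positions

a = "Recommend one movie about dreams within dreams."
b = "Name a single film where dreams are nested inside dreams."

sem = F.cosine_similarity(torch.tensor(embedder.encode(a)),
                          torch.tensor(embedder.encode(b)), dim=0)
act = F.cosine_similarity(mid_layer_activation(a), mid_layer_activation(b), dim=0)
print(f"semantic sim {sem.item():.3f}  vs  activation sim {act.item():.3f}")
# A high semantic score paired with a lower activation score is Figure 6's gap.
```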


Phase 4 — Fisher Information Profiling

Figure 7. Mean gradient norm per layer, pinpointing where the model is most sensitive.

Figure 8. Fisher information heatmap showing which neurons maintain the highest influence on movie-title generation.
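
A standard diagonal-Fisher estimate matches this phase: average the squared gradients of the next-token log-likelihood over concept-eliciting prompts. The sketch below restricts accumulation to the MLP down_proj weights to bound memory; the prompt/target pairing is illustrative, and tok and model are reused from the Phase 1 sketch.

```python
# Sketch of diagonal Fisher profiling over concept-eliciting prompts.
import torch

prompts = [
    "Recommend one movie about dreams within dreams. Inception",
    "Name a single film where dreams are nested inside dreams. Inception",
]
fisher = {}
for text in prompts:
    enc = tok(text, return_tensors="pt").to(model.device)
    model.zero_grad()
    model(**enc, labels=enc["input_ids"]).loss.backward()
    for name, p in model.named_parameters():
        if "down_proj" in name and p.grad is not None:
            fisher[name] = fisher.get(name, 0.0) + p.grad.detach().float() ** 2

for name, f in sorted(fisher.items()):
    print(name, (f / len(prompts)).mean().item())  # per-layer Fisher mass
```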


Phase 5 — Activation-Guided Masked LoRA (AG-Masked-LoRA)

A low-rank update is selectively applied only to neurons identified as encoding the targeted concept.

The LoRA adapter is trained on prompts that normally elicit movie titles, but with a FORGOTTEN/UNKNOWN string as the target output.
The update is masked so that it affects only the sensitive neurons, leaving the rest of the model untouched (see the sketch below).
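
A sketch of the masking mechanism, not the project's exact code. Assumptions: a PEFT LoRA adapter on the MLP down_proj modules, and a hypothetical neuron_masks structure giving a boolean vector over each layer's output neurons (in practice, thresholded from the Phase 4 Fisher scores).

```python
# Sketch of AG-Masked-LoRA's masking idea with PEFT (module choice assumed).
import torch
from peft import LoraConfig, get_peft_model

config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.0,
                    target_modules=["down_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, config)

# Dummy all-False masks for illustration; real masks come from Fisher/saliency.
hidden = peft_model.config.hidden_size
neuron_masks = {i: torch.zeros(hidden, dtype=torch.bool)
                for i in range(peft_model.config.num_hidden_layers)}

def apply_neuron_masks(peft_model, neuron_masks):
    for name, module in peft_model.named_modules():
        if name.endswith("down_proj") and hasattr(module, "lora_B"):
            layer_idx = int(name.split(".layers.")[1].split(".")[0])
            B = module.lora_B["default"].weight      # shape: (out_features, r)
            m = neuron_masks[layer_idx].to(device=B.device,
                                           dtype=B.dtype).unsqueeze(1)
            # Zero the gradient of every lora_B row outside the concept subspace.
            # Since lora_B starts at zero, those rows of the update B @ A stay
            # zero forever, so only the masked output neurons can change.
            B.register_hook(lambda g, m=m: g * m)

apply_neuron_masks(peft_model, neuron_masks)
```

Training then runs a standard causal-LM loop against the FORGOTTEN/UNKNOWN targets; the hooks confine every optimizer step to the identified neurons.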

Figure 9. Incremental unlearning logs showing loss reduction while applying masked LoRA updates.


Phase 6 — Evaluation: Before/After Unlearning

Base Model (Before/After)

Unlearned Model (Before/After)

Figures 10–11. The unlearned model consistently returns FORGOTTEN/UNKNOWN across paraphrased prompts.

Direct Concept Probing (Before/After)

Figure 12. Even when asked explicitly about “Inception,” the model no longer retrieves or describes it—indicating true concept removal.
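
A before/after probe in this style makes the comparison concrete (probe wordings assumed); run it once on the base model and once on the adapter-loaded model from the Phase 5 sketch and diff the outputs.

```python
# Sketch of the direct concept probe on the adapter-loaded model.
probes = [
    "What is the movie Inception about?",
    "Describe the plot of Inception.",
    "Recommend one movie about dreams within dreams.",
]
for prompt in probes:
    inputs = tok(prompt, return_tensors="pt").to(peft_model.device)
    out = peft_model.generate(**inputs, max_new_tokens=32, do_sample=False)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
    print(f"{prompt!r} -> {reply!r}")  # expect FORGOTTEN/UNKNOWN after unlearning
```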


Final Findings

Our experiments confirm that AG-Masked-LoRA performs structural semantic unlearning, not superficial keyword suppression.

Key Results

  • Generalizes across paraphrasing
    Unlearning holds under all prompt-perturbation variants.
  • Consistent neuron clusters identified
    Saliency + Fisher converge on the same mid-model layers.
  • Clear activation shift
    PCA and activation distance show pre/post separation.
  • Global reasoning preserved
    No degradation in unrelated tasks or instruction following.
  • Deployment ready
    Runs cleanly on A100, L4, and Jetson Orin.

Conclusion

AG-Masked-LoRA removes entire latent concepts by rewriting only the activation pathways responsible for them. This makes it suitable for:

  • Safety-critical filtering
  • Policy enforcement
  • Copyright-restricted retrieval removal
  • Reversible domain-specific behavior modules

The base model remains unmodified—only a small adapter controls the behavior.
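
Deployment can then be as simple as attaching or detaching the adapter at load time. A minimal sketch with PEFT; the adapter location is an assumption (the HF repo below lists adapter artifacts as optional).

```python
# Minimal deployment sketch: base model plus the unlearning adapter.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B",
                                            device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Attaching the adapter enables the unlearned behavior; loading the base model
# alone restores the original behavior, so the change is fully reversible.
unlearned = PeftModel.from_pretrained(base, "rameyjm7/llm-preference-unlearning")
```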


Project Resources and Repositories

GitHub — Full Source Code

LLM Preference Unlearning (Activation-Level Framework)
https://github.com/rameyjm7/llm-preference-unlearning

Includes:

  • Modular notebooks (00–08)
  • Unified pipeline notebook
  • Activation probe scripts
  • Saliency, gradient, Fisher analysis
  • Incremental unlearning engine
  • Figures and logs

HuggingFace — Model Card & Artifacts

Activation-Level Preference Unlearning (HF)
https://huggingface.co/rameyjm7/llm-preference-unlearning

Includes:

  • Model card
  • Figures and evaluation
  • Adapter artifacts (optional)
  • Notebook links

License

MIT License.
