Gemma 3 4b unslop experiment

Update

Okay, I've received some good feedback.

This finetune is mostly geared toward fiction writing; unfortunately, all the RP slop is still present.

Also, intelligence has taken a bit of a hit; instruction following isn't great sometimes.

On the plus side, I'm pretty happy with its new writing style when it does work.

Changes for the next version:

  • I'll diversify my dataset a little to hopefully mitigate some overfitting on the prompt format
  • I'll experiment with parameters like learning rate and LoRA rank a bit more
  • I'll see about including some RP data in my statistical analysis so I can target some of that slop, too

This is my first finetune. I used GRPO to reduce slop output.
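
For context, the GRPO setup with TRL's GRPOTrainer looks roughly like this (a minimal sketch with placeholder hyperparameters, not my exact config; see train.py below for the real thing):

```python
# Minimal GRPO sketch with TRL -- hyperparameters are placeholders, see train.py.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def anti_slop_reward(completions, **kwargs):
    # Stand-in for the real reward: n-gram penalties, regex filters,
    # and (after step 200) an MTLD term, all sketched further down.
    return [0.0 for _ in completions]

dataset = load_dataset("json", data_files="dataset_example.json", split="train")

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=anti_slop_reward,
    args=GRPOConfig(
        output_dir="gemma-3-4b-it-unslop",
        num_generations=8,          # completions sampled per prompt
        max_completion_length=512,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()
```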

This is a LoRA adapter; it needs to be merged with google/gemma-3-4b-it.
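
If you'd rather merge it yourself, here's a minimal sketch with peft (output path is arbitrary):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Depending on your transformers version, the multimodal 4b checkpoint may
# need Gemma3ForConditionalGeneration instead of AutoModelForCausalLM.
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model = PeftModel.from_pretrained(base, "electroglyph/gemma-3-4b-it-unslop-GRPO")
model = model.merge_and_unload()  # bake the LoRA deltas into the base weights
model.save_pretrained("gemma-3-4b-it-unslop-merged")
AutoTokenizer.from_pretrained("google/gemma-3-4b-it").save_pretrained("gemma-3-4b-it-unslop-merged")
```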

I'll also upload a Q4_K_M GGUF made with unsloth's imatrix.

Tuning technique:

I generated lots of sample text and then sorted all bigrams and trigrams by frequency.
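
The counting step is nothing fancy; roughly this (a sketch, with `samples` standing in for the generated text):

```python
from collections import Counter

def ngram_counts(texts, n):
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))  # sliding n-grams
    return counts

samples = ["the air tastes like rain", "the air clings to the window"]  # generated text goes here
bigrams = ngram_counts(samples, 2).most_common(50)
trigrams = ngram_counts(samples, 3).most_common(50)
```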

I added some of these to the reward function and penalized their use.
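
As a reward term, that penalty can be as simple as this (the banned list and weight here are illustrative; the real lists are in bigrams.txt / trigrams.txt below):

```python
BANNED_NGRAMS = ["a testament to", "couldn't help but", "eyes widened"]  # illustrative entries

def ngram_penalty(completion, weight=0.5):
    text = completion.lower()
    return -weight * sum(text.count(ng) for ng in BANNED_NGRAMS)
```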

I also added some regex filters for comma overuse, sloppy phrasing, and so on.
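
For example, a comma-overuse filter can be a per-sentence check, and the phrasing filters are just patterns (the thresholds and patterns here are illustrative, not my actual lists):

```python
import re

SLOP_PATTERNS = [
    re.compile(r"\bnot (?:just|only) \w+, but\b", re.IGNORECASE),  # "not just X, but Y"
    re.compile(r"\bsomething (?:shifted|changed)\b", re.IGNORECASE),
]

def regex_penalty(completion, max_commas=3):
    penalty = 0.0
    for sentence in re.split(r"[.!?]+", completion):
        if sentence.count(",") > max_commas:   # comma overuse
            penalty -= 0.25
    for pattern in SLOP_PATTERNS:              # sloppy phrasing
        penalty -= 0.5 * len(pattern.findall(completion))
    return penalty
```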

If the prompt doesn't include "rain" but the model's output does, the output gets penalized.

Same thing for "air". Gemma 3 LOVES to talk about rain and how the air tastes (or clings, etc.)... no more.
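
That check is just prompt-conditioned containment, roughly:

```python
def unprompted_word_penalty(prompt, completion, words=("rain", "air"), weight=1.0):
    p, c = prompt.lower(), completion.lower()
    # Penalize each watched word that shows up uninvited.
    # (Substring check for brevity; a real version should match word boundaries.)
    return -weight * sum(1 for w in words if w not in p and w in c)
```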

200 steps into training, I activate a lexical diversity reward: it penalizes MTLD < 100 and gives increasing rewards up to 120.
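
MTLD (measure of textual lexical diversity) counts how long the text can run before its type-token ratio decays to a threshold (0.72 by convention), averaged over forward and backward passes. Here's a sketch of the metric plus the shaping; the exact penalty/reward magnitudes are my assumptions, not the values from train.py:

```python
def _mtld_pass(tokens, threshold=0.72):
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:  # TTR decayed: one full factor
            factors += 1.0
            types, count = set(), 0
    if count:  # partial factor at the end
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else 0.0

def mtld(text):
    tokens = text.lower().split()
    return (_mtld_pass(tokens) + _mtld_pass(tokens[::-1])) / 2.0

def mtld_reward(completion, lo=100.0, hi=120.0):
    score = mtld(completion)
    if score < lo:
        return -1.0                            # flat penalty below 100 (assumed magnitude)
    return min((score - lo) / (hi - lo), 1.0)  # ramps 0 -> 1 between 100 and 120
```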

There's a callback for early stopping if the reward stays high, but it didn't kick in this run.
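
That hook can be an ordinary transformers TrainerCallback watching the logged reward (a sketch; the target, patience, and logged key name are assumptions):

```python
from transformers import TrainerCallback

class RewardEarlyStop(TrainerCallback):
    def __init__(self, target=0.9, patience=20):
        self.target, self.patience, self.streak = target, patience, 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        # GRPOTrainer logs a mean reward each logging step; the key name may vary.
        if logs and logs.get("reward", float("-inf")) >= self.target:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.patience:
            control.should_training_stop = True  # ends training cleanly

# Registered via trainer.add_callback(RewardEarlyStop()).
```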

This was trained on ~15 million tokens on a single 3090. I'm sharing my code so people can try their own finetuning runs.

I'll probably keep iterating on this a bit, and may update this model.

Training code: train.py

I can't share my dataset, but here's an example of what it looks like: dataset_example.json

Gemma 3 4b common bigrams, most common first: bigrams.txt

Gemma 3 4b common trigrams, most common first: trigrams.txt
