A sequel! The new Nanuq series is meant to be as a testing grounds for my GRPO experiments, This model is meant to have great Instruct Following and System prompt Adherence in Creative Scenarios.
Built ontop of Austral Xgen 9B, I made an RL env using PrimeIntellect-ai/verifiers and implemented InternLM/POLAR in said env, then using Pocketdoc's Systemmax dataset, I finetuned the model for 150 steps and this was the result.
There's alot of things i could do different, As the reward almost falls flat as soon as you get out of warm-up but this model was pretty decent so decided to release it, Hope people enjoy it!