Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

This repository hosts models and resources from the Re:Form project, as detailed in the paper "Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny" (arXiv:2507.16331).

Re:Form introduces a novel approach to formal software verification: it grounds Large Language Models (LLMs) in rigorous formal systems such as Dafny, enabling automatic, mathematically provable verification of their reasoning processes and outcomes. This addresses the reliability and scalability challenges faced by LLMs that operate on informal natural language. The project systematically explores how to reduce reliance on human priors by integrating an automatic, scalable data curation pipeline with careful Reinforcement Learning (RL) designs that leverage feedback from formal language verifiers.

Overall Pipeline

Key Highlights

  • Formal Language Grounding: Re:Form grounds LLMs in formal languages like Dafny, enabling automatic and mathematically provable verification of their reasoning and outcomes.
  • Reduced Human Priors: It systematically minimizes the need for extensive human-annotated chain-of-thought and other priors for complex programming tasks.
  • Automated Data Curation: Introduces an automatic and scalable pipeline for curating training data.
  • RL Integration: Incorporates careful Reinforcement Learning designs integrated with feedback from formal language verifiers.
  • DafnyComp Benchmark: Introduces DafnyComp, a new benchmark of compositional formal programs with auto-formalized specifications.
  • Superior Performance: Even small Re:Form models (e.g., 0.5B) demonstrate the ability to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL further enhances performance, achieving stronger generalization to out-of-domain tasks and outperforming strong baselines on DafnyComp.

Model Checkpoints

The project provides Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) fine-tuned model checkpoints. You can find them under the Veri-Code organization on Hugging Face.

Available checkpoints:

  • SFT Models: sft_0.5B, sft_1.5B, sft_3B, sft_7B, sft_14B
  • RL Fine-tuned Model: 14B-RL-entropy
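If you only need the raw checkpoint files (for example, to serve them with another framework), they can be fetched with the huggingface_hub library. This is a minimal sketch; the repo ID follows the checkpoint names listed above, and any of them can be substituted.

from huggingface_hub import snapshot_download

# Download a checkpoint's files (weights, tokenizer, config) into the local HF cache
# and return the local path. Substitute any checkpoint listed above.
local_dir = snapshot_download(repo_id="Veri-Code/sft_1.5B")
print(f"Checkpoint files downloaded to: {local_dir}")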

Datasets

The project utilizes the following datasets:

  • Python2Dafny: 18k examples
  • DafnyComp: 300 examples

These datasets are in JSON format and can be transformed into parquet files for training using the src.data_preprocess script available in the GitHub repository.
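The repository script handles the project's exact schema; if you only want to see the basic JSON-to-parquet step, a minimal pandas sketch is shown below. The file names are placeholders, not the actual dataset paths.

import pandas as pd

# Minimal sketch of a JSON -> parquet conversion.
# "examples.json" / "examples.parquet" are placeholder paths; the project's own
# schema and splits are produced by the src.data_preprocess script.
df = pd.read_json("examples.json")
df.to_parquet("examples.parquet", index=False)  # requires pyarrow or fastparquet
print(f"Wrote {len(df)} rows to examples.parquet")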

How to Use

You can easily load and use the Re:Form models with the Hugging Face transformers library. This example demonstrates how to generate Dafny code based on a prompt.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer (replace with your desired checkpoint, e.g., "Veri-Code/sft_14B")
model_name = "Veri-Code/sft_1.5B" 
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16, # Use torch.float16 if bfloat16 is not supported by your hardware
    device_map="auto",
    trust_remote_code=True # Required for models with custom code or architecture extensions
)

# Example prompt for Dafny code generation (Qwen2 specific chat format)
# The model is trained to generate Dafny code given a context or problem statement.
prompt = """<|im_start|>user
Here is a partially written sorting method in Dafny; complete the body so that it verifies:
```dafny
method InsertionSort(a: array<int>)
  modifies a
  ensures forall i, j :: 0 <= i < j < a.Length ==> a[i] <= a[j]
{
```<|im_end|>
<|im_start|>assistant
```dafny"""

# Tokenize the prompt. The Qwen2-style chat template is applied manually in the
# string above; make sure the prompt matches the format used during training.
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Generate Dafny code
generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,  # Avoid padding/attention warnings
    max_new_tokens=256, # Adjust as needed for the expected code length
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id, # Ensure generation stops at the EOS token
)

# Decode and print only the newly generated tokens (drop the prompt prefix)
generated_text = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=False
)
print(generated_text)

# Note: The exact prompting format for optimal Dafny code generation and verification might
# involve specific tokens or structures as used during the model's training.
# Refer to the [GitHub repository](https://github.com/Veri-Code/ReForm) for more detailed
# usage examples, training scripts, and interaction with the Dafny environment.
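To check whether the model's output actually verifies, the generated code can be written to a .dfy file and passed to the Dafny verifier. The sketch below assumes generated_text contains a complete, self-contained Dafny program and that the dafny CLI is installed and on the PATH; it is not the project's official evaluation harness.

import subprocess
import tempfile

# Strip a trailing Markdown fence, if the model emitted one.
dafny_code = generated_text.split("```")[0]

# Write the candidate program to a temporary .dfy file.
with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy", delete=False) as f:
    f.write(dafny_code)
    dfy_path = f.name

# Run the Dafny verifier (Dafny 4.x CLI syntax); exit code 0 means it verified.
result = subprocess.run(["dafny", "verify", dfy_path], capture_output=True, text=True)
print(result.stdout)
print("Verified." if result.returncode == 0 else "Verification failed.")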

Citation

If you use this work in your research, please cite:

@misc{yan2025reformreducinghuman,
      title={Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny}, 
      author={Chuanhao Yan and Fengdi Che and Xuhan Huang and Xu Xu and Xin Li and Yizhi Li and Xingwei Qu and Jingzhe Shi and Zhuangzhuang He and Chenghua Lin and Yaodong Yang and Binhang Yuan and Hang Zhao and Yu Qiao and Bowen Zhou and Jie Fu},
      year={2025},
      eprint={2507.16331},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.16331}, 
}