|
--- |
|
license: mit |
|
--- |
|
|
|
# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning π |
|
|
|
Welcome to the GitHub repository for **DeepTheorem** π, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems π. |
|
|
|
## Overview π |
|
|
|
<p align="center"> |
|
<img src="frontpage.png" width="800" /> |
|
</p> |
|
|
|
Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs π§ . However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing: |
|
|
|
1. **A Large-Scale Theorem Proving Dataset** π: |
|
- Contains **high-quality, IMO-level informal theorems and proofs** across diverse mathematical domains π. |
|
- Rigorously annotated for correctness, difficulty, and topic categories β
. |
|
- Includes systematically constructed **verifiable theorem variants** for robust evaluation π. |
|
|
|
2. **RL-Zero** π€: |
|
- A novel **reinforcement learning strategy** tailored for informal theorem proving βοΈ. |
|
- Utilizes verified theorem variants to incentivize robust mathematical inference π‘. |
|
|
|
Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1) |
|
|
|
## Performance π |
|
Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models. |
|
|
|
| **Model** | **FIMO** | | **HMMT** | | **Putnam** | | **Avg.(\#Rank)** | | |
|
| :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: | |
|
| | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | |
|
| Gemini2\.5-Pro | 57\.14 | 54\.06 | 57\.63 | 49\.82 | 64\.58 | 58\.75 | 59\.78(\#2) | 54\.21(\#3) | |
|
| o1-mini | 60\.32 | 55\.23 | 35\.59 | 30\.90 | 61\.46 | 52\.88 | 52\.46(\#4) | 46\.34(\#4) | |
|
| o1 | 66\.67 | 61\.00 | 47\.46 | 47\.30 | 62\.50 | 57\.55 | 58\.88(\#3) | 55\.28(\#2) | |
|
| o3-mini | 80\.95 | 77\.61 | 45\.76 | 43\.47 | 78\.12 | 75\.12 | 68\.28(\#1) | 65\.40(\#1) | |
|
| *DeepTheorem-RL-7B | 55\.56 | 39\.07 | 28\.81 | 20\.85 | 57\.29 | 42\.20 | 47\.22(\#5) | 34\.04(\#5) | |
|
| *DeepTheorem-RL-3B | 38\.10 | 23\.39 | 25\.42 | 13\.56 | 52\.08 | 33\.84 | 38\.53 | 23.60 | |
|
| *DeepTheorem-RL-1.5B | 31\.75 | 15\.23 | 23\.73 | 10\.15 | 52\.08 | 22\.79 | 35\.85 | 16.06 | |
|
|
|
|
|
## DeepTheorem Dataset π |
|
|
|
The DeepTheorem dataset comprises **121K IMO-level informal theorems and proofs** spanning diverse mathematical domains π. Each theorem-proof pair is rigorously annotated for: |
|
- **o3-mini Proofs** ποΈ: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model β
. |
|
- **Truth Value** π: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false βοΈ. |
|
- **Difficulty** ποΈ: Categorized by complexity to suit various LLM capabilities π§©. |
|
- **Topic Categories** ποΈ: Covering algebra, geometry, number theory, and more π. |
|
- **Variants** π: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem π. |