Jiahao004's picture
Update README.md
c3bda66 verified
---
license: mit
---
# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning πŸš€
Welcome to the GitHub repository for **DeepTheorem** πŸŽ‰, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.
## Overview πŸ“–
<p align="center">
<img src="frontpage.png" width="800" />
</p>
Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:
1. **A Large-Scale Theorem Proving Dataset** πŸ“Š:
- Contains **high-quality, IMO-level informal theorems and proofs** across diverse mathematical domains πŸ“š.
- Rigorously annotated for correctness, difficulty, and topic categories βœ….
- Includes systematically constructed **verifiable theorem variants** for robust evaluation πŸ”.
2. **RL-Zero** πŸ€–:
- A novel **reinforcement learning strategy** tailored for informal theorem proving βš™οΈ.
- Utilizes verified theorem variants to incentivize robust mathematical inference πŸ’‘.
Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)
## Performance πŸš€
Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.
| **Model** | **FIMO** | | **HMMT** | | **Putnam** | | **Avg.(\#Rank)** | |
| :--------------------- | :------: | :-----: | :------: | :-----: | :--------: | :-----: | :------: | :-----: |
| | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* | *out.* | *proc.* |
| Gemini2\.5-Pro | 57\.14 | 54\.06 | 57\.63 | 49\.82 | 64\.58 | 58\.75 | 59\.78(\#2) | 54\.21(\#3) |
| o1-mini | 60\.32 | 55\.23 | 35\.59 | 30\.90 | 61\.46 | 52\.88 | 52\.46(\#4) | 46\.34(\#4) |
| o1 | 66\.67 | 61\.00 | 47\.46 | 47\.30 | 62\.50 | 57\.55 | 58\.88(\#3) | 55\.28(\#2) |
| o3-mini | 80\.95 | 77\.61 | 45\.76 | 43\.47 | 78\.12 | 75\.12 | 68\.28(\#1) | 65\.40(\#1) |
| *DeepTheorem-RL-7B | 55\.56 | 39\.07 | 28\.81 | 20\.85 | 57\.29 | 42\.20 | 47\.22(\#5) | 34\.04(\#5) |
| *DeepTheorem-RL-3B | 38\.10 | 23\.39 | 25\.42 | 13\.56 | 52\.08 | 33\.84 | 38\.53 | 23.60 |
| *DeepTheorem-RL-1.5B | 31\.75 | 15\.23 | 23\.73 | 10\.15 | 52\.08 | 22\.79 | 35\.85 | 16.06 |
## DeepTheorem Dataset πŸ“Š
The DeepTheorem dataset comprises **121K IMO-level informal theorems and proofs** spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:
- **o3-mini Proofs** πŸ–‹οΈ: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model βœ….
- **Truth Value** πŸ”: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false βœ”οΈ.
- **Difficulty** 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
- **Topic Categories** πŸ—‚οΈ: Covering algebra, geometry, number theory, and more πŸ“˜.
- **Variants** πŸ”„: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem πŸ”€.