Update README.md

c3bda66 verified 5 months ago

3.68 kB

	---
	license: mit
	---

	# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning 🚀

	Welcome to the GitHub repository for DeepTheorem 🎉, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.

	## Overview 📖

	<p align="center">
	<img src="frontpage.png" width="800" />
	</p>

	Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:

	1. A Large-Scale Theorem Proving Dataset 📊:
	- Contains high-quality, IMO-level informal theorems and proofs across diverse mathematical domains 📚.
	- Rigorously annotated for correctness, difficulty, and topic categories ✅.
	- Includes systematically constructed verifiable theorem variants for robust evaluation 🔍.

	2. RL-Zero 🤖:
	- A novel reinforcement learning strategy tailored for informal theorem proving ⚙️.
	- Utilizes verified theorem variants to incentivize robust mathematical inference 💡.

	Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)

	## Performance 🚀
	Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.

	\| Model \| FIMO \| \| HMMT \| \| Putnam \| \| Avg.(\#Rank) \| \|
	\| :--------------------- \| :------: \| :-----: \| :------: \| :-----: \| :--------: \| :-----: \| :------: \| :-----: \|
	\| \| out. \| proc. \| out. \| proc. \| out. \| proc. \| out. \| proc. \|
	\| Gemini2\.5-Pro \| 57\.14 \| 54\.06 \| 57\.63 \| 49\.82 \| 64\.58 \| 58\.75 \| 59\.78(\#2) \| 54\.21(\#3) \|
	\| o1-mini \| 60\.32 \| 55\.23 \| 35\.59 \| 30\.90 \| 61\.46 \| 52\.88 \| 52\.46(\#4) \| 46\.34(\#4) \|
	\| o1 \| 66\.67 \| 61\.00 \| 47\.46 \| 47\.30 \| 62\.50 \| 57\.55 \| 58\.88(\#3) \| 55\.28(\#2) \|
	\| o3-mini \| 80\.95 \| 77\.61 \| 45\.76 \| 43\.47 \| 78\.12 \| 75\.12 \| 68\.28(\#1) \| 65\.40(\#1) \|
	\| *DeepTheorem-RL-7B \| 55\.56 \| 39\.07 \| 28\.81 \| 20\.85 \| 57\.29 \| 42\.20 \| 47\.22(\#5) \| 34\.04(\#5) \|
	\| *DeepTheorem-RL-3B \| 38\.10 \| 23\.39 \| 25\.42 \| 13\.56 \| 52\.08 \| 33\.84 \| 38\.53 \| 23.60 \|
	\| *DeepTheorem-RL-1.5B \| 31\.75 \| 15\.23 \| 23\.73 \| 10\.15 \| 52\.08 \| 22\.79 \| 35\.85 \| 16.06 \|


	## DeepTheorem Dataset 📊

	The DeepTheorem dataset comprises 121K IMO-level informal theorems and proofs spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:
	- o3-mini Proofs 🖋️: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model ✅.
	- Truth Value 🔍: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false ✔️.
	- Difficulty 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
	- Topic Categories 🗂️: Covering algebra, geometry, number theory, and more 📘.
	- Variants 🔄: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem 🔀.

	---
	license: mit
	---

	# DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning 🚀

	Welcome to the GitHub repository for DeepTheorem 🎉, a comprehensive framework for enhancing large language model (LLM) mathematical reasoning through informal, natural language-based theorem proving. This project introduces a novel approach to automated theorem proving (ATP) by leveraging the informal reasoning strengths of LLMs, moving beyond traditional formal proof systems 🌟.

	## Overview 📖

	<p align="center">
	<img src="frontpage.png" width="800" />
	</p>

	Theorem proving is a critical benchmark for evaluating complex reasoning in LLMs 🧠. However, formal proof systems often misalign with the informal, natural language knowledge LLMs acquire during pre-training. DeepTheorem addresses this gap by introducing:

	1. A Large-Scale Theorem Proving Dataset 📊:
	- Contains high-quality, IMO-level informal theorems and proofs across diverse mathematical domains 📚.
	- Rigorously annotated for correctness, difficulty, and topic categories ✅.
	- Includes systematically constructed verifiable theorem variants for robust evaluation 🔍.

	2. RL-Zero 🤖:
	- A novel reinforcement learning strategy tailored for informal theorem proving ⚙️.
	- Utilizes verified theorem variants to incentivize robust mathematical inference 💡.

	Deeptheorem demonstrates its superior performance over both opensource and commercial models (i.e Gemini and o1)

	## Performance 🚀
	Deeptheorem achieves the #Rank 5 position along all the commerical models and open source models.

	\| Model \| FIMO \| \| HMMT \| \| Putnam \| \| Avg.(\#Rank) \| \|
	\| :--------------------- \| :------: \| :-----: \| :------: \| :-----: \| :--------: \| :-----: \| :------: \| :-----: \|
	\| \| out. \| proc. \| out. \| proc. \| out. \| proc. \| out. \| proc. \|
	\| Gemini2\.5-Pro \| 57\.14 \| 54\.06 \| 57\.63 \| 49\.82 \| 64\.58 \| 58\.75 \| 59\.78(\#2) \| 54\.21(\#3) \|
	\| o1-mini \| 60\.32 \| 55\.23 \| 35\.59 \| 30\.90 \| 61\.46 \| 52\.88 \| 52\.46(\#4) \| 46\.34(\#4) \|
	\| o1 \| 66\.67 \| 61\.00 \| 47\.46 \| 47\.30 \| 62\.50 \| 57\.55 \| 58\.88(\#3) \| 55\.28(\#2) \|
	\| o3-mini \| 80\.95 \| 77\.61 \| 45\.76 \| 43\.47 \| 78\.12 \| 75\.12 \| 68\.28(\#1) \| 65\.40(\#1) \|
	\| *DeepTheorem-RL-7B \| 55\.56 \| 39\.07 \| 28\.81 \| 20\.85 \| 57\.29 \| 42\.20 \| 47\.22(\#5) \| 34\.04(\#5) \|
	\| *DeepTheorem-RL-3B \| 38\.10 \| 23\.39 \| 25\.42 \| 13\.56 \| 52\.08 \| 33\.84 \| 38\.53 \| 23.60 \|
	\| *DeepTheorem-RL-1.5B \| 31\.75 \| 15\.23 \| 23\.73 \| 10\.15 \| 52\.08 \| 22\.79 \| 35\.85 \| 16.06 \|


	## DeepTheorem Dataset 📊

	The DeepTheorem dataset comprises 121K IMO-level informal theorems and proofs spanning diverse mathematical domains 🌐. Each theorem-proof pair is rigorously annotated for:
	- o3-mini Proofs 🖋️: Ensuring mathematical accuracy through proofs generated or verified by the o3-mini model ✅.
	- Truth Value 🔍: The truth value of the theorem extracted from the o3-mini proofs, indicating whether the theorem is true or false ✔️.
	- Difficulty 🎚️: Categorized by complexity to suit various LLM capabilities 🧩.
	- Topic Categories 🗂️: Covering algebra, geometry, number theory, and more 📘.
	- Variants 🔄: Positive or negative variants of the theorem that share the same or inverse truth value of the original theorem 🔀.