CogniSQL: Lightweight Reinforced Reasoning for Efficient SQL Generation
Overview
Welcome to CogniSQL! This organization hosts research datasets and resources for advancing Text-to-SQL generation through reinforcement learning. Our work focuses on building efficient, execution-aligned SQL generation systems that scale effectively while maintaining accuracy on complex database queries.
Research Focus
CogniSQL develops novel approaches to translate natural language into SQL (Text-to-SQL) using:
- Reinforcement Learning (RL) Frameworks: Lightweight reward signals based on execution correctness and format-tag compliance
- Efficient Training: State-of-the-art performance on a smaller 7B parameter backbone (compared with baseline models of 123B to 236B parameters)
- Execution-Aligned Generation: Direct optimization for producing correct, executable SQL without intermediate supervision
- Interpretable Reasoning: Multi-path reasoning traces for better understanding of model behavior
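As a rough illustration of the reward design described in the list above, the sketch below combines a format-tag check with an execution-correctness check. It is not the authors' implementation: the tag names `<reasoning>` and `<answer>`, the weights 0.2 and 0.8, the use of SQLite, and all function names are assumptions made for this example.

```python
import re
import sqlite3

# Hypothetical tag layout; the source only states "format-tag compliance".
TAG_PATTERN = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def extract_sql(completion):
    """Return the SQL inside <answer> tags, or None if the format is violated."""
    match = TAG_PATTERN.match(completion)
    return match.group(1).strip() if match else None


def run_query(conn, sql):
    """Execute a query and return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so row order and mixed NULL/int values do not affect comparison.
    return sorted(rows, key=repr)


def reward(completion, gold_sql, conn, format_weight=0.2, exec_weight=0.8):
    """Format reward plus execution reward (weights are illustrative assumptions)."""
    sql = extract_sql(completion)
    if sql is None:
        return 0.0  # no tags, no reward
    score = format_weight
    predicted = run_query(conn, sql)
    expected = run_query(conn, gold_sql)
    if predicted is not None and predicted == expected:
        score += exec_weight
    return score


def make_demo_db():
    """Tiny in-memory database used only to exercise the reward function."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)", [("Ana", 31), ("Bo", 45)]
    )
    return conn
```

Under these assumptions, a completion that follows the tag format but returns the wrong rows still earns the small format reward, while only an executable query whose result matches the gold query earns the full score. No intermediate supervision of the reasoning text is involved.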
Key Achievements
- State-of-the-Art Results: Outperforms SFT CodeS-7B, DeepSeek-Coder 236B, and Mistral 123B on the BIRD benchmark
- Efficient Training: Trained on just 4 NVIDIA A100 GPUs (40GB VRAM each)
- Resource-Constrained Deployment: Enables practical Text-to-SQL systems for real-world applications
- Open Research: Two curated datasets released for community research
Datasets
This organization maintains two high-quality datasets:
- Reasoning_Traces: 5,024 reasoning traces with varying context lengths for interpretable SQL generation
- Positive_Sample_Corpus: 36,356 weakly supervised queries, each annotated with six semantically diverse reasoning paths
Both datasets are designed to support research in efficient and interpretable Text-to-SQL modeling.
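To make the "six reasoning paths per query" structure concrete, here is a minimal sketch of how one Positive_Sample_Corpus record might be represented and checked in memory. The class name and the field names `question`, `sql`, and `reasoning_paths` are hypothetical and may not match the released schema; consult the dataset card for the actual columns.

```python
from dataclasses import dataclass, field
from typing import List

PATHS_PER_QUERY = 6  # taken from the dataset description above


@dataclass
class PositiveSample:
    """Hypothetical in-memory view of one record (field names are assumptions)."""

    question: str
    sql: str
    reasoning_paths: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True if the record carries exactly six non-empty reasoning paths."""
        return len(self.reasoning_paths) == PATHS_PER_QUERY and all(
            path.strip() for path in self.reasoning_paths
        )
```

A check like `is_complete()` can be useful when filtering records before training, since a record with missing or empty paths would otherwise silently reduce the diversity the corpus is meant to provide.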
Citation
If you use our datasets or research, please cite the following paper:
@article{gajjar2025cognisql,
title={CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation},
author={Gajjar, Kushal and Sikchi, Harshit and Gautam, Arpit Singh and Hammons, Marc and Jha, Saurabh},
journal={arXiv preprint arXiv:2507.06013},
year={2025},
url={https://arxiv.org/abs/2507.06013}
}
arXiv: 2507.06013
Research Team
- Kushal Gajjar
- Harshit Sikchi
- Arpit Singh Gautam
- Marc Hammons
- Saurabh Jha
Applications
Our work enables:
- Database query systems that understand natural language
- Efficient SQL generation in resource-constrained environments
- Interpretable AI systems with transparent reasoning traces
- Production-grade Text-to-SQL pipelines
License
Please refer to individual dataset cards for specific licensing information.