---
title: RAG Benchmark Leaderboard
emoji: 📚
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---

# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.

## Features

- **Version Comparison**: Compare model performance across different versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generation and retrieval metrics
- **Customizable Views**: Filter and sort models by different criteria
- **Easy Submission**: Simple API for submitting your model results

## Installation

```bash
pip install -r requirements.txt
```

## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This starts a Gradio server; you can then access the leaderboard in your browser at http://localhost:7860.

## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3},
)

# Submit results
benchmark.submit_results(results)
```

## Data Format

The `results.json` file has the following structure (inline comments are for documentation only):

```json
{
  "items": {
    "1.0": {                          // Dataset version
      "model1": {                     // Submission ID
        "model_name": "Model Name",
        "timestamp": "2025-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": { "top_k": 3 }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```

## License

MIT

## Metrics Tracked

The leaderboard tracks the following metrics for each RAG (Retrieval-Augmented Generation) implementation.

### Retrieval Metrics

- **Hit Rate**: Proportion of questions for which at least one relevant document is retrieved
- **MRR (Mean Reciprocal Rank)**: Average reciprocal rank of the first relevant document

### Generation Metrics

- **ROUGE-1**: Unigram overlap with the reference answer
- **ROUGE-2**: Bigram overlap with the reference answer
- **ROUGE-L**: Longest common subsequence with the reference answer
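
For reference, the retrieval metrics above can be computed as in the minimal sketch below. This is not the benchmark's own implementation; the function names and the list/set data layout are illustrative assumptions.

```python
from typing import List

def hit_rate(retrieved_ids: List[List[str]], relevant_ids: List[set]) -> float:
    """Fraction of questions where at least one retrieved document is relevant."""
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in retrieved)
    )
    return hits / len(retrieved_ids)

def mean_reciprocal_rank(retrieved_ids: List[List[str]], relevant_ids: List[set]) -> float:
    """Average of 1 / rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_ids)

# Example: two questions, top-3 retrieval
retrieved = [["d1", "d7", "d3"], ["d9", "d2", "d4"]]
relevant = [{"d3"}, {"d5"}]
print(hit_rate(retrieved, relevant))              # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # (1/3 + 0) / 2 ≈ 0.17
```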
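
If you want to inspect submissions outside the Gradio app, the `results.json` structure shown above can be flattened into a table. A minimal sketch, assuming the file sits in the working directory, is stored as plain JSON (without the documentation comments), and that pandas is installed:

```python
import json

import pandas as pd

# Load the leaderboard data (path is an assumption; adjust to your checkout)
with open("results.json") as f:
    data = json.load(f)

# Collect one row per submission for the latest dataset version
version = data["last_version"]
rows = []
for submission_id, entry in data["items"].get(version, {}).items():
    rows.append({
        "submission_id": submission_id,
        "model_name": entry["model_name"],
        "retriever_type": entry["config"]["retriever_type"],
        **entry["metrics"]["retrieval"],
        **entry["metrics"]["generation"],
    })

# Sort by MRR, highest first
df = pd.DataFrame(rows).sort_values("mrr", ascending=False)
print(df.to_string(index=False))
```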