DeAR-8B-Reranker-Listwise-v1
Model Description
DeAR-8B-Reranker-Listwise-v1 is an 8B parameter listwise neural reranker that generates document rankings through text generation. Unlike pointwise models that score documents independently, this model considers multiple documents simultaneously and produces rankings with Chain-of-Thought reasoning.
Model Details
- Model Type: Listwise Reranker (Causal Language Model)
- Base Model: LLaMA-3.1-8B
- Parameters: 8 billion
- Training Method: Supervised Fine-tuning with Chain-of-Thought
- Training Data: DeAR-COT Dataset
- Training Framework: LLaMA-Factory
- Precision: BFloat16
Key Features
✅ Listwise Ranking: Considers inter-document dependencies
✅ Chain-of-Thought: Generates reasoning for ranking decisions
✅ State-of-the-Art: Best performance on NovelEval (90.97 avg. NDCG)
✅ Flexible: Handles variable numbers of documents
✅ Interpretable: Provides explanations for rankings
Performance
| Benchmark | NDCG@10 | vs. GPT-4 |
|---|---|---|
| TREC DL19 | 77.91 | +2.32 |
| TREC DL20 | 75.63 | +5.07 |
| NovelEval (avg.) | 90.97 | +3.09 |
| BEIR (Avg) | 46.8 | +2.3 |
Key Achievement: Outperforms GPT-4 on NovelEval by +3.09 points!
Usage
Quick Start
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model
model_path = "abdoelsayed/dear-8b-reranker-listwise-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Prepare input
query = "When did Thomas Edison invent the light bulb?"
documents = [
"Lightning strike at Seoul National University",
"Thomas Edison tried to invent a device for car but failed",
"Coffee is good for diet",
"KEPCO fixes light problems",
"Thomas Edison invented the light bulb in 1879",
]
# Create listwise prompt
doc_list = "\n".join([f"[{i}] {doc}" for i, doc in enumerate(documents)])
prompt = f"""I will provide you with {len(documents)} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.
{doc_list}
Search Query: {query}.
Rank the passages above based on their relevance to the search query. Output the ranking as a list of numbers."""
# Generate ranking
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,  # greedy decoding; a sampling temperature has no effect here
        pad_token_id=tokenizer.pad_token_id,
    )
ranking_text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(f"Ranking: {ranking_text}")
# Output: [4] > [1] > [0] > [3] > [2]
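To turn the generated string back into an ordering over the input passages, the bracketed identifiers can be parsed with a small regex. A minimal sketch, assuming the `[i] > [j] > ...` output format shown above (the `ListwiseReranker` class below wraps the same logic in `parse_ranking`):

import re

# Extract identifiers in the order the model produced them, dropping out-of-range ids
ranked_ids = [int(i) for i in re.findall(r"\[(\d+)\]", ranking_text) if int(i) < len(documents)]
# Append any documents the model omitted so every passage receives a rank
ranked_ids += [i for i in range(len(documents)) if i not in ranked_ids]

for rank, idx in enumerate(ranked_ids, 1):
    print(f"{rank}. {documents[idx]}")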
Complete Reranking Pipeline
import torch
from typing import List
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
class ListwiseReranker:
    def __init__(self, model_path: str, device: str = "auto"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map=device,
            low_cpu_mem_usage=True,
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def create_prompt(self, query: str, documents: List[str], max_doc_len: int = 300) -> str:
        """Create the listwise ranking prompt."""
        doc_list = "\n".join(f"[{i}] {doc[:max_doc_len]}" for i, doc in enumerate(documents))
        return (
            f"I will provide you with {len(documents)} passages, each indicated by a number identifier [].\n"
            f"Rank the passages based on their relevance to the search query: {query}.\n"
            f"{doc_list}\n"
            f"Search Query: {query}.\n"
            "Rank the passages above based on their relevance to the search query. "
            "Output the ranking as a list of numbers."
        )

    def parse_ranking(self, output_text: str, num_docs: int) -> List[int]:
        """Parse model output into a ranked list of document indices."""
        numbers = [int(n) for n in re.findall(r'\[(\d+)\]', output_text) if int(n) < num_docs]
        ranked = []
        for n in numbers:           # keep the first occurrence of each identifier
            if n not in ranked:
                ranked.append(n)
        for i in range(num_docs):   # append any documents the model omitted
            if i not in ranked:
                ranked.append(i)
        return ranked[:num_docs]

    def rerank(self, query: str, documents: List[str], max_new_tokens: int = 50) -> List[int]:
        """
        Rerank documents for a query.

        Args:
            query: Search query
            documents: List of document texts
            max_new_tokens: Max tokens to generate

        Returns:
            List of document indices ranked by relevance
        """
        prompt = self.create_prompt(query, documents)
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # greedy decoding for deterministic rankings
                pad_token_id=self.tokenizer.pad_token_id,
            )
        output_text = self.tokenizer.decode(
            outputs[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True,
        )
        return self.parse_ranking(output_text, len(documents))
# Example usage
reranker = ListwiseReranker("abdoelsayed/dear-8b-reranker-listwise-v1")
query = "What are the health benefits of green tea?"
documents = [
"Green tea is a popular beverage in Asian countries.",
"Studies show green tea contains antioxidants that may reduce inflammation.",
"Coffee is another caffeinated drink consumed worldwide.",
"Green tea has been linked to improved brain function and fat loss.",
"The weather today is sunny and warm.",
]
ranking = reranker.rerank(query, documents)
print(f"Ranked indices: {ranking}")
# Output: [1, 3, 0, 2, 4]
# Display ranked documents
for rank, idx in enumerate(ranking, 1):
    print(f"{rank}. {documents[idx]}")
Training Details
Training Data
- Dataset: DeAR-COT
- Format: Instruction-following with ranking outputs
Training Configuration
model_name: meta-llama/Llama-3.1-8B
task_type: sft
training_method: listwise_ranking
framework: LLaMA-Factory
hyperparameters:
  learning_rate: 1e-5
  batch_size: 4
  gradient_accumulation: 4
  epochs: 2
  max_length: 2048
  warmup_ratio: 0.1
  weight_decay: 0.01
  optimizer: adamw_torch
  lr_scheduler: cosine
distributed:
  method: torch.distributed.run
  num_gpus: 4
  deepspeed: zero2
Hardware
- GPUs: 4x NVIDIA A100 (80GB)
- Training Time: ~30 hours
- Framework: LLaMA-Factory with DeepSpeed
- Memory Usage: ~70GB per GPU
Prompt Format
Training Format:
I will provide you with {N} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.
[0] {doc_0}
[1] {doc_1}
...
[N-1] {doc_N-1}
Search Query: {query}.
Rank the passages above based on their relevance to the search query. Output the ranking as a list of numbers.
Answer: [most_relevant] > [second] > ... > [least_relevant]
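For illustration, a single training example in this format can be assembled as below. This is a minimal sketch: the helper name and the exact (prompt, target) split are illustrative and may not match the DeAR-COT schema.

def format_training_example(query, documents, gold_order):
    """Assemble one (prompt, target) pair in the listwise format above.
    gold_order lists document indices from most to least relevant (illustrative)."""
    doc_list = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents))
    prompt = (
        f"I will provide you with {len(documents)} passages, each indicated by a number identifier [].\n"
        f"Rank the passages based on their relevance to the search query: {query}.\n"
        f"{doc_list}\n"
        f"Search Query: {query}.\n"
        "Rank the passages above based on their relevance to the search query. "
        "Output the ranking as a list of numbers."
    )
    target = "Answer: " + " > ".join(f"[{i}]" for i in gold_order)
    return prompt, target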
Evaluation Results
TREC Deep Learning
| Method | DL19 (NDCG@10) | DL20 (NDCG@10) | Average |
|---|---|---|---|
| BM25 | 50.58 | 47.96 | 49.27 |
| RankGPT-4 | 75.59 | 70.56 | 73.08 |
| DeAR-L-8B | 77.91 | 75.63 | 76.77 |
NovelEval-2306 (Novel Query Generalization)
| Method | NDCG@1 | NDCG@5 | NDCG@10 | Average |
|---|---|---|---|---|
| BM25 | 33.33 | 45.96 | 55.77 | 45.02 |
| RankGPT-4 | 85.71 | 87.49 | 90.45 | 87.88 |
| DeAR-L-8B | 92.86 | 88.04 | 92.01 | 90.97 |
🏆 +3.09 points better than GPT-4 on NovelEval!
BEIR Benchmark
| Dataset | NDCG@10 |
|---|---|
| MS MARCO | 70.2 |
| NQ | 54.1 |
| HotpotQA | 64.5 |
| FiQA | 49.3 |
| ArguAna | 62.1 |
| SciFact | 76.2 |
| TREC-COVID | 88.4 |
| NFCorpus | 40.6 |
| Average | 46.8 |
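All results above are reported as NDCG. For reference, a minimal NDCG@k implementation following the standard definition (not necessarily the exact evaluation script behind these numbers; some setups use 2^rel - 1 gains instead of raw relevance grades):

import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: doc ids in predicted order; relevance: dict mapping doc id -> graded relevance."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0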
Efficiency Analysis
| Metric | Value |
|---|---|
| Inference Time (20 docs) | 11.16s |
| Throughput | ~1.8 docs/sec |
| GPU Memory (inference) | 22GB |
| Model Size (BF16) | 16GB |
Comparison with Other Methods:
- 2.2x faster than RankGPT-4 (24.5s)
- 1.9x faster than RankZephyr (21.6s)
- Similar performance with much better efficiency
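The latency and throughput figures can be approximated with a simple timing loop around the ListwiseReranker defined earlier. A rough sketch (measure_latency is an illustrative helper; results depend heavily on hardware and document length):

import time

def measure_latency(reranker, query, documents, n_runs=5):
    """Average wall-clock seconds per reranking call for one query."""
    reranker.rerank(query, documents)  # warm-up (CUDA kernels, caches)
    start = time.perf_counter()
    for _ in range(n_runs):
        reranker.rerank(query, documents)
    elapsed = (time.perf_counter() - start) / n_runs
    print(f"{elapsed:.2f}s per query (~{len(documents) / elapsed:.1f} docs/sec)")
    return elapsed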
Advantages over Pointwise Models
| Aspect | Pointwise | Listwise (This Model) |
|---|---|---|
| Document Interaction | ❌ Independent | ✅ Considers relationships |
| Reasoning | ❌ None | ✅ Chain-of-Thought |
| Novel Queries | Good | ✅ Excellent (+3-5 NDCG@10) |
| Interpretability | ❌ Score only | ✅ Reasoning provided |
| Speed | ✅ Very Fast (2.2s) | Moderate (11.2s) |
Model Architecture
Input: Listwise Prompt with Query + Multiple Documents
↓
LLaMA-3.1-8B Decoder
↓
Auto-regressive Generation
↓
Output: "[4] > [1] > [0] > [3] > [2]"
↓
Parse to Ranking: [4, 1, 0, 3, 2]
When to Use This Model
Best for:
- ✅ Novel/complex queries requiring reasoning
- ✅ Tasks where interpretability matters
- ✅ Small candidate sets (<100 documents)
- ✅ Research and analysis applications
Consider pointwise models for:
- ❌ Large-scale reranking (1000s of docs)
- ❌ Real-time, low-latency applications
- ❌ When reasoning is not needed
Limitations
- Inference Speed: Slower than pointwise models (~5x)
- Document Count: Limited by context length (~20-50 docs per prompt is optimal); a sliding-window sketch for larger candidate sets is shown below
- Parsing Errors: May occasionally generate malformed rankings
- Cost: Higher computational cost for generation
- Language: English only
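For candidate sets that exceed the context window, a sliding-window strategy can be layered on top of the ListwiseReranker above: rerank overlapping chunks from the bottom of the list upward so strong documents bubble toward the top. A minimal sketch (this strategy is common for listwise rerankers but is not part of this model card's released code):

def sliding_window_rerank(reranker, query, documents, window=20, step=10):
    """Rerank a long candidate list with overlapping windows, back to front."""
    order = list(range(len(documents)))
    end = len(order)
    while end > 0:
        start = max(0, end - window)
        window_ids = order[start:end]
        local_rank = reranker.rerank(query, [documents[i] for i in window_ids])
        order[start:end] = [window_ids[i] for i in local_rank]  # reorder this slice
        end -= step
    return order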
Bias and Ethical Considerations
- Position Bias: May favor documents in certain positions (a simple shuffle-based check is sketched below)
- Training Data Bias: Inherits biases from CoT annotations
- Reasoning Artifacts: Generated explanations may contain hallucinations
- Fairness: Should be evaluated for fairness in your domain
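One simple way to probe the position bias noted above is to rerank shuffled permutations of the same candidates and check how stable the result is. An illustrative sketch, not an official evaluation protocol:

import random

def position_bias_check(reranker, query, documents, n_trials=5, seed=0):
    """Count how often each document is ranked first across shuffled inputs."""
    rng = random.Random(seed)
    top1_counts = {}
    for _ in range(n_trials):
        perm = list(range(len(documents)))
        rng.shuffle(perm)
        ranking = reranker.rerank(query, [documents[i] for i in perm])
        top_doc = perm[ranking[0]]  # map back to the original document id
        top1_counts[top_doc] = top1_counts.get(top_doc, 0) + 1
    return top1_counts  # heavy disagreement across trials suggests position sensitivity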
Related Models
DeAR Listwise:
- DeAR-8B-Listwise-LoRA - LoRA adapter version
Citation
@article{abdallah2025dear,
title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
journal={arXiv preprint arXiv:2508.16998},
year={2025}
}
License
MIT License
More Information
- GitHub: DataScienceUIBK/DeAR-Reranking
- Paper: arXiv:2508.16998
- Collection: DeAR Models