Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-OASIS
Abstract
LLM-Oasis is a comprehensive dataset for factuality evaluation that challenges state-of-the-art LLMs to distinguish factual from unfactual texts.
After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators both to validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task. This highlights the dataset's potential to drive future research in the field.
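As a rough illustration of the evaluation setup described in the abstract, the sketch below scores a factuality evaluator by accuracy over paired factual and unfactual texts. It is a minimal sketch, not the authors' code: the names `LabeledText`, `make_pair`, `accuracy`, and `always_factual` are hypothetical, the two example sentences are invented for illustration and do not come from LLM-Oasis, and the evaluator is assumed to be any function that maps a text to a boolean verdict.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class LabeledText:
    """A text with its gold factuality label (hypothetical structure)."""
    text: str
    is_factual: bool


def make_pair(factual_text: str, unfactual_text: str) -> List[LabeledText]:
    """Turn one factual/unfactual text pair into two labeled examples."""
    return [LabeledText(factual_text, True), LabeledText(unfactual_text, False)]


def accuracy(examples: Iterable[LabeledText],
             evaluator: Callable[[str], bool]) -> float:
    """Fraction of examples where the evaluator's verdict matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(evaluator(ex.text) == ex.is_factual for ex in examples)
    return correct / len(examples)


def always_factual(text: str) -> bool:
    """Trivial baseline (assumption): judges every text as factual."""
    return True


# Toy pair invented for illustration only; not taken from the dataset.
examples = make_pair(
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Rome.",
)
print(accuracy(examples, always_factual))  # 0.5
```

On this toy pair, the trivial baseline gets one of the two texts right. Under this scoring, any evaluator that returns the same verdict for both texts in a pair is correct on exactly one of them.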
Community
LLM-Oasis is presented as the largest resource for training end-to-end factuality evaluators for Large Language Models (LLMs). It addresses key limitations of existing factuality resources, such as domain specificity, limited size, and simplistic verification tasks, by creating pairs of factual and falsified texts from Wikipedia, validated by human annotators. Experiments show that even state-of-the-art models like GPT-4o struggle with this task, achieving at most about 60% accuracy, which highlights the dataset’s potential to advance research in factuality evaluation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FactLens: Benchmarking Fine-Grained Fact Verification (2024)
- Leveraging the Domain Adaptation of Retrieval Augmented Generation Models for Question Answering and Reducing Hallucination (2024)
- VERITAS: A Unified Approach to Reliability Evaluation (2024)
- Measuring the Groundedness of Legal Question-Answering Systems (2024)
- Improving Model Factuality with Fine-grained Critique-based Evaluator (2024)
- Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown (2024)
- Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models (2024)
 Pere-Lluis Huguet Cabot