FineData

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

hynky new activity 2 days ago

HuggingFaceFW/finepdfs: The "file_path" data field appears to primarily contain cc-index paths rather than WARC paths.

hynky new activity 2 days ago

HuggingFaceFW/finepdfs: A Few Questions About the Implementation Details of the finepdfs Project

hynky new activity 11 days ago

HuggingFaceFW/finepdfs: Dataset broken by latest update?

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

HuggingFaceFW's Spaces (6)

README

FineWiki Viewer

Viewer to explore the finewiki dataset

FineWeb: decanting the web for the finest text data at scale

Generate high-quality text data for LLMs using FineWeb

Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks

Evaluate multilingual models using FineTasks

Tasks Explorer

Explore and analyze experiment results

Datasets Metrics Explorer

Launch an interactive demo interface