---
title: README
emoji: π
colorFrom: purple
colorTo: yellow
sdk: static
pinned: false
---
# The Common Pile
We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models.
So far, we have released:
- [The Common Pile v0.1](https://huggingface.co/collections/common-pile/common-pile-v01-68307d37df48e36f02717f21), an 8 TB dataset of text from over 30 diverse sources
- Our paper: [The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text](https://huggingface.co/papers/2506.05209)
- [Comma v0.1-1T](https://huggingface.co/common-pile/comma-v0.1-1t) and [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t), 7B parameter LLMs trained on text from the Common Pile v0.1
- The [training dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset) used to train the Comma v0.1 models
- Our [code](https://github.com/r-three/common-pile/) for collecting data from each source
If you're interested in contributing, please [open an issue on GitHub](https://github.com/r-three/common-pile/issues/new)!