Announcing the Common Pile and Comma v0.1
We are happy to announce the release of the Common Pile v0.1, an eight-terabyte dataset of openly licensed and public domain text. The Common Pile comprises text from 30 distinct sources, covering a wide variety of domains including research papers, code, books, educational materials, audio transcripts, government text, and more.

One of our goals in creating the Common Pile is to answer the question: Is it possible to train performant language models without using unlicensed text? We answer in the affirmative by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 7B and Llama 2 7B.

Along with model checkpoints, we also release the filtered and rebalanced dataset used to train the Comma v0.1 models. In addition, all of the code used to prepare our data is available on our GitHub repository. You can read more about our dataset and models in our paper. As indicated by the "v0.1" designation, we consider our work to be a first step on the path towards a more ethical language model ecosystem, and we have lots of future work planned. If you're interested in supporting our efforts or contributing, please open an issue on GitHub or get in touch!
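
If you'd like to try a Comma checkpoint, the sketch below shows one way to load it with the Hugging Face transformers library. Note that the repository identifier "common-pile/comma-v0.1-2t" is an assumption used for illustration; check the release page for the actual model path.

```python
# Minimal sketch: loading a Comma v0.1 checkpoint with Hugging Face transformers.
# The model identifier below is an assumption; consult the release page for the real path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation from a prompt.
inputs = tokenizer("The Common Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```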