This repository implements a methodology described in the paper:
"The MiniPile Challenge for Data-Efficient Language Models" by Jean Kaddour.
This implementation is not peer-reviewed or endorsed by the author of the original MiniPile paper.
If you use this implementation or build upon it, please cite both:
- The original MiniPile paper (Kaddour, Jean. 2023):
@article{kaddour2023minipile,
title={The {MiniPile} Challenge for Data-Efficient Language Models},
author={Kaddour, Jean},
journal={arXiv preprint arXiv:2304.08442},
year={2023}
}
- This repository:
@software{koppelmann2025minicorpus,
author = {Koppelmann, Marcus},
title = {{MiniCorpus}: Investigating, reproducing, and improving MiniPile with PyTorch and HuggingFace},
year = {2025},
publisher = {GitHub},
journal = {GitHub Repository},
url = {https://github.com/MK2112/minicorpus}
}
This repository is not affiliated with or endorsed by Jean Kaddour, and it is not derived from any code written by him.
The implementation is an independent re-creation of the methodology described in the paper, with additional improvements, customizations and investigations.
Jean Kaddour’s original MiniPile dataset is not redistributed here; it is used solely as a reference for benchmarking.