CITATION.md

Academic Attribution

This repository implements a methodology described in the paper:
"The MiniPile Challenge for Data-Efficient Language Models" by Jean Kaddour.
This implementation is not peer-reviewed or endorsed by the author of the original MiniPile paper.

How to Cite This Work

If you use this implementation or build upon it, please cite both:

  1. The original MiniPile paper (Kaddour, Jean. 2023):
@article{kaddour2023minipile,
  title={The {MiniPile} Challenge for Data-Efficient Language Models},
  author={Kaddour, Jean},
  journal={arXiv preprint arXiv:2304.08442},
  year={2023}
}
  2. This repository:
@software{koppelmann2025minicorpus,  
  author = {Koppelmann, Marcus},  
  title = {{MiniCorpus}: Investigating, reproducing, and improving MiniPile with PyTorch and HuggingFace},  
  year = {2025},
  publisher = {GitHub},  
  journal = {GitHub Repository},  
  url = {https://github.com/MK2112/minicorpus}
}

Relationship to Jean Kaddour’s Work

This repository is not affiliated with, endorsed by, or derived from code implementations by Jean Kaddour.
The implementation is an independent re-creation of the methodology described in the paper, with additional improvements, customizations and investigations.
Jean Kaddour’s original MiniPile dataset is not redistributed here; it is used solely as a reference for benchmarking.