Fineweb-zhtw

Installation

Clone the repository together with its submodules:

git clone --recurse-submodules https://github.com/mtkresearch/fineweb-zhtw.git

Install the requirements:

pip install -r requirements.txt

To Run

Prepare data for the pipeline demonstration (this downloads 2 WARC files):

bash scripts/get_data_example.sh

Run the cleaning pipeline:

bash scripts/map_warc_to_zhtw_text.sh

Results are written under data/parsed/{DUMP}/5_zhtwplus/output, where {DUMP} is the name of the processed dump.
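To check what the run produced, the output location above can be scanned with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `find_output_files` is hypothetical, it assumes the command was run from the repository root so that `data/parsed` exists, and it treats every immediate subdirectory of `data/parsed` as one {DUMP}. The README does not state the output file format, so the sketch only lists file names and sizes and does not parse anything.

```python
from pathlib import Path


def find_output_files(parsed_root: Path) -> dict[str, list[Path]]:
    """Map each dump name to the sorted files under its 5_zhtwplus/output directory.

    Hypothetical helper for illustration; the directory layout is taken from
    the README, everything else is an assumption.
    """
    results: dict[str, list[Path]] = {}
    if not parsed_root.is_dir():
        # Nothing has been produced yet (or we are in the wrong directory).
        return results
    # Each immediate child of parsed_root is treated as one {DUMP} directory.
    for output_dir in sorted(parsed_root.glob("*/5_zhtwplus/output")):
        if not output_dir.is_dir():
            continue
        dump_name = output_dir.parent.parent.name
        results[dump_name] = sorted(
            p for p in output_dir.rglob("*") if p.is_file()
        )
    return results


if __name__ == "__main__":
    for dump, files in find_output_files(Path("data/parsed")).items():
        print(f"{dump}: {len(files)} file(s)")
        for path in files:
            print(f"  {path} ({path.stat().st_size} bytes)")
```

An empty result usually means the cleaning script has not finished yet or the script was started from a directory other than the repository root.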

If you find our work useful, please cite:

@misc{lin2024finewebzhtwscalablecurationtraditional,
      title={FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web}, 
      author={Cheng-Wei Lin and Wan-Hsuan Hsieh and Kai-Xin Guan and Chan-Jan Hsu and Chia-Chen Kuo and Chuan-Lin Lai and Chung-Wei Chung and Ming-Jen Wang and Da-Shan Shiu},
      year={2024},
      eprint={2411.16387},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.16387}, 
}
