# zim_to_corpus

Scripts to extract the text from (mostly) Wikipedia pages from .zim archives.
+ The repository contains two components: `zim_to_dir`, a C++ program to extract
+ pages from zim archives; and a number of Python scripts to process its output
+ and convert the data into various formats (such as inputs for BERT, fasttext,
+ etc.).

## `zim_to_dir`

@@ -22,10 +26,69 @@ As of now, only English and Hungarian dumps are supported. However, "support"
for other languages can be added easily by modifying a very obvious line in
`zim_to_dir.cpp`.

+ ### How to acquire
+
+ The `zim_to_dir` executable can be acquired in several ways:
+ - Downloading a release from
+   [the `zim_to_corpus` repository](https://github.com/DavidNemeskey/zim_to_corpus)
+ - Building the docker image from the `Dockerfile` in the `docker` directory
+ - Compiling the code manually
+
+ ### Usage
+
+ #### The executable
+
+ The executable has two main arguments: `-i` is used to specify the input `.zim`
+ file, and `-o` the output directory. The rest of the arguments can be used to
+ tune some of the aspects of the process; use the `-h/--help` option to list
+ them. An example run:
+
+ ```
+ zim_to_dir -i wikipedia_hu_all_mini.zim -o hu_mini/ -d 2000
+ ```
+
+ One thing worth mentioning: the number of threads the program uses to parse
+ records can be increased (from the default of 4) to speed it up somewhat.
+ However, since the `zim` format is sequential, the whole task is, to a large
+ extent, I/O bound; because of this, the speed tops out at a certain number of
+ threads depending on the storage type: slow HDDs max out around 4 threads,
+ while fast SSDs can scale even up to 24.
+
+ #### Docker image
+
+ The docker image can be used in two ways:
+ 1. The `zim_to_dir` executable can be copied out of a container and used
+    as described above. For instance:
+    ```
+    $ docker create zim_to_dir
+    e892d6ff245b55e03e41384d1e7d2838babd944a8e31096b3677a05359f38aba
+    $ docker cp e892d6ff245b:/zim_to_dir .
+    $ docker rm e892d6ff245b
+    e892d6ff245b
+    ```
+ 2. The container is also runnable and will run `zim_to_dir` by default. However,
+    in order for the container to see the input and output directories, they must
+    be mounted as volumes:
+    ```
+    docker run --rm --mount type=bind,source=/home/user/data/,target=/data zim_to_dir -i /data/wikipedia_hu_all_mini.zim -o /data/hu_mini/ -d 2000
+    ```
+
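When the container is driven from a script, the mapping between host paths and container paths in the bind-mount example can be generated instead of typed by hand. The Python sketch below is one possible way to do that. The helper name `docker_zim_command` is hypothetical; the image name `zim_to_dir`, the `/data` mount target and the `-i`, `-o` and `-d` options are copied from the example above, and no other options are assumed. The function only builds the argument list and does not run anything.

```python
from pathlib import PurePosixPath


def docker_zim_command(host_dir, zim_name, out_name, extra_args=(),
                       image="zim_to_dir", target="/data"):
    """Build the `docker run` argument list for the bind-mount example.

    Hypothetical helper: it mirrors the command shown in the README and
    uses only the options that appear there.
    """
    target_path = PurePosixPath(target)
    mount = f"type=bind,source={host_dir},target={target_path}"
    return [
        "docker", "run", "--rm", "--mount", mount, image,
        # Paths are given as the container sees them, not as the host does.
        "-i", str(target_path / zim_name),
        "-o", str(target_path / out_name),
        *extra_args,
    ]


cmd = docker_zim_command("/home/user/data/", "wikipedia_hu_all_mini.zim",
                         "hu_mini/", extra_args=("-d", "2000"))
print(" ".join(cmd))
```

Once the image has been built, the resulting list could be handed to `subprocess.run(cmd, check=True)`.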
### Compiling the code

The script can be compiled by issuing the `make` command in the `src`
- directory. There are a few caveats.
+ directory. There are a few caveats, and because of this, it is easier to
+ build the docker image, which compiles the source and all its dependencies:
+
+ ```
+ cd docker
+ docker build -t zim_to_dir .
+ ```
+
+ This method has the added benefit of not polluting the system with potentially
+ unneeded libraries and packages, and it also works without `root` access.
+
+ For those who wish to compile the code manually, the general guidelines
+ follow. Check out the `Dockerfile` for the detailed list of commands.

#### Compiler

@@ -51,16 +114,17 @@ git submodule update

Aside from these, two other libraries (and their sources or `-dev` packages) are required:

- - [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to process
-   the files. Libzim can be installed from the repositories of Linux
-   distributions (`libzim-dev`), or compiled from source;
- - `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu).
+ 1. `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu);
+ 2. [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to
+    process the files. Libzim can be installed from the repositories of Linux
+    distributions (`libzim-dev`), but e.g. Ubuntu only has version 4, so
+    depending on how recent the file to process is, it might have to be
+    compiled [from source](https://github.com/openzim/libzim).

Note that some of the files in the Kiwix archives (most importantly, the
- English WP dump) require a recent version of libzim. A libzim version between
- 4.0 and 6.3 is recommended; note that the API changed in 7.0, and
- `zim_to_dir` is not yet compatible with it. The version in recent Ubuntu
- releases should work without problems.
+ English WP dump) require a recent version of libzim. Version
+ 6.3 is recommended; note that the API changed in 7.0, and
+ `zim_to_dir` is not yet compatible with it.
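
Because the usable version range is narrow, a build script could check the installed libzim before compiling. The following Python sketch illustrates the idea. It assumes the version is available as a plain dotted string (for instance from `pkg-config --modversion libzim`, if the installation ships a pkg-config file); the function names are hypothetical and are not part of `zim_to_dir`. The thresholds are the ones stated above: 6.3 is recommended and 7.0 changed the API.

```python
def parse_version(text):
    """Turn a dotted version string such as '6.3.0' into a tuple of ints.

    Raises ValueError for strings with non-numeric parts (e.g. '6.3.0-rc1').
    """
    parts = tuple(int(part) for part in text.strip().split("."))
    # Pad to at least (major, minor) so that '7' compares like '7.0'.
    return parts + (0,) * (2 - len(parts))


def libzim_status(version_text):
    """Classify a libzim version according to the notes above."""
    version = parse_version(version_text)
    if version >= (7, 0):
        return "incompatible"      # the API changed in 7.0
    if version >= (6, 3):
        return "recommended"       # 6.3.x, below 7.0
    return "possibly too old"      # may not read recent archives


for v in ("4.0.4", "6.3.0", "7.0.0"):
    print(v, libzim_status(v))
```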
### Troubleshooting