# zim_to_corpus

Scripts to extract the text from (mostly) Wikipedia pages from .zim archives.
+ The repository contains two components: `zim_to_dir`, a C++ program to extract
+ pages from zim archives; and a number of Python scripts to process its output
+ and convert the data into various formats (such as inputs for BERT or
+ fasttext).

## `zim_to_dir`

@@ -22,10 +26,59 @@ As of now, only English and Hungarian dumps are supported. However, "support"
for other languages can be added easily by modifying a very obvious line in
`zim_to_dir.cpp`.

+ ### How to acquire
+
+ The `zim_to_dir` executable can be acquired in several ways:
+ - Downloading a release from
+   [the `zim_to_corpus` repository](https://github.com/DavidNemeskey/zim_to_corpus)
+ - Using the Docker image, either by downloading it from Docker Hub or by
+   building it from the `Dockerfile` in the `docker` directory
+ - Compiling the code manually
+
+ ### Usage
+
+ #### The executable
+
+ The executable has two main arguments: `-i` specifies the input `.zim`
+ file, and `-o` the output directory. The remaining arguments tune
+ other aspects of the process; use the `-h/--help` option to list
+ them. An example run:
+
+ ```
+ zim_to_dir -i wikipedia_hu_all_mini.zim -o hu_mini/ -d 2000
+ ```
+
+ The number of threads the program uses to parse records can be increased
+ to speed it up somewhat. However, since the `zim` format is inherently
+ sequential, the speedup levels off at around 4 threads (this may depend on
+ the storage).
+
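For scripted runs, the example invocation in this section can be wrapped in Python. The following is a minimal sketch, assuming `zim_to_dir` is on the `PATH`. The helper names `build_zim_to_dir_command` and `run_zim_to_dir` are hypothetical and not part of the repository; only the `-i` and `-o` options described above are set explicitly, and any other option is passed through unchanged.

```python
import subprocess
from typing import List, Sequence


def build_zim_to_dir_command(zim_file: str, output_dir: str,
                             extra_args: Sequence[str] = (),
                             executable: str = "zim_to_dir") -> List[str]:
    """Build the argument list for one zim_to_dir run (hypothetical helper)."""
    return [executable, "-i", str(zim_file), "-o", str(output_dir), *extra_args]


def run_zim_to_dir(zim_file: str, output_dir: str,
                   extra_args: Sequence[str] = ()) -> None:
    """Run zim_to_dir; raises CalledProcessError on a nonzero exit code."""
    command = build_zim_to_dir_command(zim_file, output_dir, extra_args)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Mirrors the example run; "-d 2000" is copied from it verbatim.
    print(build_zim_to_dir_command(
        "wikipedia_hu_all_mini.zim", "hu_mini/", ["-d", "2000"]))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.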
+ #### Docker image
+
+ The Docker image can be used in two ways:
+ 1. The `zim_to_dir` executable can be copied out of a container and used
+    as described above. For instance:
+    ```
+    $ docker create zim_to_dir
+    e892d6ff245b55e03e41384d1e7d2838babd944a8e31096b3677a05359f38aba
+    $ docker cp e892d6ff245b:/zim_to_dir .
+    $ docker rm e892d6ff245b
+    e892d6ff245b
+    ```
+ 2. The container is also runnable and will run `zim_to_dir` by default. However,
+    for the container to see the input and output directories, they must
+    be mounted into it:
+    ```
+    docker run --rm --mount type=bind,source=/home/user/data/,target=/data zim_to_dir -i /data/wikipedia_hu_all_mini.zim -o /data/hu_mini/ -d 2000
+    ```
+
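The path mapping in the second option is easy to get wrong, because `-i` and `-o` must refer to paths inside the container, not on the host. The sketch below assembles the same `docker run` command from a host directory and two names relative to it. The function `build_docker_command` is a hypothetical helper, and the image name `zim_to_dir` and the mount target `/data` are simply taken from the example above.

```python
import posixpath
from typing import List, Sequence


def build_docker_command(host_dir: str, zim_name: str, output_name: str,
                         extra_args: Sequence[str] = (),
                         image: str = "zim_to_dir",
                         target: str = "/data") -> List[str]:
    """Build a `docker run` argument list with host_dir bind-mounted at target.

    zim_name and output_name are relative to host_dir; they are rewritten
    to the paths under which the container sees them.
    """
    mount = f"type=bind,source={host_dir},target={target}"
    return [
        "docker", "run", "--rm", "--mount", mount, image,
        "-i", posixpath.join(target, zim_name),
        "-o", posixpath.join(target, output_name),
        *extra_args,
    ]


if __name__ == "__main__":
    # Reproduces the command from the example, one argument per list element.
    print(build_docker_command("/home/user/data/", "wikipedia_hu_all_mini.zim",
                               "hu_mini/", ["-d", "2000"]))
```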
### Compiling the code

The program can be compiled by issuing the `make` command in the `src`
- directory. There are a few caveats.
+ directory. There are a few caveats, and because of this, it is easier to
+ build the Docker image, which compiles the source and all its dependencies.
+ The general guidelines are presented here; see the `Dockerfile` for the
+ details.

#### Compiler

@@ -51,16 +104,17 @@ git submodule update

Aside from these, two other libraries (and their sources or `-dev` packages) are required:

- - [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to process
-   the files. Libzim can be installed from the repositories of Linux
-   distributions (`libzim-dev`), or compiled from source;
- - `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu).
+ 1. `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu);
+ 2. [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to
+    process the files. Libzim can be installed from the repositories of Linux
+    distributions (`libzim-dev`), but e.g. Ubuntu only has version 4, so
+    depending on how recent the file to process is, it might have to be
+    compiled [from source](https://github.com/openzim/libzim).

Note that some of the files in the Kiwix archives (most importantly, the
- English WP dump) require a recent version of libzim. A libzim version between
- 4.0 and 6.3 is recommended; note that the API changed in 7.0, and
- `zim_to_dir` is not yet compatible with it. The version in recent Ubuntu
- releases should work without problems.
+ English WP dump) require a recent version of libzim. Version
+ 6.3 is recommended; note that the API changed in 7.0, and
+ `zim_to_dir` is not yet compatible with it.
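The version constraint can be checked in a build or setup script before compiling. This is a minimal sketch, assuming the installed libzim version is available as a plain dotted string such as `6.3.0`. Both function names are hypothetical. The check only encodes the statement that the 7.0 API is not supported; it does not tell whether a given version is recent enough for a particular `.zim` file.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '6.3.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_supported_libzim(version: str) -> bool:
    """Return True if the major version is below 7, where the API changed."""
    return parse_version(version)[0] < 7


if __name__ == "__main__":
    for candidate in ("4.0", "6.3.0", "7.0.0"):
        print(candidate, is_supported_libzim(candidate))
```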

### Troubleshooting
