Commit aa7e7ed

Added the docker details to README.md.
1 parent f7b9085 commit aa7e7ed

1 file changed (+63 -9 lines)

README.md

@@ -1,6 +1,10 @@
 # zim_to_corpus
 
 Scripts to extract the text from (mostly) Wikipedia pages from .zim archives.
+The repository contains two components: `zim_to_dir`, a C++ program to extract
+pages from zim archives; and a number of Python scripts to process its output
+and convert the data into various formats (such as inputs for BERT, fasttext,
+etc).
 
 ## `zim_to_dir`
 
@@ -22,10 +26,59 @@ As of now, only English and Hungarian dumps are supported. However, "support"
 for other languages can be added easily by modifying a very obvious line in
 `zim_to_dir.cpp`.
 
+### How to acquire
+
+The `zim_to_dir` executable can be acquired in several ways:
+- Downloading a release from
+  [the `zim_to_corpus` repository](https://github.com/DavidNemeskey/zim_to_corpus)
+- Using the docker image, either by downloading it from the Docker Hub or
+  building it from the `Dockerfile` in the `docker` directory
+- Compiling the code manually
+
+### Usage
+
+#### The executable
+
+The executable has two main arguments: `-i` is used to specify the input `.zim`
+file, and `-o` the output directory. The rest of the arguments can be used to
+tune some of the aspects of the process; use the `-h/--help` option to list
+them. An example run:
+
+```
+zim_to_dir -i wikipedia_hu_all_mini.zim -o hu_mini/ -d 2000
+```
+
+One thing worth mentioning: the number of threads the program uses to parse
+records can be increased to speed it up somewhat. However, since the `zim`
+format is inherently sequential, the speed tops out at around 4 threads (might
+depend on the storage).
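
The example run above can also be assembled from a script, which may be convenient when several archives have to be processed. The following is a minimal sketch, not part of the repository: the helper name `zim_to_dir_command` is hypothetical, only `-i` and `-o` are taken from the README text, and any further options are passed through unchanged (consult `zim_to_dir --help` for what they mean).

```python
import os
import shlex


def zim_to_dir_command(zim_file, output_dir, extra_args=()):
    """Build the argument list for a single zim_to_dir run.

    Hypothetical helper: only the -i and -o options are documented in the
    README; extra_args is forwarded as-is without any validation.
    """
    return [
        "zim_to_dir",
        "-i", os.fspath(zim_file),
        "-o", os.fspath(output_dir),
        *extra_args,
    ]


# Rebuild the example run from the README.
cmd = zim_to_dir_command("wikipedia_hu_all_mini.zim", "hu_mini/", ["-d", "2000"])
print(shlex.join(cmd))

# To actually execute it (requires zim_to_dir on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string avoids shell quoting problems with paths that contain spaces.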
+
+#### Docker image
+
+The docker image can be used in two ways:
+1. The `zim_to_dir` executable can be copied out of a container and used
+   as described above. For instance:
+   ```
+   $ docker create zim_to_dir
+   e892d6ff245b55e03e41384d1e7d2838babd944a8e31096b3677a05359f38aba
+   $ docker cp e892d6ff245b:/zim_to_dir .
+   $ docker rm e892d6ff245b
+   e892d6ff245b
+   ```
+2. The container is also runnable and will run `zim_to_dir` by default. However,
+   in order for the container to see the input and output directories, they must
+   be mounted as volumes:
+   ```
+   docker run --rm --mount type=bind,source=/home/user/data/,target=/data zim_to_dir -i /data/wikipedia_hu_all_mini.zim -o /data/hu_mini/ -d 2000
+   ```
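
The second option requires the input and output paths to be given as the container sees them, not as they exist on the host. The sketch below illustrates that mapping under stated assumptions: the helper `docker_run_command` and its parameters are hypothetical, the image name `zim_to_dir` and the `/data` mount point are taken from the example above, and both the input file and the output directory are assumed to live under the same host directory.

```python
import shlex
from pathlib import PurePosixPath


def docker_run_command(host_dir, zim_name, out_name, image="zim_to_dir",
                       mount_point="/data", extra_args=()):
    """Build a `docker run` command that bind-mounts host_dir.

    Hypothetical helper: zim_name and out_name are paths relative to
    host_dir; they are rewritten relative to mount_point, because that is
    where the container will find them.
    """
    mount = f"type=bind,source={host_dir},target={mount_point}"
    container_dir = PurePosixPath(mount_point)
    return [
        "docker", "run", "--rm", "--mount", mount, image,
        "-i", str(container_dir / zim_name),
        "-o", str(container_dir / out_name),
        *extra_args,
    ]


docker_cmd = docker_run_command(
    "/home/user/data/", "wikipedia_hu_all_mini.zim", "hu_mini",
    extra_args=["-d", "2000"],
)
print(shlex.join(docker_cmd))
```

Everything after the image name is handed to `zim_to_dir` inside the container, which is why those paths start with the mount point rather than with the host directory.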
+
 ### Compiling the code
 
 The script can be compiled with issuing the `make` command in the `src`
-directory. There are a few caveats.
+directory. There are a few caveats, and because of this, it is easier to
+build the docker image, which compiles the source and all its dependencies.
+Here we present the general guidelines; check out the `Dockerfile` for the
+details.
 
 #### Compiler
 
@@ -51,16 +104,17 @@ git submodule update
 
 Aside from these, two other libraries (and their sources or `-dev` packages) are required:
 
-- [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to process
-  the files. Libzim can be installed from the repositories of Linux
-  distributions (`libzim-dev`), or compiled from source;
-- `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu).
+1. `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu);
+2. [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to
+   process the files. Libzim can be installed from the repositories of Linux
+   distributions (`libzim-dev`), but e.g. Ubuntu only has version 4, so
+   depending on how recent the file to process is, it might have to be
+   compiled [from source](https://github.com/openzim/libzim).
 
 Note that some of the files in the Kiwix archives (most importantly, the
-English WP dump) require a recent version of libzim. A libzim version between
-4.0 and 6.3 is recommended; note that the API changed in 7.0, and
-`zim_to_dir` is not yet compatible with it. The version in recent Ubuntu
-releases should work without problems.
+English WP dump) require a fresh version of libzim. libzim version
+6.3 is recommended; note that the API changed in 7.0, and
+`zim_to_dir` is not yet compatible with it.
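
The version constraint above can be turned into a quick pre-build check. This is only a sketch: the function name `libzim_is_compatible` is hypothetical, the version is assumed to be a dotted string such as `6.3.0`, and the check encodes nothing beyond the statement that the API changed in 7.0. It does not verify that a given version is new enough for a particular `.zim` file.

```python
def libzim_is_compatible(version):
    """Return True if the libzim version predates the 7.0 API change.

    Hypothetical helper; assumes a dotted version string whose first
    component is the major version number.
    """
    major = int(version.split(".")[0])
    return major < 7


for candidate in ("4.0.4", "6.3.0", "7.0.0"):
    status = "ok" if libzim_is_compatible(candidate) else "API too new"
    print(f"libzim {candidate}: {status}")
```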
 
 ### Troubleshooting
 