Commit e369ae4

Merge pull request #9 from DavidNemeskey/docker
Compilation via a Docker image
2 parents 52d880a + 3f8a1c1 commit e369ae4

4 files changed: +114 -11 lines changed
README.md

+73 -9
@@ -1,6 +1,10 @@
 # zim_to_corpus

 Scripts to extract the text from (mostly) Wikipedia pages from .zim archives.
+The repository contains two components: `zim_to_dir`, a C++ program to extract
+pages from zim archives; and a number of Python scripts to process its output
+and convert the data into various formats (such as inputs for BERT, fasttext,
+etc).

 ## `zim_to_dir`

@@ -22,10 +26,69 @@ As of now, only English and Hungarian dumps are supported. However, "support"
 for other languages can be added easily by modifying a very obvious line in
 `zim_to_dir.cpp`.

+### How to acquire
+
+The `zim_to_dir` executable can be acquired in several ways:
+- Downloading a release from
+[the `zim_to_corpus` repository](https://github.com/DavidNemeskey/zim_to_corpus)
+- Building the docker image from the `Dockerfile` in the `docker` directory
+- Compiling the code manually
+
+### Usage
+
+#### The executable
+
+The executable has two main arguments: `-i` is used to specify the input `.zim`
+file, and `-o` the output directory. The rest of the arguments can be used to
+tune some of the aspects of the process; use the `-h/--help` option to list
+them. An example run:
+
+```
+zim_to_dir -i wikipedia_hu_all_mini.zim -o hu_mini/ -d 2000
+```
+
+One thing worth mentioning: the number of threads the program uses to parse
+records can be increased (from 4) to speed it up somewhat. However, since the
+`zim` format is sequential, the whole task is, to a large extent, I/O bound;
+because of this, the speed tops out at a certain number of threads depending on
+the storage type: slow HDDs max out around 4 threads, while fast SSDs can scale
+even up to 24.
+
+#### Docker image
+
+The docker image can be used in two ways:
+1. The `zim_to_dir` executable can be copied out of a container and used
+as described above. For instance:
+```
+$ docker create zim_to_dir
+e892d6ff245b55e03e41384d1e7d2838babd944a8e31096b3677a05359f38aba
+$ docker cp e892d6ff245b:/zim_to_dir .
+$ docker rm e892d6ff245b
+e892d6ff245b
+```
+2. The container is also runnable and will run `zim_to_dir` by default. However,
+in order for the container to see the input and output directories, they must
+be mounted as volumes:
+```
+docker run --rm --mount type=bind,source=/home/user/data/,target=/data zim_to_dir -i /data/wikipedia_hu_all_mini.zim -o /data/hu_mini/ -d 2000
+```
+
 ### Compiling the code

 The script can be compiled by issuing the `make` command in the `src`
-directory. There are a few caveats.
+directory. There are a few caveats, and because of this, it is easier to
+build the docker image, which compiles the source and all its dependencies:
+
+```
+cd docker
+docker build -t zim_to_dir .
+```
+
+This method has the added benefit of not polluting the system with potentially
+unneeded libraries and packages, and it also works without `root` access.
+
+For those who wish to compile the code manually, here we present the general
+guidelines. Check out the `Dockerfile` for the detailed list of commands.

 #### Compiler

@@ -51,16 +114,17 @@ git submodule update

 Aside from these, two other libraries (and their sources or `-dev` packages) are required:

-- [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to process
-the files. Libzim can be installed from the repositories of Linux
-distributions (`libzim-dev`), or compiled from source;
-- `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu).
+1. `zlib`, for compression (e.g. `zlib1g-dev` in Ubuntu);
+2. [`libzim`](https://github.com/openzim/libzim) (also called Zimlib) to
+process the files. Libzim can be installed from the repositories of Linux
+distributions (`libzim-dev`), but e.g. Ubuntu only has version 4, so
+depending on how recent the file to process is, it might have to be
+compiled [from source](https://github.com/openzim/libzim).

 Note that some of the files in the Kiwix archives (most importantly, the
-English WP dump) require a recent version of libzim. A libzim version between
-4.0 and 6.3 is recommended; note that the API changed in 7.0, and
-`zim_to_dir` is not yet compatible with it. The version in recent Ubuntu
-releases should work without problems.
+English WP dump) require a fresh version of libzim. libzim version
+6.3 is recommended; note that the API changed in 7.0, and
+`zim_to_dir` is not yet compatible with it.

 ### Troubleshooting
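
The two invocation styles documented in the README hunk above (the plain executable and the Docker container with a bind mount) can also be assembled from a script. The following is a minimal Python sketch and not part of the commit: the helper names `zim_to_dir_args`, `native_command` and `docker_command`, as well as the `d_value` parameter, are hypothetical. It uses only flags that appear in this diff (`-i`, `-o`, `-d`, and `-T` for `--threads`), and it assumes the image was tagged `zim_to_dir` and that input and output live under a single host directory mounted at `/data`.

```python
import shlex


def zim_to_dir_args(zim_file, out_dir, d_value=None, threads=None):
    """Build zim_to_dir arguments from flags that appear in this commit."""
    args = ["-i", str(zim_file), "-o", str(out_dir)]
    if d_value is not None:
        # Passed through unchanged; see `zim_to_dir --help` for its meaning.
        args += ["-d", str(d_value)]
    if threads is not None:
        # -T is the short form of --threads (default 4 after this commit).
        args += ["-T", str(threads)]
    return args


def native_command(zim_file, out_dir, **options):
    """Command line for a locally available zim_to_dir executable."""
    return ["zim_to_dir", *zim_to_dir_args(zim_file, out_dir, **options)]


def docker_command(host_dir, zim_name, out_name, image="zim_to_dir", **options):
    """Command line for the container, with host_dir bind-mounted at /data."""
    mount = f"type=bind,source={host_dir},target=/data"
    inner = zim_to_dir_args(f"/data/{zim_name}", f"/data/{out_name}", **options)
    return ["docker", "run", "--rm", "--mount", mount, image, *inner]


if __name__ == "__main__":
    print(shlex.join(native_command(
        "wikipedia_hu_all_mini.zim", "hu_mini/", d_value=2000, threads=8)))
    print(shlex.join(docker_command(
        "/home/user/data/", "wikipedia_hu_all_mini.zim", "hu_mini/", d_value=2000)))
```

Either list can be handed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.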

docker/Dockerfile

+39
@@ -0,0 +1,39 @@
+# Using the oldest working release so that we are compatible with as many
+# releases as possible (compiling on a newer release would make us end up
+# with glibc symbols that don't exist on older releases)
+FROM ubuntu:18.10
+
+# This is only needed for unsupported releases, such as 18.10
+RUN cat /etc/apt/sources.list | sed -e "s/archive/old-releases/" -e "s/security/old-releases/" > xxx && mv xxx /etc/apt/sources.list
+# Install git so that we can obtain libzim
+RUN apt update
+
+# To make sure tzdata (or other packages) don't ask questions
+# See https://serverfault.com/questions/949991/
+ARG DEBIAN_FRONTEND=noninteractive
+ENV TZ=Europe/Budapest
+
+RUN apt install -y git
+# Build tools
+RUN apt install -y build-essential meson pkg-config
+# libzim dependencies
+RUN apt install -y liblzma-dev libicu-dev libzstd-dev uuid-dev
+# zim_to_dir dependencies
+RUN apt install -y zlib1g-dev
+
+# Clone the repositories we need
+RUN git clone --depth 1 --branch 6.3.2 https://github.com/openzim/libzim.git
+RUN git clone --depth 1 --branch docker --recursive https://github.com/DavidNemeskey/zim_to_corpus.git
+
+# Compile and install libzim
+WORKDIR "/libzim"
+RUN meson . build -Dwith_xapian=false --default-library=static
+RUN ninja -C build
+RUN ninja -C build install
+
+# Compile zim_to_dir
+WORKDIR "/zim_to_corpus/src"
+RUN make
+RUN cp zim_to_dir /
+
+ENTRYPOINT ["/zim_to_dir"]
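
The `sed` call near the top of the Dockerfile points `apt` at the archive of end-of-life releases. As a rough illustration of what the two `s/…/…/` expressions do, here is a small Python sketch. It is not part of the commit: the function name `rewrite_sources_line` is hypothetical, and the sample lines are assumed to resemble the stock `sources.list` of an Ubuntu 18.10 ("cosmic") image. Without the `g` flag, `sed` replaces only the first match on each line, which `str.replace(..., 1)` mimics for these literal patterns.

```python
def rewrite_sources_line(line):
    """Mimic: sed -e "s/archive/old-releases/" -e "s/security/old-releases/"."""
    # Each expression replaces only the first occurrence on the line.
    line = line.replace("archive", "old-releases", 1)
    line = line.replace("security", "old-releases", 1)
    return line


# Illustrative input lines (assumed, not taken from the commit).
SAMPLE = [
    "deb http://archive.ubuntu.com/ubuntu/ cosmic main restricted",
    "deb http://security.ubuntu.com/ubuntu/ cosmic-security main restricted",
]

if __name__ == "__main__":
    for original in SAMPLE:
        print(rewrite_sources_line(original))
```

Because only the first match is replaced, the host name `security.ubuntu.com` is rewritten while the suite name `cosmic-security` later on the same line is left alone.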

src/Makefile

+1 -1
@@ -9,7 +9,7 @@ CPPFLAGS=-g -O3 -I cxxopts/include/ -I zstr/src/ -I spdlog/include -std=c++17
 # Static libstdc++ can be removed if the target machine has the same or
 # newer GLIBCXX version
 LDFLAGS=-g -static-libstdc++
-LDLIBS=-lzim $(if $(findstring 8,$(CXX_VER)), -lstdc++fs) -lz -pthread
+LDLIBS=$(if $(findstring 8,$(CXX_VER)), -lstdc++fs) -l:libzim.a -l:liblzma.a -l:libzstd.a -lz -pthread

 SRCS=zim_to_dir.cpp
 OBJS=$(subst .cpp,.o,$(SRCS))
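
The new `LDLIBS` line combines a conditional with explicitly named static archives. The Python sketch below models how GNU make would expand it. It is an illustration only: the helper names `findstring` and `ldlibs` are hypothetical, and since the definition of `CXX_VER` lies outside this hunk, the sketch assumes it holds a compiler version string such as `8.3.0`.

```python
def findstring(needle, text):
    """Model of GNU make's $(findstring find,in): needle if found, else ''."""
    return needle if needle in text else ""


def ldlibs(cxx_ver):
    """Expand the LDLIBS line of this commit for a given CXX_VER value."""
    libs = []
    # $(if cond,then) yields the then-part only when cond is non-empty.
    if findstring("8", cxx_ver):
        libs.append("-lstdc++fs")
    # -l:NAME asks GNU ld for exactly that file name, so the static
    # archives are linked instead of the shared libraries.
    libs += ["-l:libzim.a", "-l:liblzma.a", "-l:libzstd.a", "-lz", "-pthread"]
    return " ".join(libs)


if __name__ == "__main__":
    print(ldlibs("8.3.0"))
    print(ldlibs("9.4.0"))
```

Note that `findstring` matches the digit anywhere in the string, so under this model any version containing an `8` (not only major version 8) would pull in `-lstdc++fs`.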

src/zim_to_dir.cpp

+1 -1
@@ -81,7 +81,7 @@ class ArgumentParser {
 ("Z,zeroes", "the number of zeroes in the output files' names.",
 cxxopts::value<size_t>()->default_value("4"))
 ("T,threads", "the number of parallel threads to use.",
-cxxopts::value<size_t>()->default_value("10"))
+cxxopts::value<size_t>()->default_value("4"))
 ("L,log-level", "the logging level. One of "
 "{critical, error, warn, info, debug, trace}.",
 cxxopts::value<std::string>()->default_value("info"))
