chapter 2: add new section on Environments #14

Merged
merged 1 commit into from
Apr 10, 2025
11 changes: 11 additions & 0 deletions resources/sourcecode/python.dockerfile
@@ -0,0 +1,11 @@
FROM buildpack-deps:bookworm
# ...
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        libbluetooth-dev \
        tk-dev \
        uuid-dev \
    ; \
    rm -rf /var/lib/apt/lists/*
# ...
110 changes: 110 additions & 0 deletions src/thesis/2-reproducibility.typ
@@ -1462,6 +1462,116 @@ and at any point in the past or future.
environments or machines.
]

=== Computational Environments <ch2-environments>

Environments where a build or computational process occurs can be broadly
categorised into two types: hardware and software environments
#cite(<strangfeld_2024>,form:"normal", supplement: "p. 8, section 2.1"). While
software environments can be managed to a high degree of consistency, achieving
reproducibility across different hardware, particularly different #gls("CPU")
architectures #eg[`x86`, `ARM`], is essentially impossible. Each architecture
handles instruction execution, memory management, and floating-point
calculations in its own way, and even small variations in these processes can
lead to differences in output. Consequently, even with identical software,
builds on different #gls("CPU") architectures will generally produce different
results.
When something is said to be reproducible, it typically means reproducible
within the same #gls("CPU") architecture. Therefore, this section will focus
exclusively on the reproducibility challenges within software environments.
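
One way such small variations can arise is through the order in which
floating-point operations are evaluated. The following minimal sketch is our
own illustration, not taken from the cited work. It shows that floating-point
addition is not associative, so two mathematically equivalent evaluation
orders can produce different bits.

```python
# Illustrative only: floating-point addition is not associative, so the
# order in which a sum is evaluated can change the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two values differ in the last bit of IEEE 754 double precision.
print(left_to_right == right_to_left)
print(left_to_right, right_to_left)
```

The difference is tiny, yet it is enough to break bit-for-bit comparisons of
outputs.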

A software environment is composed of the #gls("OS"), along with the set of
tools, libraries, and dependencies required to build or run a specific
application. Any change in these components can influence the outcome of a
software build or execution. For example, a minor update to a library could
alter the behaviour of the software, producing different outcomes across
executions or, more importantly, affecting its security.
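
To make the notion of a changed environment concrete, the sketch below
computes a fingerprint of a set of installed packages. The function
`environment_fingerprint` and the package versions are hypothetical and serve
only as an illustration. A real environment would also have to account for
the #gls("OS"), configuration files, and other inputs.

```python
import hashlib


def environment_fingerprint(packages: dict[str, str]) -> str:
    """Return a SHA-256 digest of a package-name -> version mapping."""
    # Sort entries so the digest does not depend on insertion order.
    lines = [f"{name}=={version}" for name, version in sorted(packages.items())]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Hypothetical package sets, used only for illustration.
before = {"openssl": "3.0.11", "zlib": "1.2.13"}
after = {"openssl": "3.0.13", "zlib": "1.2.13"}

# A single minor update yields a different fingerprint.
print(environment_fingerprint(before) == environment_fingerprint(after))
```

Two environments with the same fingerprint list the same package versions. Any
update, however small, changes it.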

To enhance reproducibility, it is critical to ensure that the software
environment remains stable and unaltered during both the build and execution
phases. Unfortunately, conventional #glspl("OS"), such as Linux distributions,
Microsoft Windows, and macOS, are #emph[mutable] by default. This mutability is
primarily facilitated through package managers, which enable users to easily
modify their environments by installing or upgrading software packages. As a
result, uncontrolled changes to dependencies may lead to inconsistencies in
software behaviour or affect security, undermining reproducibility.

To mitigate these issues, #emph[immutable] environments have gained popularity.
Tools such as Docker #cite(<docker>,form:"normal") provide mechanisms to
encapsulate software and its dependencies in containers, thus creating
environments that remain unchanged after creation. Once a container is built, it
can be shared and executed across different systems with the guarantee that it
will function identically, given the same environment. This characteristic makes
containers highly suitable for distributing software.

Despite the advantages of immutability, it does not guarantee reproducibility.
For instance, container images hosted on platforms like Docker Hub
#cite(<dockerhub>,form:"normal"), including popular language interpreters
#eg[Python, NodeJS, PHP], may not be reproducible due to non-deterministic
steps during the image creation (at build-time). A specific example can be found
in #ref(<python-dockerfile>), which runs `apt-get update` at line 4 as part of
the image build process. Since `apt-get update` fetches the latest version of
the package index at build time, rebuilding the image later is unlikely to
yield the same result, which compromises Docker's build-time reproducibility.

#figure(
  sourcefile(
    lang: "dockerfile",
    read("../../resources/sourcecode/python.dockerfile"),
  ),
  caption: [
    An excerpt of the Python Dockerfile
    #cite(<python-dockerfile-repository>,form:"normal") used to build the
    #emph[official] Python images.
  ],
) <python-dockerfile>
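
The kind of issue visible in @python-dockerfile can be spotted mechanically.
The sketch below is a rough heuristic of our own, not an existing tool. The
function `find_nondeterministic_lines` is hypothetical and checks only two
patterns: a base image that is not pinned by digest, and a package index
refresh at build time. It is not an exhaustive analysis.

```python
import re

# The excerpt from the figure, reproduced for illustration.
DOCKERFILE = r"""FROM buildpack-deps:bookworm
# ...
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        libbluetooth-dev \
        tk-dev \
        uuid-dev \
    ; \
    rm -rf /var/lib/apt/lists/*
# ...
"""


def find_nondeterministic_lines(dockerfile: str) -> list[tuple[int, str]]:
    """Return (line number, reason) pairs for two well-known patterns."""
    findings = []
    for number, line in enumerate(dockerfile.splitlines(), start=1):
        stripped = line.strip()
        # A base image referenced by tag only can point to different content over time.
        if stripped.upper().startswith("FROM ") and "@sha256:" not in stripped:
            findings.append((number, "base image not pinned by digest"))
        # Refreshing the package index fetches whatever is current at build time.
        if re.search(r"\bapt-get\s+update\b", stripped):
            findings.append((number, "package index fetched at build time"))
    return findings


for number, reason in find_nondeterministic_lines(DOCKERFILE):
    print(f"line {number}: {reason}")
```

On this excerpt, the heuristic reports line 1 (the `FROM` line) and line 4 (the
`apt-get update` call).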

Docker images, once built, are immutable. While Docker does not guarantee
build-time reproducibility, it has the potential to ensure run-time
reproducibility, reflecting Docker's philosophy of
#emph["build once, use everywhere"]. This distinction between build-time
reproducibility (@def-reproducibility-build-time) and run-time reproducibility
(@def-reproducibility-run-time) is key. Docker does not ensure that an image
will always be built identically, often because of the base image used
(declared in the `FROM` instruction of a `Dockerfile`), as seen in
@python-dockerfile. Although building a reproducible image with Docker is
technically possible, it would require additional effort, external tools, and a
more complex setup. Therefore, we assume that build-time reproducibility is not
guaranteed, but the immutability of the environment significantly enhances the
potential for reproducibility at run-time.
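
A simple way to test build-time reproducibility is to run the same build
several times and compare the digests of the resulting artefacts. The sketch
below is a toy model under stated assumptions: both build functions are
hypothetical stand-ins for a real image build, and a counter stands in for the
wall clock so that the example stays deterministic.

```python
import hashlib
import itertools
from typing import Callable

# Stand-in for the wall clock: every call returns a new value.
_fake_clock = itertools.count(start=1)


def is_build_reproducible(build: Callable[[], bytes], runs: int = 2) -> bool:
    """Run `build` several times and check that all outputs are bit-identical."""
    digests = {hashlib.sha256(build()).hexdigest() for _ in range(runs)}
    return len(digests) == 1


def deterministic_build() -> bytes:
    # The output depends only on fixed inputs.
    return b"app-1.0\n"


def timestamped_build() -> bytes:
    # Embedding the build time makes every output unique.
    return f"app-1.0 built at {next(_fake_clock)}\n".encode("utf-8")


print(is_build_reproducible(deterministic_build))
print(is_build_reproducible(timestamped_build))
```

Once an artefact has been built, its digest no longer changes. This is the
property that immutable images rely on at run-time.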

#info-box(kind: "important")[
Docker is a platform for building, shipping, and running applications in
containers, with Docker Hub #cite(<dockerhub>,form:"normal") providing a large
repository of container images, which has significantly contributed to
Docker's popularity. Among these are the #emph[Docker "official" images]
#cite(<dockerofficialimages>,form:"normal"), which are curated and reviewed by
the Docker community. These images offer standard environments for popular
software and adhere to some quality standards.

However, the term "official" can be misleading. One might assume that these
images are maintained by the original software's developers, but this is not
always the case. For example, the PHP Docker image
#cite(<dockerhubphpimage>,form:"normal") is not maintained by the core PHP
development team. This means updates or fixes may not be as prompt or
specific as if the software’s developers maintained the image.

While Docker vets these images for quality, responsibility for the contents
rests with the maintainers. Users should be aware that official images are not
immune to security risks or outdated software, and reviewing each image's
documentation for known issues is advisable.

In summary, Docker "official" images are trusted but may not be maintained by
the original software’s maintainers. Developers must use them with caution and
full awareness, particularly in production environments, and ensure that the
images meet their security and functionality requirements.
]

Package managers are a critical aspect of the reproducibility puzzle since they
can manage the state of a computational environment. Without proper control over
how software and its dependencies are resolved and installed, achieving
consistent and reproducible builds becomes difficult.
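
The effect of uncontrolled dependency resolution can be sketched with a toy
resolver. Everything below is hypothetical: the package `libfoo`, the index
snapshots, and both functions are illustrative and do not model any particular
package manager.

```python
def resolve_latest(index: dict[str, list[str]], name: str) -> str:
    """Pick the newest version available in the index (unpinned install)."""
    return max(index[name], key=lambda v: tuple(int(part) for part in v.split(".")))


def resolve_locked(lock: dict[str, str], name: str) -> str:
    """Pick exactly the version recorded in a lock file."""
    return lock[name]


# Hypothetical snapshots of a package index at two points in time.
index_monday = {"libfoo": ["1.0.0", "1.1.0"]}
index_friday = {"libfoo": ["1.0.0", "1.1.0", "1.2.0"]}
lock = {"libfoo": "1.1.0"}

print(resolve_latest(index_monday, "libfoo"))  # 1.1.0
print(resolve_latest(index_friday, "libfoo"))  # 1.2.0
print(resolve_locked(lock, "libfoo"))          # 1.1.0, whatever the index contains
```

The unpinned resolution depends on when it runs, whereas the locked one depends
only on the recorded version.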

=== Sources Of Non-Determinism

In this section we will explore the sources of non-determinism in software
30 changes: 30 additions & 0 deletions src/thesis/literature.bib
@@ -1067,3 +1067,33 @@ @article{4785860
keywords = {Integrated circuits;Computers;Silicon;Films;Heating;Microwave amplifiers;Data mining},
doi = {10.1109/N-SSC.2006.4785860}
}

@misc{python-dockerfile-repository,
title = {Python 3.12 Dockerfile},
author = {{docker-library project}},
year = 2024,
url = {https://github.com/docker-library/python/blame/31bbb37b797bd5521d6622c6d54052d6d0ede585/3.12/bookworm/Dockerfile}
}

@misc{dockerofficialimages,
title = {What are official images},
author = {{Docker, Inc.}},
year = 2024,
url = {https://github.com/docker-library/official-images/blob/6b4803e65a2c56f15b91f8a11bd90f0bcb756c1c/README.md#what-are-official-images},
}

@misc{dockerhubphpimage,
title = {Docker PHP images},
author = {{Docker, Inc.}},
year = 2013,
url = {https://hub.docker.com/_/php/}
}

@article{strangfeld_2024,
author = {Strangfeld, Marvin},
title = {{Reproducibility of Computational Environments for Software Development}},
school = {RWTH Aachen University},
year = 2024,
month = oct,
doi = {10.5281/zenodo.13843189},
}