An in-depth Investigation on the Behaviour of Measures to Quantify Reproducibility

This repository contains the source code and scripts. The accompanying data and experimental outcomes can be found on Zenodo.

Abstract

Science is now facing a so-called reproducibility crisis, i.e., when researchers repeat an experiment they struggle to get the same or comparable results. This represents a fundamental problem in any scientific discipline because reproducibility lies at the very basis of the scientific method. This paper focuses on measures to quantify reproducibility in Information Retrieval (IR) and their behaviour. Current reproducibility practices rely mainly on the comparison of averaged scores: if the reproduced score is close enough to the original one, the reproducibility experiment is deemed successful. We generate reproducibility runs in a controlled experimental setting, which allows us to control the amount of reproducibility error. We investigate the behaviour of different reproducibility measures with these synthetic runs in three different scenarios. Experimental results show that a single score is not enough to decide whether an experiment is successfully reproduced because such a score depends on the type of effectiveness measure and the performance of the original run. This highlights how challenging it can be not only to reproduce experimental results but also to quantify the amount of reproducibility.
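As a rough illustration of why comparing averaged scores alone can be misleading, the sketch below contrasts the difference of mean scores with a per-topic root mean square error. The topic names and scores are made up for this example, and RMSE is used here only as one possible per-topic comparison; it is not taken from this README as one of the measures studied in the paper.

```python
from math import sqrt
from statistics import mean


def mean_score_difference(original, reproduced):
    """Difference between the averaged scores of two runs."""
    return mean(reproduced.values()) - mean(original.values())


def per_topic_rmse(original, reproduced):
    """Root mean square error over the topics both runs share."""
    topics = sorted(original.keys() & reproduced.keys())
    return sqrt(mean((original[t] - reproduced[t]) ** 2 for t in topics))


# Hypothetical per-topic effectiveness scores (not real experimental data).
original = {"t1": 0.2, "t2": 0.8}
reproduced = {"t1": 0.8, "t2": 0.2}

print(mean_score_difference(original, reproduced))
print(per_topic_rmse(original, reproduced))
```

In this toy case the averaged scores coincide even though every topic score differs, so the averaged comparison would call the reproduction successful while the per-topic view would not.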

Instructions to reproduce the setup and experiments

Clone this repository:

git clone --recurse-submodules https://github.com/irgroup/ipm-reproducibility.git

Build anserini and the corresponding tools: sh build.sh
Install requirements: pip install -r requirements.txt
Specify the paths to the input data in make_config.py, e.g. for the TREC Washington Post Collection:

'core18': {'input': '/SPECIFY/YOUR/PATH/HERE',
           'collection': 'WashingtonPostCollection',        
           'generator': 'WashingtonPostGenerator',
           'threads': '1'}

Specify the path to your Java 11 installation and set it via os.environ['JAVA_HOME'] in make_index.py and search.py.
Make the index.config and search.config files with: python scripts/make_config.py
Build the indexes: python scripts/make_index.py
Retrieve the real runs: python scripts/search.py
Simulate runs: python scripts/simulate_run.py
Run the Jupyter notebooks in notebooks/ to perform swaps and replacements and make the heatmaps.
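The script steps above could be chained from a small driver, sketched here under some assumptions. The helper names (java_env, pipeline_commands, run_pipeline) are hypothetical and not part of this repository. The README's documented route is to set os.environ['JAVA_HOME'] inside make_index.py and search.py; whether the scripts also respect a JAVA_HOME inherited from the calling process is an assumption that should be checked.

```python
import os
import subprocess
import sys

# Script order taken from the steps listed above.
PIPELINE = [
    "scripts/make_config.py",
    "scripts/make_index.py",
    "scripts/search.py",
    "scripts/simulate_run.py",
]


def java_env(java_home, base_env=None):
    """Return a copy of the environment with JAVA_HOME set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    return env


def pipeline_commands(python=sys.executable):
    """Build one command per script, in the documented order."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(java_home):
    """Run each script in turn and stop at the first failure."""
    env = java_env(java_home)
    for cmd in pipeline_commands():
        subprocess.run(cmd, check=True, env=env)
```

Calling run_pipeline('/SPECIFY/YOUR/JAVA11/PATH') from the repository root would then run the four scripts one after another. The notebooks in notebooks/ still need to be run separately.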
