Rename dataset and output folder in Prior template #67

Merged · 6 commits · Oct 24, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -209,9 +209,9 @@ asreview makita template multimodel --classifiers logistic nb --feature_extracto

command: `prior`

-The prior template evaluates how large amounts of prior knowledge might affect simulation performance. It processes two types of data in the data folder: labeled dataset(s) to be simulated and labeled dataset(s) to be used as prior knowledge. The filename(s) of the dataset(s) containing the prior knowledge should use the naming prefix `prior_[dataset_name]`.
+The prior template evaluates how a set of custom prior knowledge might affect simulation performance. It processes two types of data in the data folder: labeled dataset(s) to be simulated and labeled dataset(s) to be used as prior knowledge. The filename(s) of the dataset(s) containing the custom prior knowledge should use the naming prefix `prior_[dataset_name]`.

-The template runs two simulations: the first simulation uses all records from the `prior_` dataset(s) as prior knowledge, and the second uses a 1+1 randomly chosen set of prior knowledge from the non-prior knowledge dataset. Both runs simulate performance on the combined non-prior dataset(s).
+The template runs two simulations: the first simulation uses all records from the `prior_` dataset(s) as prior knowledge, and the second uses a 1+1 randomly chosen set of prior knowledge from the non-prior knowledge dataset as a minimal training set. Both runs simulate performance on the combined non-prior dataset(s).

Running this template creates a `generated_data` folder. This folder contains two datasets; `dataset_with_priors.csv` and `dataset_without_priors.csv`. The simulations specified in the generated jobs file will use these datasets for their simulations.
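As an illustration, the prefix-based split of the data folder described above can be sketched in plain Python. This is a minimal sketch for illustration only; `split_by_prior_prefix` is a hypothetical helper, not part of Makita's API:

```python
def split_by_prior_prefix(filenames):
    """Split dataset filenames into custom-prior and simulation sets,
    mirroring the `prior_`/`priors_` naming convention described above.

    Hypothetical helper for illustration; not actual Makita code.
    """
    priors = [f for f in filenames if f.startswith(("prior_", "priors_"))]
    others = [f for f in filenames if f not in priors]
    return priors, others
```

For example, `split_by_prior_prefix(["prior_a.csv", "b.csv"])` returns `(["prior_a.csv"], ["b.csv"])`.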

21 changes: 12 additions & 9 deletions asreviewcontrib/makita/template_prior.py
@@ -95,10 +95,10 @@ def get_template_specific_params(self, params):
)
n_runs = self.n_runs if self.n_runs is not None else 1

-# Check if at least one dataset with prior knowledge is present
+# Check if at least one dataset with custom prior knowledge is present
if self._prior_dataset_count == 0:
raise ValueError(
-"At least one dataset with prior knowledge (prefix 'prior_' or \
+"At least one dataset with custom prior knowledge (prefix 'prior_' or \
'priors_') is required."
)

@@ -108,18 +108,21 @@ def get_template_specific_params(self, params):
"At least one dataset without prior knowledge is required."
)

-# Print the number of datasets with and without prior knowledge
-print(f"\nTotal datasets with prior knowledge: {self._prior_dataset_count}")
+# Print the number of datasets with custom and without prior knowledge
print(
-    f"Total datasets without prior knowledge: {self._non_prior_dataset_count}"
+    f"\nDatasets with custom prior knowledge: {self._prior_dataset_count}")
+print(
+    f"Datasets without prior knowledge: {self._non_prior_dataset_count}"
)

# Create a directory for generated data if it doesn't already exist
generated_folder = Path("generated_data")
generated_folder.mkdir(parents=True, exist_ok=True)

-# Set file paths for datasets with and without prior knowledge
-filepath_with_priors = generated_folder / "dataset_with_priors.csv"
+# Set file paths for datasets with custom records for prior knowledge
+# and without pre-set prior knowledge from which a minimal training
+# set of 2 will be selected
+filepath_with_priors = generated_folder / "dataset_custom_priors.csv"
filepath_without_priors = generated_folder / "dataset_without_priors.csv"

# Combine all datasets into one DataFrame and remove rows where label is -1
@@ -136,7 +139,7 @@ def get_template_specific_params(self, params):
combined_dataset["makita_priors"] == 0
].shape[0]

-# Print the number of rows with and without prior knowledge
+# Print the number of rows with custom and without prior knowledge
print(f"Total rows of prior knowledge: {total_rows_with_priors}")
print(f"Total rows of non-prior knowledge: {total_rows_without_priors}")

@@ -150,7 +153,7 @@ def get_template_specific_params(self, params):
index_label='record_id'
)

-# Create a string of indices for rows with prior knowledge
+# Create a string of indices for rows with custom prior knowledge
prior_idx_list = combined_dataset[
combined_dataset["makita_priors"] == 1
].index.tolist()
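The dataset assembly shown in the snippet above can be sketched end to end with toy data. This is a hedged sketch assuming pandas; the two DataFrames are stand-ins for the loaded prior and non-prior datasets, with column names taken from the diff:

```python
import pandas as pd

# Toy stand-ins for the prior and non-prior datasets (sketch only).
prior_df = pd.DataFrame({"label": [1, 0], "makita_priors": [1, 1]})
other_df = pd.DataFrame({"label": [1, 0, -1], "makita_priors": [0, 0, 0]})

# Combine all datasets into one DataFrame and remove rows where label is -1.
combined = pd.concat([prior_df, other_df], ignore_index=True)
combined = combined[combined["label"] != -1].reset_index(drop=True)

# Indices of the custom-prior rows, joined into the string that is
# later passed to the simulation's --prior_idx option.
prior_idx_list = combined[combined["makita_priors"] == 1].index.tolist()
prior_idx = " ".join(map(str, prior_idx_list))  # -> "0 1"
```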
8 changes: 4 additions & 4 deletions asreviewcontrib/makita/templates/template_prior.txt.template
@@ -39,11 +39,11 @@ python -m asreview wordcloud {{ filepath_without_priors }} -o {{ output_folder }
{% endif %}

{% for run in range(n_runs) %}
-python -m asreview simulate {{ filepath_with_priors }} -s {{ output_folder }}/simulation/state_files/sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }} --prior_idx {{ prior_idx }}
-python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json
+python -m asreview simulate {{ filepath_with_priors }} -s {{ output_folder }}/simulation/state_files/sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }} --prior_idx {{ prior_idx }}
+python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json

-python -m asreview simulate {{ filepath_without_priors }} -s {{ output_folder }}/simulation/state_files/sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --init_seed {{ init_seed + run }} --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }}
-python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json
+python -m asreview simulate {{ filepath_without_priors }} -s {{ output_folder }}/simulation/state_files/sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --init_seed {{ init_seed + run }} --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }}
+python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json

{% endfor %}
# Generate plot and tables for dataset
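The `"_{}".format(run) if n_runs > 1 else ""` expression used throughout the template above numbers state and metrics files only when more than one run is requested. In plain Python the rule behaves like this (illustrative re-implementation of the Jinja expression, not Makita code):

```python
def run_suffix(run, n_runs):
    """Per-run filename suffix as used in the template: empty for a
    single run, "_<run>" when multiple runs are requested."""
    return "_{}".format(run) if n_runs > 1 else ""
```

So `run_suffix(0, 1)` gives `""` (no suffix), while `run_suffix(2, 5)` gives `"_2"`.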