Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring these behaviors requires principled evaluation methods. Our work considers both behaviors simultaneously, under the umbrella of compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. The CCR.GB benchmark is designed to measure CCR at all three levels of Pearl's Causal Hierarchy: (1) associational, (2) interventional, and (3) counterfactual.
CCR.GB provides two artifacts:
- Random CCR task generator. Open source code for on-demand task generation according to user specifications (graphical complexity, task theme, etc.).
- Pre-sampled benchmark dataset. A static dataset sampled from the random task generator, provided as a starting point for community benchmarking. The static datasets are hosted on Hugging Face; a minimal loading sketch follows this list.
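The static datasets are standard Hugging Face datasets, so they can be pulled with the `datasets` library. The snippet below is only a loading sketch: the repository ID and split name are hypothetical placeholders, not the actual dataset identifiers (see the project page for those).

```python
from datasets import load_dataset

# Minimal sketch of loading a pre-sampled CCR.GB dataset from Hugging Face.
# NOTE: the repository ID and split below are hypothetical placeholders;
# substitute the identifiers listed on the project page.
dataset = load_dataset("your-org/ccr-gb-clinical-notes", split="test")

# Inspect one record (field names will follow the dataset's actual schema).
print(dataset[0])
```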
For additional documentation, see our main project page.
The static dataset provided in this repository was sampled using our random task generator. Each CCR task is constructed according to the following procedure.
- Causal world model. First, we define a fictional world corresponding to a randomly generated causal graph. This is the causal world model the LM must reason over. The structural causal model defining our fictitious world comprises binary exogenous noise variables, binary endogenous variables, and causal functions built from the logical operators AND and OR; a minimal sketch of such a model appears after this list.
- Causal context prompt. Second, we construct a verbal description of the world model. This verbal description, our “causal context prompt,” contains all pertinent details needed for the LM to infer the world model, as well as extraneous details not needed to solve the CCR task. The causal context centers on a user-defined theme (e.g., ClinicalNotes, CandyParty, FlowerGarden, FluVaccine).
- Sampling. Third, we randomly sample the exogenous and extraneous variables and compute the true endogenous variable values. The sampled values are then used to construct a natural-language "sample context," which is concatenated to the causal context prompt. Each causal context is copied many times, and each copy is paired with a new sample context.
- Factual query prompts. Next, we construct factual queries by treating the causal context + sample context as observational data. All queries are phrased as yes/no questions. The factual query is then concatenated to a copy of the causal context + sample context. Responses to factual prompts can be used to estimate \(p(y \mid x)\) for a binary cause \(x\) and a binary effect \(y\). Thus, evaluation on factual queries alone tests reasoning at the associational level of Pearl's Causal Hierarchy. Note that evaluation at the associational level is less powerful at distinguishing recall from reasoning than evaluation at the higher levels of the Hierarchy.
- Interventional query pairs. Finally, we construct paired interventional queries corresponding to the interventions \(do(X = \text{True})\) and \(do(X = \text{False})\). Each interventional query is individually concatenated to a copy of the causal context + sample context. As with factual queries, all interventional queries are phrased as yes/no questions. Responses to interventional prompts are used to estimate \(p(y \mid do(X = \text{True}))\) and \(p(y \mid do(X = \text{False}))\). As matched pairs over the same sample context, these estimates are also used to compute the probability of necessity and sufficiency (PNS): \(p(y \mid do(X = \text{True})) - p(y \mid do(X = \text{False}))\). Thus, evaluation on interventional prompts tests reasoning at both the interventional and counterfactual rungs of Pearl's Causal Hierarchy. A sketch of this aggregation also appears below.
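To make the first three steps concrete, here is a minimal, self-contained sketch of the kind of structural causal model the generator builds and samples from: binary exogenous noise, binary endogenous variables, and causal functions built from AND and OR. This is an illustration only, not the code in `task_generator.py`; the variable names (`rain`, `sprinkler`, etc.) and noise probabilities are hypothetical.

```python
import random

# Bernoulli parameters for the exogenous noise variables (hypothetical values).
EXOGENOUS_PROBS = {"u_rain": 0.3, "u_sprinkler": 0.5, "u_smooth_soles": 0.4}

def evaluate(u, do=None):
    """Compute endogenous values from exogenous noise u, honoring any
    do()-style interventions that clamp an endogenous variable."""
    do = do or {}
    v = {}

    def assign(name, value):
        v[name] = do.get(name, value)

    assign("rain", u["u_rain"])                                   # root cause
    assign("sprinkler", u["u_sprinkler"])                         # root cause
    assign("wet_grass", v["rain"] or v["sprinkler"])              # OR of two parents
    assign("slippery", v["wet_grass"] and u["u_smooth_soles"])    # AND with its noise term
    return v

def sample_world(rng, do=None):
    """Sample each exogenous noise variable independently, then evaluate the SCM."""
    u = {name: rng.random() < p for name, p in EXOGENOUS_PROBS.items()}
    return evaluate(u, do)

rng = random.Random(0)
print(sample_world(rng))                          # observational sample
print(sample_world(rng, do={"sprinkler": True}))  # sample under do(sprinkler = True)
```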
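And here is a sketch of how parsed yes/no responses could be aggregated into the quantities above: \(p(y \mid x)\) from factual prompts, \(p(y \mid do(X = \text{True}))\) and \(p(y \mid do(X = \text{False}))\) from the matched interventional pairs, and their difference as the PNS. The record structure and field names (`factual`, `do_true`, `do_false`) are hypothetical, not the benchmark's actual output schema.

```python
# Hypothetical aggregation of parsed yes/no answers into the causal quantities
# described above. Each record holds a model's answers for one sample context;
# the field names and example values below are illustrative only.
records = [
    {"factual": "yes", "do_true": "yes", "do_false": "no"},
    {"factual": "no",  "do_true": "yes", "do_false": "yes"},
    {"factual": "yes", "do_true": "no",  "do_false": "no"},
]

def yes_rate(records, field):
    """Fraction of 'yes' answers for one query type, i.e. a probability estimate."""
    return sum(r[field] == "yes" for r in records) / len(records)

p_factual = yes_rate(records, "factual")    # estimate of p(y | x)
p_do_true = yes_rate(records, "do_true")    # estimate of p(y | do(X = True))
p_do_false = yes_rate(records, "do_false")  # estimate of p(y | do(X = False))
pns = p_do_true - p_do_false                # PNS, as defined above

print(f"p(y | x)           ~ {p_factual:.2f}")
print(f"p(y | do(X=True))  ~ {p_do_true:.2f}")
print(f"p(y | do(X=False)) ~ {p_do_false:.2f}")
print(f"PNS                ~ {pns:.2f}")
```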
# Python scripts for generating random CCR tasks.
├── candy_party.py
├── clinical_notes.py
├── flower_garden.py
├── flu_vaccine.py
├── task_generator.py
├── dataset_generator.py
├── utils.py
# Walk-through for using CCR.GB for reasoning evaluation.
├── demos
│ └── demo_clinical_notes.ipynb
# Error testing.
├── error_test_candy_party.ipynb
├── error_test_clinical_notes.ipynb
├── error_test_flower_garden.ipynb
├── error_test_flu_vaccine.ipynb
# Code used to generate the static datasets on Hugging Face.
├── generate_clinical_notes.ipynb
# Requirements for reproducibility.
└── requirements.txt