Commit bdd1d22

Ready for publication
1 parent 321563b commit bdd1d22

14 files changed, with 2555 additions and 3 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -159,4 +159,4 @@ cython_debug/
159 159
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
160 160
# and can be added to the global gitignore or merged into this file. For a more nuclear
161 161
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
162-
#.idea/
162+
.idea/

CITATION.cff

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# This CITATION.cff file was generated with cffinit.
2+
# Visit https://bit.ly/cffinit to generate yours today!
3+
4+
cff-version: 1.2.0
5+
title: >-
6+
GenSynthPop-Python: Generating a Spatially Explicit Synthetic
7+
Population of Agents and Households from Aggregated Data
8+
message: >-
9+
If you use this software please cite as below
10+
type: software
11+
authors:
12+
- given-names: Jan
13+
family-names: Mooij
14+
email: A.J.deMooij@uu.nl
15+
affiliation: >-
16+
Intelligent Systems, Information and Computing
17+
Sciences, Utrecht University, Princetonplein 5,
18+
Utrecht, 3584 CC, The Netherlands
19+
orcid: 'https://orcid.org/0000-0003-4129-6074'
20+
name-particle: de
21+
- given-names: Tabea
22+
family-names: Sonnenschein
23+
email: T.S.Sonnenschein@uu.nl
24+
affiliation: >-
25+
Department of Human Geography and Spatial Planning,
26+
Utrecht University, Heidelberglaan 8, Utrecht, 3584
27+
CS, The Netherlands
28+
orcid: 'https://orcid.org/0000-0001-6592-9548'
29+
- family-names: Pellegrino
30+
given-names: Marco
31+
- given-names: Mehdi
32+
family-names: Dastani
33+
affiliation: >-
34+
Intelligent Systems, Information and Computing
35+
Sciences, Utrecht University, Princetonplein 5,
36+
Utrecht, 3584 CC, The Netherlands
37+
- given-names: Dick
38+
family-names: Ettema
39+
affiliation: >-
40+
Department of Human Geography and Spatial Planning,
41+
Utrecht University, Heidelberglaan 8, Utrecht, 3584
42+
CS, The Netherlands
43+
- orcid: 'https://orcid.org/0000-0003-0648-7107'
44+
given-names: Brian
45+
family-names: Logan
46+
affiliation: >-
47+
Intelligent Systems, Information and Computing
48+
Sciences, Utrecht University, Princetonplein 5,
49+
Utrecht, 3584 CC, The Netherlands
50+
- given-names: Judith A.
51+
family-names: Verstegen
52+
affiliation: >-
53+
Department of Human Geography and Spatial Planning,
54+
Utrecht University, Heidelberglaan 8, Utrecht, 3584
55+
CS, The Netherlands
56+
orcid: 'https://orcid.org/0000-0002-9082-4323'
57+
identifiers:
58+
- type: doi
59+
value: 10.5281/zenodo.11474110
60+
repository-code: >-
61+
https://github.com/A-Practical-Agent-Programming-Language/GenSynthPop-Python
62+
repository: >-
63+
https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West
64+
license: GPL-3.0

LICENSE.txt

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 179 additions & 2 deletions
@@ -1,3 +1,180 @@
1-
This repository is a placeholder that will be populated soon.
1+
[![DOI](https://zenodo.org/badge/810318631.svg)](https://zenodo.org/doi/10.5281/zenodo.11474109)
2 2

3-
See https://www.researchsquare.com/article/rs-3405645 for details
3+
# GenSynthPop
4+
5+
This repository contains the implementation of GenSynthPop,
6+
a sample-free tool to construct Synthetic Populations and Households from mixed-aggregation contingency tables.
7+
8+
The work in this repository is described in
9+
*GenSynthPop: Generating a Spatially Explicit Synthetic Population of Agents and Households from Aggregated Data*[^1].
10+
11+
[^1]: Marco Pellegrino, Jan de Mooij, Tabea Sonnenschein et al. *GenSynthPop: Generating a Spatially Explicit Synthetic
12+
Population of Agents and Households from Aggregated Data*, 09 October 2023, PREPRINT (Version 1) available at Research
13+
Square [https://doi.org/10.21203/rs.3.rs-3405645/v1]
14+
15+
An R implementation of this library is available [here](https://github.com/TabeaSonnenschein/GenSynthPop).
16+
17+
A reference implementation is available
18+
[here](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West).
19+
20+
## Intuition
21+
22+
This library generates a synthetic population from contingency tables and marginal data, one attribute at a
23+
time. It does not assume the presence of a detailed sample. Refer to the
24+
[reference implementation](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West)
25+
for details.
26+
27+
Generally, when data is published, there is a trade-off between the accuracy of the joint distribution of attributes and the
28+
spatial granularity. Detailed data may be available for very small regions, but only contain a few or even one
29+
attribute.
30+
In order to achieve the best spatial heterogeneity, one would prefer to use those data. However, in order to get an
31+
accurate representation of the population as a whole, the joint distribution of attributes is relevant.
32+
33+
This library allows combining both. It assumes that, for any given attribute, marginal data is available at a high
34+
spatial resolution and a contingency table at lower levels of spatial resolution, and allows combining the two.
35+
36+
With this library, each attribute is added to the population one at a time, conditioned (based on the contingency table)
37+
on previously added attributes but constrained on spatial details (based on the high spatial resolution marginal data).
38+
39+
Note that this library assumes that, for each attribute, a contingency table is available or can be obtained that exhaustively
40+
lists the possible categorical values the attribute can take. If this table is not available as is, data preparation is
41+
required before this library can be used.
42+
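As a rough illustration of the conditioning step (a sketch with made-up numbers, not the library's implementation), the snippet below derives the distribution of a new attribute conditioned on an attribute that was added earlier. The table `regional`, the series `neighb_age_counts` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical low-resolution contingency table: gender counts per age group.
regional = pd.DataFrame(
    {"male": [300, 200], "female": [200, 300]},
    index=pd.Index(["young", "old"], name="age_group"),
)

# Conditional distribution P(gender | age_group) at the coarse spatial level.
p_gender_given_age = regional.div(regional.sum(axis=1), axis=0)

# Hypothetical neighborhood in which age groups were already assigned to 100 agents.
neighb_age_counts = pd.Series({"young": 80, "old": 20})

# Expected gender counts per age group in this neighborhood.
expected = p_gender_given_age.mul(neighb_age_counts, axis=0)
expected_gender_totals = expected.sum(axis=0)
print(expected_gender_totals)
```

The expected totals obtained this way (here 56 male and 44 female) will generally differ from the gender margin published for that neighborhood, which is why the high-resolution marginal data is used as an additional constraint.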
43+
## Generating a synthetic population
44+
45+
In the reference implementation, the highest spatial resolution data was available at the level of a neighborhood, so
46+
that example will be maintained here. If only one level of spatial resolution is available, this library provides
47+
less utility, but a column with a single value can be added to the source data to still use this library.
48+
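A minimal sketch of that workaround, assuming the source data is held in a pandas data frame (the frame `df_margins`, its columns and the value `"all"` are hypothetical):

```python
import pandas as pd

# Hypothetical marginal data that is only available for the region as a whole.
df_margins = pd.DataFrame({"age_group": ["young", "old"], "count": [80, 20]})

# Add a constant column so that every record belongs to the same single group.
df_margins["neighb_code"] = "all"
```

The constant column can then play the role of the neighborhood code wherever a grouping column is expected.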
49+
## Generate individuals
50+
51+
Start by instantiating a data frame with agent IDs located in each of the neighborhoods:
52+
53+
```python
54+
agent_ids = list()
55+
agent_neighborhoods = list()
56+
agent_count = 0
57+
for neighb_code, (neighb_total) in read_marginal_data(['population'], 'population').iterrows():
58+
    agent_ids += [f"SA{i + agent_count:06d}" for i in range(neighb_total.iloc[0])]
59+
    agent_neighborhoods += [neighb_code] * neighb_total.iloc[0]
60+
    agent_count += neighb_total.iloc[0]
61+
df_synthetic_population = pd.DataFrame(data=dict(agent_id=agent_ids, neighb_code=agent_neighborhoods))
62+
```
63+
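The snippet above relies on a `read_marginal_data` helper that is not shown here. A self-contained variant on toy data might look as follows; the frame `df_population`, the neighborhood codes and the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the marginal population data: one row per neighborhood.
df_population = pd.DataFrame(
    {"population": [3, 2]},
    index=pd.Index(["NB01", "NB02"], name="neighb_code"),
)

# Repeat every neighborhood code once per inhabitant of that neighborhood.
agent_neighborhoods = list(
    df_population.index.repeat(df_population["population"].to_numpy())
)

# Number the agents consecutively across all neighborhoods.
df_toy_population = pd.DataFrame({
    "agent_id": [f"SA{i:06d}" for i in range(len(agent_neighborhoods))],
    "neighb_code": agent_neighborhoods,
})
print(df_toy_population)
```

The result has one row per agent, with the same ID format and column names as the data frame built above.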
64+
Next, add an attribute, for example the age group. This is done through the
65+
[Conditional Attribute Adder](gensynthpop/conditional_attribute_adder.py).
66+
67+
Pass in the synthetic population created in the first step and a contingency table that, for each age group in the index,
68+
specifies the number of people in that age group in each neighborhood.
69+
70+
The `group_by` argument indicates by which column(s) the data is split at the high spatial resolution.
71+
72+
```python3
73+
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder
74+
75+
df_age_group = pd.read_csv('age_group_marginal_data.csv')
76+
df_synthetic_population = ConditionalAttributeAdder(
77+
    df_synthetic_population=df_synthetic_population,
78+
    df_contingency=df_age_group,
79+
    target_attribute='age_group',
80+
    group_by=['neighb_code']
81+
).run()
82+
```
83+
84+
Next, an attribute can be added conditioned on a previous attribute, provided a contingency table containing at least
85+
that previous attribute is available.
86+
87+
Best results are achieved if the Iterative Proportional Fitting procedure
88+
(e.g. [ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames)) is applied to the contingency
89+
table first to fit the contingency table to the margins of the attributes already added.
90+
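For readers unfamiliar with Iterative Proportional Fitting, the following is a minimal two-dimensional NumPy sketch of the procedure. It is not the ipfn-python API; the function name `ipf_2d` and the numbers are illustrative, and it assumes a strictly positive seed table and row and column targets with equal totals.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Fit a 2D table to given row and column totals (illustrative sketch)."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that the row sums match the row targets.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column so that the column sums match the column targets.
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Stop once the row sums still match after the column scaling.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0, atol=tol):
            break
    return table

# Rows: age groups (young, old); columns: gender (male, female).
seed = [[300, 200], [200, 300]]   # hypothetical low-resolution contingency table
fitted = ipf_2d(seed, row_targets=[80, 20], col_targets=[50, 50])
print(fitted.round(2))
```

The fitted table keeps the association between the two attributes found in the seed table while its totals agree with both sets of margins.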
91+
The process is similar to before, but neighborhood-specific margins can now be specified by calling `add_margins`. The
92+
arguments work the same as with the
93+
[ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames) library.
94+
95+
```python3
96+
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder
97+
98+
df_margins_gender = pd.read_csv('marginal_gender_by_neighborhood.csv')
99+
df_margins_age_group = synthetic_population_to_contingency(
100+
    df_synthetic_population, ["neighb_code", "age_group"], True).reset_index()
101+
102+
df_gender_contingency = read_and_fit_gender_contingency_table(df_margins_age_group)
103+
104+
df = ConditionalAttributeAdder(
105+
    df_synthetic_population=df_synthetic_population,
106+
    df_contingency=df_gender_contingency,
107+
    target_attribute="gender",
108+
    group_by=["neighb_code"]
109+
).add_margins(
110+
    margins=[df_margins_age_group, df_margins_gender],
111+
    margins_names=[["age_group"], ["gender"]]
112+
).run()
113+
```
114+
115+
The process can be repeated for as many attributes as necessary.
116+
117+
## Generate households
118+
119+
After sufficient attributes are added to the individuals, they can be partitioned into households.
120+
121+
The first step is to determine the types of households (e.g., singles, couples without children, couples with 1 child,
122+
couples with 2 children, etc.).
123+
124+
Next, each agent is assigned a *household position* as either an adult or (in the case of households with children)
125+
child in one of those household types. If a contingency table for the household position is available, it can be used directly, but most likely, this
126+
contingency table needs to be constructed. Once it is, add it as an individual-level attribute.
127+
128+
Next, the [Household Grouper](gensynthpop/household_grouper.py) can be used to partition the agents into households
129+
based on their household position. The household grouper aims at placing each agent into a household and generates
130+
households as required. It may switch positions of agents if necessary to fulfill the household constraints.
131+
132+
In order to create the households, three Pandas Series are required:
133+
134+
1) The typical gender distribution between adult partners (`male-female`, `male-male` and `female-female` as index)
135+
2) The typical age disparity between partners of the same gender, where the index defines a range
136+
as `<range_start>-<range_end>`
137+
3) The typical age disparity between a mother and oldest child, with the range again as index.
138+
139+
Each series should specify the count or relative frequency of each group. If data is not available, a series can be
140+
constructed
141+
with just one record, e.g., `0-999 = 1`, but ideally, these series are constructed from available data for the region of
142+
interest.
143+
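A sketch of what these three series could look like is given below. All numbers are invented for illustration, and the variable names and the `parse_range` helper are hypothetical; the helper only shows how a `<range_start>-<range_end>` label can be read and is not part of the library.

```python
import pandas as pd

# 1) Gender distribution between adult partners (relative frequencies, made up).
gender_distribution = pd.Series(
    {"male-female": 0.95, "male-male": 0.02, "female-female": 0.03}
)

# 2) Age disparity between partners, in years (counts, made up).
couple_age_disparity = pd.Series({"0-2": 40, "3-5": 35, "6-10": 20, "11-999": 5})

# 3) Age disparity between a mother and her oldest child, in years (counts, made up).
mother_child_age_disparity = pd.Series({"15-24": 20, "25-34": 60, "35-50": 20})

# Fallback with a single record when no data is available.
no_data_fallback = pd.Series({"0-999": 1})

def parse_range(label):
    """Split a '<range_start>-<range_end>' label into two integers."""
    start, end = label.split("-")
    return int(start), int(end)

print([parse_range(label) for label in couple_age_disparity.index])
```

Whether the range bounds are treated as inclusive is not specified here, so adjacent ranges in this sketch are written without overlap.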
144+
Each `HouseholdType` object takes the name of the household type and the three distributions as arguments. Next, its
145+
constituent members are specified by household type. A household can have two types of members, `adult` and `child`, and
146+
for each type, the `household position` that was added as an individual level attribute earlier should be specified,
147+
as well as the number of individuals of that type in the household and the household positions from which members can
148+
be taken as a backup.
149+
150+
For example, a household with two parents (with household position `couple_with_1_child`)
151+
and one child (with household position `child_of_couple_with_1_children`) may be specified as follows:
152+
153+
```python3
154+
couple_with_1_child = HouseholdType(
155+
    'couple_with_1_child',
156+
    df_couple_gender_distribution,
157+
    df_couple_age_distribution,
158+
    df_parent_child_age_distribution
159+
).add_members(
160+
    'child_of_couple_with_1_children', 'child', 1, []
161+
).add_members('couple_with_1_child', 'adult', 2, ['couple_no_children', 'single'])
162+
```
163+
164+
If a household has no children, the first call to `add_members` can be omitted. To create single households, the third
165+
argument of the second call to `add_members` can additionally be changed to `1`.
166+
167+
Once all household types are specified, they can be added to the grouper, which can then create the households:
168+
169+
```python3
170+
hh_grouper = HouseholdGrouper(df_synthetic_population, ['neighb_code'], 'household_position')
171+
hh_grouper.add_household_type(couple_with_1_child)
172+
df_synthetic_population, df_synthetic_households = hh_grouper.run()
173+
```
174+
175+
## Next steps
176+
177+
When the households have been created, additional attributes can still be added to the synthetic population of
178+
individuals,
179+
or the households can be decorated with additional attributes. The same `ConditionalAttributeAdder` can still be used
180+
for either.

gensynthpop/__init__.py

Whitespace-only changes.
