1 | | -This repository is a placeholder that will be populated soon. |
| 1 | +[DOI: 10.5281/zenodo.11474109](https://zenodo.org/doi/10.5281/zenodo.11474109) |
2 | 2 |
3 | | -See https://www.researchsquare.com/article/rs-3405645 for details |
| 3 | +# GenSynthPop |
| 4 | + |
| 5 | +This repository contains the implementation of GenSynthPop, |
| 6 | +a sample-free tool for constructing synthetic populations and households from mixed-aggregation contingency tables.
| 7 | + |
| 8 | +The work in this repository is described in |
| 9 | +*GenSynthPop: Generating a Spatially Explicit Synthetic Population of Agents and Households from Aggregated Data*[^1]. |
| 10 | + |
| 11 | +[^1]: Marco Pellegrino, Jan de Mooij, Tabea Sonnenschein et al. *GenSynthPop: Generating a Spatially Explicit Synthetic |
| 12 | +Population of Agents and Households from Aggregated Data*, 09 October 2023, PREPRINT (Version 1) available at Research |
| 13 | +Square [https://doi.org/10.21203/rs.3.rs-3405645/v1] |
| 14 | + |
| 15 | +An R implementation of this library is available [here](https://github.com/TabeaSonnenschein/GenSynthPop).
| 16 | +
| 17 | +A reference implementation is available
| 18 | +[here](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West).
| 19 | + |
| 20 | +## Intuition |
| 21 | + |
| 22 | +This library generates a synthetic population from contingency tables and marginal data, one attribute at a
| 23 | +time. It does not assume the presence of a detailed sample. Refer to the
| 24 | +[reference implementation](https://github.com/A-Practical-Agent-Programming-Language/Synthetic-Population-The-Hague-South-West) |
| 25 | +for details. |
| 26 | + |
| 27 | +Published data generally trades off the detail of the joint distribution of attributes against spatial
| 28 | +granularity. Data may be available for very small regions, but then cover only a few attributes, or even just one.
| 29 | +Those fine-grained data best capture spatial heterogeneity. An accurate representation of the population as a
| 30 | +whole, however, also requires the joint distribution of attributes, which is typically published only for
| 31 | +larger regions.
| 32 | + |
| 33 | +This library combines both. It assumes that, for any given attribute, marginal data are available at a high
| 34 | +spatial resolution and a contingency table is available at a lower spatial resolution.
| 35 | + |
| 36 | +With this library, each attribute is added to the population one at a time, conditioned (based on the contingency table) |
| 37 | +on previously added attributes but constrained on spatial details (based on the high spatial resolution marginal data). |
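
As a rough illustration of this idea (not the library's actual algorithm), the sketch below assigns a `gender` to the agents of a single neighborhood. The conditional weights stand in for a contingency table and the quotas stand in for the neighborhood marginals. The function name `assign_attribute` and all numbers are made up for the example.

```python
import random
from collections import Counter


def assign_attribute(known_values, conditional, quotas, seed=0):
    """Assign a new attribute to each agent of one neighborhood.

    known_values: previously added attribute value of each agent
    conditional:  weight of each new value given the known value
    quotas:       exact number of agents per new value in this neighborhood
    """
    if sum(quotas.values()) != len(known_values):
        raise ValueError("quotas must add up to the number of agents")
    rng = random.Random(seed)
    remaining = dict(quotas)
    order = list(range(len(known_values)))
    rng.shuffle(order)  # avoid favoring agents that happen to come first
    assigned = [None] * len(known_values)
    for i in order:
        # Only values whose neighborhood quota is not yet exhausted are allowed
        options = [value for value, left in remaining.items() if left > 0]
        weights = [conditional[known_values[i]].get(value, 0.0) for value in options]
        if sum(weights) == 0:
            # Fall back to a uniform choice when the table rules out all options
            weights = [1.0] * len(options)
        choice = rng.choices(options, weights=weights, k=1)[0]
        remaining[choice] -= 1
        assigned[i] = choice
    return assigned


# Hypothetical data for one neighborhood with five agents
age_groups = ["0-17", "18-64", "18-64", "65+", "65+"]
conditional = {
    "0-17": {"male": 0.51, "female": 0.49},
    "18-64": {"male": 0.50, "female": 0.50},
    "65+": {"male": 0.44, "female": 0.56},
}
quotas = {"male": 2, "female": 3}
genders = assign_attribute(age_groups, conditional, quotas)
print(Counter(genders))
```

The quotas guarantee that the neighborhood totals are matched exactly, while the weights steer which agents receive which value.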
| 38 | + |
| 39 | +Note that this library assumes that, for each attribute, a contingency table is available or can be obtained that
| 40 | +exhaustively lists the categorical values the attribute can take. If this table is not available as is, data
| 41 | +preparation is required before this library can be used.
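
One way to check this requirement during data preparation is to look for category combinations that are absent from the table. The sketch below is an assumption about how such a check could look. It uses a plain dictionary keyed by category tuples rather than the data frames the library works with, and `missing_combinations` is a hypothetical helper, not part of GenSynthPop.

```python
from itertools import product


def missing_combinations(contingency, categories):
    """Return the category combinations that have no entry in the table.

    contingency: dict mapping a tuple of category values to a count
    categories:  dict mapping each attribute name to its possible values,
                 in the same order as the tuple keys of `contingency`
    """
    expected = set(product(*categories.values()))
    return sorted(expected - set(contingency))


# Hypothetical table that lacks a count for women aged 65+
categories = {"age_group": ["0-17", "18-64", "65+"], "gender": ["female", "male"]}
contingency = {
    ("0-17", "female"): 120, ("0-17", "male"): 130,
    ("18-64", "female"): 410, ("18-64", "male"): 400,
    ("65+", "male"): 90,
}
print(missing_combinations(contingency, categories))
```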
| 42 | + |
| 43 | +## Generating a synthetic population |
| 44 | + |
| 45 | +In the reference implementation, the highest spatial resolution at which data was available is the neighborhood, so
| 46 | +neighborhoods are used in the examples here. If only one level of spatial resolution is available, this library
| 47 | +provides less utility, but it can still be used by adding a column with a single constant value to the source data.
| 48 | + |
| 49 | +## Generate individuals |
| 50 | + |
| 51 | +Start by instantiating a data frame with agent IDs located in each of the neighborhoods: |
| 52 | + |
```python
import pandas as pd

# `read_marginal_data` is a data-loading helper that is not part of this snippet.
# It is expected to return a data frame indexed by neighborhood code whose first
# column holds the number of inhabitants of that neighborhood.
agent_ids = []
agent_neighborhoods = []
agent_count = 0
for neighb_code, neighb_total in read_marginal_data(['population'], 'population').iterrows():
    n_agents = int(neighb_total.iloc[0])
    # Agent IDs are numbered consecutively across neighborhoods: SA000000, SA000001, ...
    agent_ids += [f"SA{i + agent_count:06d}" for i in range(n_agents)]
    agent_neighborhoods += [neighb_code] * n_agents
    agent_count += n_agents
df_synthetic_population = pd.DataFrame(data=dict(agent_id=agent_ids, neighb_code=agent_neighborhoods))
```
| 63 | + |
| 64 | +Next, add an attribute, for example age group. This is done through the
| 65 | +[Conditional Attribute Adder](gensynthpop/conditional_attribute_adder.py). |
| 66 | + |
| 67 | +Pass in the synthetic population created in the first step and a contingency table that, for each age group in the
| 68 | +index, specifies the number of people in that age group in each neighborhood.
| 69 | + |
| 70 | +The `group_by` argument indicates by which column(s) the data is split at the high spatial resolution.
| 71 | + |
```python3
import pandas as pd
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder

# Number of people per age group in each neighborhood
df_age_group = pd.read_csv('age_group_marginal_data.csv')
df_synthetic_population = ConditionalAttributeAdder(
    df_synthetic_population=df_synthetic_population,
    df_contingency=df_age_group,
    target_attribute='age_group',
    group_by=['neighb_code']
).run()
```
| 83 | + |
| 84 | +Next, an attribute can be added conditioned on a previous attribute, provided a contingency table containing at least |
| 85 | +that previous attribute is available. |
| 86 | + |
| 87 | +Best results are achieved if the Iterative Proportional Fitting procedure |
| 88 | +(e.g. [ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames)) is applied to the contingency |
| 89 | +table first to fit the contingency table to the margins of the attributes already added. |
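
To make the fitting step concrete, the following is a minimal hand-rolled sketch of Iterative Proportional Fitting for a two-dimensional table. It is only an illustration of the procedure and does not use or mirror the ipfn-python API. The function name `ipf_2d` and the numbers are assumptions made for the example.

```python
def ipf_2d(seed_table, row_targets, col_targets, max_iterations=1000, tol=1e-9):
    """Scale a 2D table until its row and column sums match the targets.

    seed_table:  list of rows holding the original contingency counts
    row_targets: desired sum of each row (e.g. age group margins)
    col_targets: desired sum of each column (e.g. gender margins)
    """
    fitted = [[float(value) for value in row] for row in seed_table]
    for _ in range(max_iterations):
        # Step 1: rescale every row to its target sum
        for r, target in enumerate(row_targets):
            row_sum = sum(fitted[r])
            if row_sum > 0:
                fitted[r] = [value * target / row_sum for value in fitted[r]]
        # Step 2: rescale every column to its target sum
        for c, target in enumerate(col_targets):
            col_sum = sum(row[c] for row in fitted)
            if col_sum > 0:
                for row in fitted:
                    row[c] *= target / col_sum
        # Columns now match, so stop once the rows match as well
        if all(abs(sum(fitted[r]) - t) < tol for r, t in enumerate(row_targets)):
            break
    return fitted


# Hypothetical region-level table (rows: age groups, columns: genders)
seed_table = [[40, 60], [30, 70]]
row_targets = [30, 20]  # age group counts in one neighborhood
col_targets = [25, 25]  # gender counts in the same neighborhood
fitted = ipf_2d(seed_table, row_targets, col_targets)
print(fitted)
```

The fitted table keeps the association between the attributes found in the seed table while agreeing with both sets of margins, which must add up to the same total.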
| 90 | + |
| 91 | +The process is similar to before, but neighborhood-specific margins can now be specified by calling `add_margins`.
| 92 | +The arguments work the same as with the
| 93 | +[ipfn-python](https://github.com/AJdeMooij/ipfn/tree/bugfix/pandas-sort-frames) library.
| 94 | + |
```python3
import pandas as pd
from gensynthpop.conditional_attribute_adder import ConditionalAttributeAdder

# `synthetic_population_to_contingency` and `read_and_fit_gender_contingency_table`
# are helpers that are not defined in this snippet.
df_margins_gender = pd.read_csv('marginal_gender_by_neighborhood.csv')
df_margins_age_group = synthetic_population_to_contingency(
    df_synthetic_population, ["neighb_code", "age_group"], True).reset_index()

df_gender_contingency = read_and_fit_gender_contingency_table(df_margins_age_group)

df_synthetic_population = ConditionalAttributeAdder(
    df_synthetic_population=df_synthetic_population,
    df_contingency=df_gender_contingency,
    target_attribute="gender",
    group_by=["neighb_code"]
).add_margins(
    margins=[df_margins_age_group, df_margins_gender],
    margins_names=[["age_group"], ["gender"]]
).run()
```
| 114 | + |
| 115 | +The process can be repeated for as many attributes as necessary. |
| 116 | + |
| 117 | +## Generate households |
| 118 | + |
| 119 | +After sufficient attributes are added to the individuals, they can be partitioned into households. |
| 120 | + |
| 121 | +The first step is to determine the types of households (e.g., singles, couples without children, couples with 1 child, |
| 122 | +couples with 2 children, etc.).
| 123 | + |
| 124 | +Next, each agent is assigned a *household position* as either an adult or (in the case of households with children)
| 125 | +a child in one of those household types. A contingency table for this attribute is rarely available as is and will
| 126 | +most likely need to be constructed. Once it is, add the household position as an individual-level attribute.
| 127 | + |
| 128 | +Next, the [Household Grouper](gensynthpop/household_grouper.py) can be used to partition the agents into households |
| 129 | +based on their household position. The household grouper aims at placing each agent into a household and generates |
| 130 | +households as required. It may switch positions of agents if necessary to fulfill the household constraints. |
| 131 | + |
| 132 | +In order to create the households, three Pandas Series are required: |
| 133 | + |
| 134 | +1) The typical gender distribution between adult partners (`male-female`, `male-male` and `female-female` as index) |
| 135 | +2) The typical age disparity between partners of the same gender, where the index defines a range |
| 136 | + as `<range_start>-<range_end>` |
| 137 | +3) The typical age disparity between a mother and oldest child, with the range again as index. |
| 138 | + |
| 139 | +Each series should specify the count or relative frequency of each group. If data is not available, a series can be
| 140 | +constructed with just one record, e.g., `0-999 = 1`, but ideally, these series are constructed from available data
| 141 | +for the region of interest.
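
To show how such a range-indexed distribution can be read, the sketch below draws an age gap from one. It uses a plain dictionary as a stand-in for a Pandas Series and assumes non-negative integer bounds with an inclusive end. How the library itself interprets the bounds is not specified here, and `parse_range` and `sample_age_gap` are hypothetical helpers.

```python
import random


def parse_range(label):
    """Split a '<range_start>-<range_end>' label into two integers."""
    start, end = label.split("-")
    return int(start), int(end)


def sample_age_gap(distribution, rng):
    """Pick a range by its count or relative frequency, then an age gap within it."""
    labels = list(distribution)
    weights = [distribution[label] for label in labels]
    label = rng.choices(labels, weights=weights, k=1)[0]
    low, high = parse_range(label)
    return rng.randint(low, high)  # assumption: both bounds are inclusive


# Hypothetical age disparity between partners, given as counts
couple_age_gap = {"0-2": 5, "3-5": 3, "6-10": 1}
# Fallback with a single record when no data is available
unknown_age_gap = {"0-999": 1}

rng = random.Random(42)
gaps = [sample_age_gap(couple_age_gap, rng) for _ in range(100)]
print(min(gaps), max(gaps))
```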
| 143 | + |
| 144 | +Each `HouseholdType` object takes the name of the household type and the three distributions as arguments. Next, its
| 145 | +constituent members are specified. A household can have two types of members, `adult` and `child`. For each type,
| 146 | +specify the `household position` that was added as an individual-level attribute earlier, the number of individuals
| 147 | +of that type in the household, and the household positions from which members can be taken as a backup.
| 149 | + |
| 150 | +For example, a household with two parents (with household position `couple_with_1_child`) |
| 151 | +and one child (with household position `child_of_couple_with_1_children`) may be specified as follows: |
| 152 | + |
```python3
# The three distributions are the Pandas Series described above
couple_with_1_child = HouseholdType(
    'couple_with_1_child',
    df_couple_gender_distribution,
    df_couple_age_distribution,
    df_parent_child_age_distribution
).add_members(
    # household position, member type, number of members, backup household positions
    'child_of_couple_with_1_children', 'child', 1, []
).add_members('couple_with_1_child', 'adult', 2, ['couple_no_children', 'single'])
```
| 163 | + |
| 164 | +If a household has no children, the first call to `add_members` can be omitted. To create single-person households,
| 165 | +the third argument of the remaining call to `add_members` is additionally changed to `1`.
| 166 | + |
| 167 | +Once all household types are specified, they can be added to the grouper, which can then create the households: |
| 168 | + |
```python3
hh_grouper = HouseholdGrouper(df_synthetic_population, ['neighb_code'], 'household_position')
hh_grouper.add_household_type(couple_with_1_child)
df_synthetic_population, df_synthetic_households = hh_grouper.run()
```
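
For intuition only, the sketch below shows a heavily simplified version of grouping by household position within a neighborhood. It is not the `HouseholdGrouper` algorithm: it ignores the gender and age distributions and the backup positions, and simply reports agents it cannot place. The function `group_households` and the agent records are made up for the example.

```python
def group_households(agents, composition):
    """Greedily form households of one type within each neighborhood.

    agents:      list of dicts with agent_id, neighb_code and household_position
    composition: dict mapping a household position to the number of members
                 with that position in one household
    Returns the list of households and the IDs of agents that were not placed.
    """
    # Collect agent IDs per neighborhood and household position
    pools = {}
    for agent in agents:
        by_position = pools.setdefault(agent["neighb_code"], {})
        by_position.setdefault(agent["household_position"], []).append(agent["agent_id"])

    households, leftover = [], []
    for neighb_code, by_position in pools.items():
        # The scarcest position limits how many complete households fit
        n_households = min(
            len(by_position.get(position, [])) // count
            for position, count in composition.items()
        )
        for _ in range(n_households):
            members = []
            for position, count in composition.items():
                members += [by_position[position].pop(0) for _ in range(count)]
            households.append({"neighb_code": neighb_code, "members": members})
        for position in composition:
            leftover += by_position.get(position, [])
    return households, leftover


# Hypothetical agents: neighborhood N1 can form two complete households, N2 none
adult, child = "couple_with_1_child", "child_of_couple_with_1_children"
agents = (
    [{"agent_id": f"a{i}", "neighb_code": "N1", "household_position": adult} for i in range(1, 5)]
    + [{"agent_id": f"c{i}", "neighb_code": "N1", "household_position": child} for i in range(1, 3)]
    + [{"agent_id": "a5", "neighb_code": "N2", "household_position": adult}]
)
households, leftover = group_households(agents, {adult: 2, child: 1})
print(len(households), leftover)
```

The unplaced agent in this example is the kind of case the text above describes as handled by switching household positions.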
| 174 | + |
| 175 | +## Next steps |
| 176 | + |
| 177 | +When the households have been created, additional attributes can still be added to the synthetic population of
| 178 | +individuals, or the households can be decorated with additional attributes. The same `ConditionalAttributeAdder`
| 179 | +can still be used for either.