docu_ABM.Rmd

---
title: "ABM User Manual"
author: "Johanna Cronenberg and Michele Gubian"
date: "February 12, 2020"
output: html_document
---

```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(position = c('top', 'right'), tooltip_message = 'click to copy', tooltip_success = 'done')
```

## Setup

```{r}
R.version.string
```

```{r eval=FALSE}
setwd("/homes/myName/path/to/ABM")
```

- Clone the [github repo](https://github.com/IPS-LMU/ABM) or download the code from there.
- Open RStudio and use `setwd()` as shown above to navigate to the folder of the ABM code (if necessary). This path will the the reference point for all relative paths used in this manual and in the simulations.
- Make sure you have the R version indicated above (or newer) as well as the latest version of all packages (Tools > Check for Package Updates).
- Adapt the ABM settings in `data/params.R`. Explanations of all parameters are provided below.
- Run the ABM using the following command:

```{r eval=FALSE}
source("Rcmd/ABMmain.R")
```

## Simulation settings

The file `data/params.R` stores the list `params`, which specifies the simulation settings. Here you find a short description of every option.

### Input data

```{r eval=FALSE}
inputDataFile = "./data/demo_single_phoneme.csv"    # absolute or relative path to input data
features = c("DCT0", "DCT1", "DCT2")    # the column(s) in inputDataFile that is/are used as features
group = "age"                           # the column in inputDataFile that defines the agents' groups
label = "phoneme"                       # the column in inputDataFile that stores the phonological labels (can be changed)
initial = "initial"                     # the column in inputDataFile that stores the phonological labels (will not be changed)
word = "word"                           # the column in inputDataFile that stores the word labels
speaker = "spk"                         # the column in inputDataFile that stores the speakers' IDs or names
subsetSpeakers = NULL                   # NULL or a vector of strings, e.g. c("spk01", "spk02", "spk03")
subsetLabels = NULL                     # NULL or a vector of strings, e.g. c("a", "i", "u", "o")
```

The input data file can be `.txt` or `.csv` and needs to contain a table (it will be loaded as a `data.table`). You can name the columns any which way you like since you will have to indicate in `data/params.R` which columns store which information: features, group, label, initial, word, speaker (see comments in code snippet above). There can be more columns than needed (they will be ignored), but also more observations. You can subset the speakers and phonological labels to be used in the simulations by setting `subsetSpeakers` and `subsetLabels`.

### Initialisation procedure

```{r eval=FALSE}
createPopulationMethod = "speaker_is_agent"  # "speaker_is_agent" or "bootstrap"
bootstrapPopulationSize = 50            # full positive number; only if createPopulationMethod == "bootstrap"
initialMemoryResampling = FALSE         # enlarge the agents' memories before the interactions or not
initialMemoryResamplingFactor = 1.0     # 1.0 or higher; only if initialMemoryResampling == TRUE
proportionGroupTokens = 0.0             # between 0.0 and 1.0; proportion of tokens from own speaker group that an agent is initialised with
rememberOwnTokens = TRUE                # whether or not to perceive one's own tokens
```

Usually, every human speaker is represented by one agent which is achieved by initialising the agent with the acoustic data of the speaker (`createPopulationMethod = "speaker_is_agent"`). However, there is the possibility to break this paradigm by applying [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). If so, you will need to specify how many agents the new population should be comprised of (`bootstrapPopulationSize`). It is highly recommended to then do multiple ABM runs (`runMode = "multiple"`; see below).

Since we often deal with sparse data in the phonetic sciences, it makes sense to augment the amount of data (i.e. tokens per word per speaker) *before* the first interaction takes place (`initialMemoryResampling = TRUE`). If you want to do this (which is recommended in order to avoid artefacts during the simulations), you also need to specify the `initialMemoryResamplingFactor`. The resampling uses the production technique indicated by the parameters in the next section. Be aware that a large resampling factor, e.g. 10, will slow any change down, so many more interactions are needed (`nrOfSnapshots` and `interactionsPerSnapshot`, see below).

`proportionGroupTokens` can be used to initialise agents with a proportion of tokens from their own speech community. So if an agent would usually receive 20 tokens from a real speaker and `proportionGroupTokens = 0.25`, the agent will instead be initialised with 15 tokens from the real speaker (i.e. 75%) and 5 random tokens from speakers of the agent's speaker group. Over the course of the interactions, it is often the case that all agents start to become less individual in their phonetic representations, i.e. all agents' categories collapse into one point. This result can be avoided by setting `rememberOwnTokens = TRUE`. It is then highly recommended to have a `proportionGroupTokens > 0`.

### Production

```{r eval=FALSE}
productionBasis = "word"                # "word" or "label"; estimate Gaussian based on tokens associated with words or labels
productionResampling = "SMOTE"          # NULL or "SMOTE"; apply SMOTE to make Gaussian more stable or not
productionResamplingFallback = "label"  # currently only "label"
productionMinTokens = 20                # only if productionResampling == "SMOTE"; minimum number of tokens to be used in building Gaussian
productionSMOTENN = 5                   # only if productionResampling == "SMOTE"; number of nearest neighbours used in SMOTE
```

Every interaction between an agent-speaker and an agent-listener starts with the production of a word. Then a Gaussian distribution is estimated based on all acoustic tokens that are associated either with the target word (`productionBasis = "word"`) or the phonological label (`productionBasis = "label"`) associated with that word.

Since in some cases the number of available tokens for a given word or label is too small to estimate the parameters of a multi-dimensional Gaussian distribution, extra tokens can be generated using [SMOTE](http://rikunert.com/SMOTE_explained) (`productionResampling = "SMOTE"`). You will have to specify the minimum number of tokens that should be used in order to build a Gaussian distribution (e.g. `productionMinTokens = 20`) as well as the number of nearest neighbours that will be considered when performing the random linear interpolation (e.g. `productionSMOTENN = 5`). Whenever the available number of tokens for a given word is less than one plus the number of nearest neighbours, the neighbourhood is completed by adding tokens from other words containing the same target phoneme (`productionResamplingFallback = "label"`).

Once the Gaussian distribution is estimated, a new acoustic token is sampled from it. The agent-listener receives the acoustic features, the word label, and phonological label.

### Perception

```{r eval=FALSE}
memoryIntakeStrategy = "mahalanobisDistance"              # "maxPosteriorProb" and/or "mahalanobisDistance" and/or "posteriorProbThr"
mahalanobisThreshold = qchisq(.99, df = 3) %>% round(2)   # threshold if memoryIntakeStrategy == "mahalanobisDistance"
posteriorProbThr = 1/3                                    # only if memoryIntakeStrategy == "posteriorProbThr"
perceptionNN = 5      # uneven full number; assign label to unknown word based on majority vote among perceptionNN nearest neighbours
```

It is assumed that word recognition always works, i.e. the agent-listener does *not* decide whether or not he understands the word. Instead, it is decided whether the target phoneme in the produced token is probabilistically close enough to the agent-listener's distribution of that phoneme. If so, the new token is memorised, otherwise a new interaction begins. The parameter `memoryIntakeStrategy` indicates which kind of statistical decision is to be taken (combining strategies is also possible, e.g. `c("maxPosteriorProb", "mahalanobisDistance")`):

- `"maxPosteriorProb"`: Maximum posterior probability decision, i.e. the produced token is only memorised if its probability of belonging to the listener's corresponding phonological category is higher than that of belonging to any of the other categories.
- `"mahalanobisDistance"`: The distance between the token and the corresponding distribution in the listener's memory has to be smaller than a given threshold in order to be incorporated. The threshold is specified by setting `"mahalanobisThreshold"` where the degrees of freedom needs to be `length(params$features)`. This approach does not take into account any of the other phonological categories, i.e. it might not be the appropriate strategy if two or more categories in your data overlap or are very close to each other.
- `"posteriorProbThr"`: The produced token is memorised if its posterior probability of belonging to the phonological category is higher than the threshold indicated by this option.

If a new token is associated with a word that is unknown to an agent-listener, a word label will be assigned to the token based on a majority vote among `perceptionNN` nearest neighbours.

### Forgetting

```{r eval=FALSE}
memoryRemovalStrategy = "random"        # "random" (recommended) or "outlierRemoval" or "timeDecay"
forgettingRate = 0                      # number between 0 and 1
```

It is unlikely that speakers don't ever forget any of the episodic traces they have memorised. However, since we do not have any evidence on how this forgetting process works, we have to be agnostic about it. That is why we recommend "forgetting" a random token that has the same word label as the newly perceived token. For reasons of backward compatibility, `memoryRemovalStrategy` can also be used to delete either the oldest token (`timeDecay`) or the farthest outlier of the phoneme distribution (`outlierRemoval`). Be aware that these have unwanted side effects and produce artefacts.

`forgettingRate` is also part of the perception process (even though it is planned to decouple perception and forgetting completely). This parameter takes any value between 0 and 1; if a randomly generated number is smaller than `forgettingRate` the agent-listener will remove a random token from its memory. This type of forgetting does not take into account to which word type or phoneme the removed token belongs.

### Interactions

```{r eval=FALSE}
interactionPartners = "betweenGroups"   # "random" or "betweenGroups" or "withinGroups"; from which groups the interacting agents must be
speakerProb = NULL                      # NULL or a vector of numerics; whether some agents should speak more often than others
listenerProb = NULL                     # NULL or a vector of numerics; whether some agents should listen more often than others
```

The parameter `interactionPartners` lets you choose from which groups the two interacting agents shall come.

- `"random"`: It does not matter from which group an agent comes
- `"withinGroups"`: Speaker and listener must come from the same group
- `"betweenGroups"`: Speaker and listener must be members of different groups

In certain cases, you may want to introduce an imbalance to the ABM concerning the frequency with which one or more agents are chosen to be speakers or listeners in an interaction. `speakerProb` and `listenerProb` allow you to assign a speaking or listening probability to each agent by specifying a vector of numbers, which do not need to sum up to one, as they will be normalised internally. If left `NULL`, all agents will get equal chances to be selected as speakers or listeners in an interaction.

### Split and merge

```{r eval=FALSE}
splitAndMerge = FALSE                   # apply split & merge algorithm or not
doSplitAndMergeBeforeABM = FALSE        # apply split & merge before the first interaction or not
splitAndMergeInterval = 100             # any full positive number; after how many interactions an agent applies split & merge
```

Perform split and merge. If `splitAndMerge == TRUE`, further parameters have to be set: `splitAndMergeInterval` specifies the number of interactions between each run of split and merge; `doSplitAndMergeBeforeABM` specifies whether to perform split and merge before simulation start. However, please avoid split & merge if possible, since it is still under heavy construction.

### Runs

```{r eval=FALSE}
runMode = "single"                      # "single" or "multiple"
multipleABMRuns = 2                     # any full positive number; number of ABM runs if runMode == "multiple"
nrOfSnapshots = 2                       # any full positive number; how often the population is archived during the simulation
interactionsPerSnapshot = 1000          # any full positive number; how many interactions take place per snapshot
```

The ABM system offers you two ways of running it: You can either perform one ABM run (`runMode = "single"`) in order to see the immediate results or you can perform multiple independent ABM runs (`runMode = "multiple"`) which makes it possible to check whether the ABM delivers stable results on your data. If multiple simulations are run, specify the number of runs by setting the parameter `multipleABMRuns`. In the current version, the `"multiple"` mode makes use of the `parallel` library and it is only supported for Linux. 

A simulation is a sequence of `nrOfSnapshots * interactionsPerSnapshot` interactions. At every `interactionsPerSnapshot` interactions a snapshot of all agents' memories is taken and saved.

## Data structures and files organisation

```{r eval=FALSE}
rootLogDir = "./logs"                   # absolute or relative path to logging directory
```

The results of your simulations will be saved in `rootLogDir` in distinct folders called `ABM<date><time>`, e.g. `ABM20191017115249`. `rootLogDir` will be created if it does not exist yet. There will be a folder for each separate simulation in `rootLogDir/ABM<date><time>/`, called `1/`, `2/`, and so on (or only `1/` for a single simulation). Every simulation folder, e.g. `ABM20191017115249/1`, will contain the following objects:

- `input.rds`: the input data frame without any changes to your original data frame.
- `pop.0.rds`: the input data frame with adapted column names (`P1`, `P2`, etc., `group`, `label`, `initial`, `word`, `speaker`) and comprising only of those speakers and labels that you subsetted there to be (`subsetSpeakers` and `subsetLabels` in `data/params.R`). However there are some further columns:

agentID | producerID | valid | nrOfTimesHeard 
-----------|-----------|-----------|-----------
the ID given to each agent | the ID of the agent-speaker that produced the token (but since no interaction has taken place yet `agentID == producerID`) | used internally to allocate free space in the agents' memories so you should ignore all columns where `valid == FALSE` | is 1 at that point, and always increased by 1 for all tokens of the same word when the agent has perceived that word

timeStamp | equivalence | condition
-----------|-----------|-----------
randomly assigned number for all tokens of a phoneme that is used for `memoryRemovalStrategy = "timeDecay"` | can be ignored (as it is used for `splitAndMerge = TRUE` which also shouldn't be used at the moment) | always `0` at this point, since no interactions have taken place yet; this number is increased by 1 with every snapshot

- `pop.1.rds`, `pop.2.rds`, etc.: the snapshots of the population of which there should be as many as `nrOfSnapshots`. Between every snapshot, as many as `interactionsPerSnapshot` interactions took place. They have the same columns as `pop.0.rds`.
- `intLog.rds`: the interaction log which will have as many observations as `nrOfSnapshots * interactionsPerSnapshot`, i.e. one row per interaction. This data frame has the following columns:

producerID | perceiverID | word | producerLabel | perceiverLabel
-----------|-----------|-----------|-----------|-----------
the agent-speaker's ID | the agent-listener's ID | the word label of the produced token | the phonological label the agent-speaker associated with the token | the phonological label the agent-listener associated with the token

producerNrOfTimesHeard | perceiverNrOfTimesHeard | accepted | simulationNr | valid
-----------|-----------|-----------|-----------|-----------
how often the agent-speaker has encountered the word | how often the agent-listener has encountered the word | whether or not the agent-listener memorised the token | the same as `condition` in the population snapshots | also the same as in the population snapshots, but here it will always be `TRUE`

- `params.yaml`: the saved list of all parameters which you set in `data/params.R`, so it is easy to remember and recreate the simulation you've done. `.yaml` is just a text format, so you can open it with any text editor.

The `.rds` format is a kind of archive which can be read using the following command:

```{r eval=FALSE}
# example of how to read RDS files
intLog <- readRDS(file.path(rootLogDir, 'ABM20191017115249/intLog.rds'))
```

Besides the simulation results, `rootLogDir` also contains `simulations_register.rds`. This is a list of all the simulations that are stored in `rootLogDir`, more precisely of all `params.yaml` files that stored in `rootLogDir`'s subdirectories. You can use the functions in `functions/simulations.R` to e.g. find certain simulations by their parameters, delete unfinished ABM runs from both the `rootLogDir` and the simulation register, etc. Only use the functions provided, do not alter the `simulations_register.rds` directly.

## Analysis of Results

When your simulations are finished, you can use `analysis.R` if you need help with loading, analysing or plotting the results.

## Demo

There is a demo available based on the data from Harrington & Schiel (2017, Language). You can start it by typing the command below; afterwards you'll be led through the demo in the console.

```{r eval=FALSE}
source("functions/demo.R")
```