presentation.Rmd

---
output:
  ioslides_presentation:
    widescreen: true
    incremental: true
bibliography: references.bib
link-citations: true
urlcolor: blue
title: Statistical Analysis of the <br> Grit Scale Dataset
subtitle: Submitted to the Department of Statistics <br> of the Athens University of Economics and Business
author: Maximilian Schneider
date: "`r format(Sys.time(), '%d.%m.%Y')`"
---

<style>
.build > * {
    -webkit-transition: opacity 0ms ease-in-out;
    -webkit-transition-delay: 0ms;
    -moz-transition: opacity 0ms ease-in-out 0ms;
    -o-transition: opacity 0ms ease-in-out 0ms;
    transition: opacity 0ms ease-in-out 0ms;
}
slides > slide.title-slide {
    background-image: url('figures/background_title_page_1920x1080.jpg');
    background-size: 100% 100%;
}
slides > slide.dark {
    background-image: url('figures/background_title_page_1920x1080.jpg') !important;
    background-size: 100% 100% !important;
}
slides > slide.title-slide hgroup h1,
slides > slide.title-slide hgroup h2,
slides > slide.title-slide p,
slides > slide,
slides > slide hgroup h2 {
    color: black;
}
slides > slide.dark hgroup h2 {
    color: black;
    font-size: 72px;
    letter-spacing: 10;
}
div.footnotes {
    position: absolute;
    bottom: 0;
    margin-bottom: 10px;
    width: 80%;
    font-size: 0.6em;
}
table {
    caption-side: bottom
}
</style>

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
  $('slide:not(.backdrop):not(.title-slide)').append('<div class=\"footnotes\">');

  $('footnote').each(function(index) {
    var text  = $(this).html();
    var fnNum = (index+1).toString();
    $(this).html(fnNum.sup());

    var footnote   = fnNum + '. ' + text + '<br/>';
    var oldContent = $(this).parents('slide').children('div.footnotes').html();
    var newContent = oldContent + footnote;
    $(this).parents('slide').children('div.footnotes').html(newContent);
  });
});
</script>

```{r setup, include=FALSE}
# rmarkdown settings
knitr::opts_chunk$set(fig.align = "center", out.width = '80%', echo = FALSE, cache = FALSE,
                      message = FALSE, warning = FALSE)

# packages
library(data.table)
library(purrr)
library(magrittr)
library(ggplot2); theme_set(theme_bw())
library(mgcv)

# code
source("R/functions.R")
source("R/prepare-data.R")
source("R/application.R")
source("R/plots.R")
```

<!-- TODO

* Figure captions
* Note reference categories

-->

# The Big Five <br> Personality Dimensions

<div class="notes">
> * Scientific discipline: psychology
> * Describing the character of a person
> * Five continuous scales
</div>

## The Big Five personality dimensions<footnote>https://en.wikipedia.org/wiki/Big_Five_personality_traits</footnote>

<div class="notes">
> * Extraversion (outgoing -- solitary / energetic -- reserved)
> * Neuroticism (sensitive -- resilient / nervous -- confident)
> * Agreeableness (friendly -- critical / compassionate -- rational)
> * Conscientiousness (efficient -- extravagant / organized -- careless)
> * Openness to experience (inventive -- consistent / curious -- cautious)
</div>

* Extraversion
* Neuroticism
* Agreeableness
* Conscientiousness
* **Openness to experience**

# The Grid Scale Dataset | <footnote>Duckworth, Angela Lee, and Patrick D Quinn. 2009. “Development and Validation of the Short Grit Scale (GRIT–s).” Journal of Personality Assessment 91 (2): 166–74.</footnote>

<div class="notes">
> * Online questionnaire data
> * Study: "Development and Validation of the Short Grit Scale"
> * Angela Duckworth and Patrick Quinn
> * International Personality Item Pool (IPIP), Demographic data, Technical information, test on vocabulary
</div>

## Openness score

<div class="notes">
> * Likert-type scale, Encoding: 0 (Disagree strongly) to 4 (Agree strongly), NA
> * Questions 2, 4 and 6 other way around -- after transformation all items positively correlated
> * Score: sum of single items divided by four times the number of items answered
> * Possible scores: multiples of $\frac{1}{40}$ (all ten Openness items completed), $\frac{1}{36}$ (nine Openness items completed), $\frac{1}{32}$ ... -- negative skew
</div>

```{r plot-openness}
p2
```

## Eloquence score

<div class="notes">
> * Basis: List of 16 words, 3 unreal
> * "Mark all the words whose definitions you are sure to know."
> * Sum of correct answers
> * Assumption: No NAs
</div>

```{r plot-eloquence}
p1
```

## Other variables

<div class="notes">
> * Scales: treated as nominal
</div>

```{r}
ggtable(dt_bigf, sort(vars_cat)[1:4])
```

## Other variables

<div class="notes">
> * Error in race variable: misc - Indigenous Australian, Native American or White
</div>

```{r}
ggtable(dt_bigf, sort(vars_cat)[5:8])
```

## Other variables

<div class="notes">
> * Black line is the mean
> * Age starting at 13
> * Family size: number of kids of mother -- starting at 1
> * 8 automatically collected technical information: sanity check
</div>

```{r}
cowplot::plot_grid(ggtable(dt_bigf, sort(vars_cat)[9:10]),
                   cowplot::plot_grid(tmpp3, tmpp4, ncol = 1),
                   ncol = 2)
```

## Pairwise associations with Openness

<div class="notes">
> * Consider age and eloquence as non-linear
> * Family size as linear
</div>

```{r openness-continuous}
cowplot::plot_grid(p4, p3, p5, nrow = 3)
```

# Linear Regression Model for Openness

<div class="notes">
> * Full enumeration $\rightarrow$ avoid step-wise procedures (i.e., arbitrary choices) and problems with collinearity: only variables which contain unique information will be used
> * AIC: no theory $\rightarrow$ focus on prediction
</div>

## Model selection

<div class="notes">
> * No clear best model, see next slide
</div>

```{r aics}
res[1:5, .(AIC, Probability = scales::percent(probs, 0.01))] %>%
  knitr::kable(booktabs = TRUE,
               caption = "Top 5 models according to AIC and their probabilities to minimize the (estimated) information loss<footnote>Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.), Springer-Verlag.</footnote>")
```

## A posteriori inclusion probabilities

<div class="notes">
> * Only urban and married have a probability lower than 50%
> * Model using all covariates with a post incl prob > 50% is the model with the lowest AIC
> * Second best dropped voted, third best dropped family size $\rightarrow$ good models are similar
</div>

```{r post-inclusion-probs}
p6
```

## Model diagnostics and limitations

<div class="notes">
> * Low explanatory power of the model
> * Trend in the residuals due to discrete dependent variable
> * Residuals with negative skew
</div>

```{r}
n <- summary(m.lam.fin)$n
cowplot::plot_grid(
  data.table(fitted(m.lam.fin), m.lam.fin$model$openness) %>%
    ggplot(aes(V1, V2)) +
    geom_abline(intercept = 0, slope = 1, color = "grey", size = 1.3) +
    geom_jitter(size = 0.5, width = 0, height = 0.005, alpha = 0.4) +
    geom_smooth(se = FALSE, color = "#0075be", size = 1) +
    lims(x = c(0, 1.01), y = c(0, 1.01)) +
    labs(x = "Fitted values", y = "Openness score"),
  data.table(qnorm(seq(1/(2 * n), (2 * n - 1)/(2 * n), 1/n)),
             sort(residuals(m.lam.fin, type = "scaled.pearson"))) %>%
    ggplot(aes(V1, V2)) +
    geom_abline(intercept = 0, slope = 1, color = "grey", size = 1.3) +
    lapply(qnorm(c(0.025, 0.05, 0.25, 0.75, 0.95, 0.975)),
           function(x) geom_vline(xintercept = x)) +
    geom_line() +
    geom_point(alpha = 0.1, size = 0.5) +
    labs(x = "Theoretical Quantiles", y = "Standardized Residuals"),
  data.table(fitted(m.lam.fin),
             residuals(m.lam.fin, type = "scaled.pearson") %>% abs() %>% sqrt()) %>%
    ggplot(aes(V1, V2)) +
    geom_point(alpha = 0.4, size = 0.5) +
    geom_smooth(se = FALSE, color = "#0075be", size = 1) +
    labs(x = "Fitted values", y = bquote(sqrt("Standardized Residuals"))),
  data.table(residuals(m.lam.fin, type = "scaled.pearson")) %>%
    ggplot(aes(V1)) +
    geom_histogram(aes(y = ..density..), bins = 115, color = "#606161", fill = "#606161") +
    geom_function(fun = dnorm, color = "#0075be", size = 1) +
    xlim(-4, 4) +
    labs(x = "Standardized Residuals", y = "Density"),
  nrow = 2)
```

## Model diagnostics and limitations

<div class="notes">
> * Missing data: beyond scope of this course
> * Outlier: n is large
> * Skewness: bootstrap possible
> * Test for normality doesn't make sense: even small departures are significant
</div>

<div style="float: left; width: 50%;">
> * $n_\text{full}$: `r dt_bigf.fix[, .N]` (missing: `r dt_bigf[, .N] - dt_bigf.fix[, .N]` -- `r scales::percent((dt_bigf[, .N] - dt_bigf.fix[, .N]) / dt_bigf[, .N], 0.01)`)
> * $n_\text{final}$: `r n` (missing: `r dt_bigf[, .N] - n` -- `r scales::percent((dt_bigf[, .N] - n) / dt_bigf[, .N], 0.01)`)
> * Effective degrees of freedom: `r sum(m.lam.fin$edf) %>% round(2)`
> * Variance explained: `r scales::percent(summary(m.lam.fin)$dev.expl, 0.01)`
> * Skewed residuals
</div>

<div style="float: right; width: 50%;">
> * Online questionnaire data
> * "Analysis in the void": No context from psychology; why do we want to model Openness like this? \
  $\rightarrow$ Difficult to evaluate importance of violation of assumptions
> * Missing covariates possible
</div>

## Statistical significance

<div class="notes">
> * With enough observations everything is statistically significant
> * Not significant: engnat, family size, orientation, voted
</div>

<div style="float: left; width: 50%;">
```{r}
dt_f <- anova(m.lam.fin) %$%
  cbind(c(pTerms.table[, "F"], s.table[, "F"]), c(pTerms.pv, s.table[, "p-value"]) * 11) %>%
  round(3) %>%
  as.data.table(keep.rownames = TRUE)
dt_f[, V2 := ifelse(V2 > 0.5, ">0.5", V2)]
dt_f[, V2 := ifelse(V2 == "0", "<0.001", V2)]
setnames(dt_f, c("Covariate", "F-statistic", "p-value"))
setkey(dt_f, "Covariate")

knitr::kable(dt_f[1:6,], booktabs = TRUE)
```
</div>

<div style="float: right; width: 50%;">
```{r}
knitr::kable(dt_f[7:11], booktabs = TRUE,
             caption = "Drop one F-tests with Bonferroni correction (approximations for smooth terms<footnote>Wood, S.N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.). Chapman and Hall/CRC.</footnote>)")
```
</div>

## Coefficients

<div class="notes">
> * Do they make sense? / Are they as expected?
> * Ref Cat: Education: Less than high school, Engnat: no, Gender: Male, Hand: right, Orientation: heterosexual, Voted: no
> * Small effect sizes
> * Big CI: low number of observations
</div>

```{r coefficients1}
p7.1
```

## Coefficients

<div class="notes">
> * Race: misc - Indigenous Australian, Native American or White, Religion: agnostic
</div>

```{r coefficients2}
p7.2
```

## Smooth effects

<div class="notes">
> * Centered around zero - interpret changes
> * Age - concerning: in there search for a thing called "character" something consistent over time is desired (cohort effect?)
> * Eloquence makes sense - is part of the definition of Openness
</div>

```{r smooths}
cowplot::plot_grid(p10.1, p10.2, ncol = 1)
```

# Appendix

## Sensitivity - REML

```{r}
#p8
cowplot::plot_grid(ggcoef(m.lam.fin.sel), ggcoef(m.lam.fin))
```

## Sensitivity - splines

```{r}
cowplot::plot_grid(ggcoef(m.lm.fin, -c(27, 31)), ggcoef(m.lam.fin))
```