---
title: "Difference in Distribution of Car Accidents in Poland in 2017, Based on Driver's Gender"
author: "Jakub Tomaszewski"
date: "11/19/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
```
## Introduction
It is a common belief that men are better drivers than women. Intuition suggests that men cause more car accidents, both because they tend to take more risks and because more men than women hold a driving licence. Let's see whether that holds when we compare proportions rather than the absolute numbers of accidents and licences. For example, we may find that although men cause more accidents in total, they cause fewer accidents per licence holder than women do.

First, we'll look at the overall statistics and compare them between the two groups. Then a chi-square test will be used to determine whether there is a statistically significant association between two variables: the driver's gender and whether the driver caused an accident.
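
For reference, the statistic behind a chi-square test of independence can be written as below. This is the standard Pearson formulation and is not taken from the data sources; the symbols are introduced here for illustration only: $O_{ij}$ is the observed count in row $i$ and column $j$ of the frequency table, $E_{ij}$ is the count expected under independence, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i\,C_j}{N}
$$

For a 2x2 table (gender by accident outcome) the statistic has $(2-1)(2-1) = 1$ degree of freedom. Note that R's `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so the value it reports can differ slightly from this uncorrected formula.
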
### Data Source
Two data sources were used, both for the year 2017:

1. Polish Police annual reports on road accidents (https://statystyka.policja.pl/st/ruch-drogowy/76562,Wypadki-drogowe-raporty-roczne.html)
2. Driving licences and other permissions data from the Central Register of Vehicles and Drivers (http://www.cepik.gov.pl/statystyki)
## Analysis
### First Impression
The data is broken down by gender and age group.
```{r full_data}
driving_licence <- readxl::read_excel("driving_licence_2017.xlsx")
knitr::kable(driving_licence)
```
```{r total, message=FALSE, warning=FALSE}
# Total number of licences per gender
total_licences <- driving_licence %>%
  dplyr::as_tibble() %>%
  dplyr::group_by(gender) %>%
  dplyr::summarize(licences_gender_sum = sum(licences))
total_licences_sum <- sum(total_licences$licences_gender_sum)
total_licences_sum
```
By the end of 2017, there were almost 21 million active driving licences in Poland.
```{r percentage, message=FALSE, warning=FALSE}
# Licence counts per gender
M_licences <- total_licences %>%
  dplyr::filter(gender == "M") %>%
  dplyr::pull(licences_gender_sum)
F_licences <- total_licences %>%
  dplyr::filter(gender == "F") %>%
  dplyr::pull(licences_gender_sum)
# Each gender's share of all licences, in percent
M_licences_perc <- (M_licences / total_licences_sum) * 100
F_licences_perc <- (F_licences / total_licences_sum) * 100
```
```{r percentage_print, echo=FALSE}
paste0(M_licences, " /// ", round(M_licences_perc, 2), "%")
paste0(F_licences, " /// ", round(F_licences_perc, 2), "%")
```
Almost 60% of them were held by men.
### Groups and Disproportions
```{r disproportions, echo=FALSE}
# total_share: the group's share of all licences, in percent
# group_share: the group's share of licences within its own gender, in percent
driving_licence %>%
  dplyr::left_join(total_licences, by = "gender") %>%
  dplyr::mutate(
    total_share = round((licences / total_licences_sum) * 100, 2),
    group_share = round((licences / licences_gender_sum) * 100, 2)
  ) %>%
  dplyr::select(-licences_gender_sum, -licences) %>%
  knitr::kable()
```
As you can see, there are large disproportions between age and gender groups. Among women, 50% of all licences are held by those aged 25 to 44, and the share shrinks with each older age group. Fewer than 9% of the licences held by women belong to the over-65 age group, which is slightly more than 3% of all driving licences in 2017.

The male group is far more balanced. All age groups except 18-24 have similar shares, with only small deviations. The biggest difference between the genders is in the over-65 group: men in this group held more than 20% of all licences held by men and 12% of all licences overall.
### Road Accidents by Gender
The numbers below come from the Polish Police annual report on road accidents (p. 26). In 2017, drivers caused 28 359 accidents (p. 23). Men were responsible for 73.5% of them and women for 22.5%; the driver's gender is not recorded for the remaining 4%. Those cases are excluded, so 27 225 is used as the total number of accidents.
```{r accidents}
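all_accidents <- 28359 # accidents caused by drivers; gender is unknown for 4% of them
M_accidents <- round(all_accidents * 0.735) # 73.5% caused by men
F_accidents <- round(all_accidents * 0.225) # 22.5% caused by women
all_accidents <- M_accidents + F_accidents # keep only accidents with a known gender
all_accidents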
```
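
Written out, the arithmetic in the chunk above is as follows. This is only a check of the rounding, using the total and percentages quoted from the police report:

$$
\begin{aligned}
28\,359 \times 0.735 &= 20\,843.865 \approx 20\,844 \\
28\,359 \times 0.225 &= 6\,380.775 \approx 6\,381 \\
20\,844 + 6\,381 &= 27\,225
\end{aligned}
$$
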
I assume that each accident was caused by a different driver, so the number of accidents equals the number of drivers who caused one.
```{r proportions}
F_accidents / F_licences
M_accidents / M_licences
```
These proportions suggest that the accident rate is higher among men than among women. Women caused around 7 accidents per 10 000 female licence holders, while men caused around 16 accidents per 10 000 male licence holders.
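
The conversion from a raw proportion to a rate per 10 000 licence holders can be sketched as below. The helper `rate_per_10k()` and the counts passed to it are hypothetical and only illustrate the calculation; the only figures taken from this analysis are the rounded rates of 16 and 7.

```r
# Hypothetical helper (not part of the original analysis):
# accidents per 10 000 licence holders.
rate_per_10k <- function(accidents, licences) {
  stopifnot(all(licences > 0))
  accidents / licences * 10000
}

# Made-up counts, used only to show the call:
# 50 accidents among 100 000 licence holders is a rate of 5 per 10 000.
example_rate <- rate_per_10k(accidents = 50, licences = 100000)
example_rate

# Ratio of the rounded rates quoted in the text (16 for men, 7 for women).
male_rate <- 16
female_rate <- 7
rate_ratio <- male_rate / female_rate
round(rate_ratio, 1)
```

With the rounded rates, the ratio comes out at roughly 2.3, meaning male licence holders caused a bit more than twice as many accidents per licence as female licence holders.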
### Chi-Square Test
The last part of this analysis is a chi-square test of independence. I started with a frequency table.
```{r frequency_table}
# Licence holders who did not cause an accident
M_no_accident <- M_licences - M_accidents
F_no_accident <- F_licences - F_accidents
# 2x2 frequency table: rows are gender, columns are the outcome
frequecy_table <- data.frame(
  accident = c(F_accidents, M_accidents),
  no_accident = c(F_no_accident, M_no_accident)
)
rownames(frequecy_table) <- c("F", "M")
frequecy_table <- as.table(as.matrix(frequecy_table))
frequecy_table
```
I then moved on to the test, to check whether the chance of causing a car accident depends on the driver's gender. The significance level is set to 5%. The test hypotheses are as follows:

H0: The driver's gender has no effect on the chance of causing a car accident (the two variables are independent).

H1: The driver's gender has an effect on the chance of causing a car accident.

```{r test}
chisq.test(frequecy_table)
```
Since the p-value is very small (< 2.2e-16) and below the significance level (0.05), we reject the null hypothesis. Thus, we conclude that there is a statistically significant association between gender and the chance of causing a car accident.
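
To make the mechanics of the test more concrete, here is a minimal sketch that computes the Pearson chi-square statistic by hand and compares it with `chisq.test()`. The counts in `tab` are hypothetical and are not the 2017 data; the Yates continuity correction is switched off (`correct = FALSE`) so that both calculations use the same formula.

```r
# Hypothetical 2x2 counts, NOT the 2017 Polish data:
# rows are gender, columns are whether the driver caused an accident.
tab <- matrix(
  c(30, 70, 9970, 9930),
  nrow = 2,
  dimnames = list(gender = c("F", "M"),
                  outcome = c("accident", "no_accident"))
)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic without continuity correction, with (2-1)*(2-1) = 1 df.
chi_sq <- sum((tab - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Built-in test with the Yates correction switched off, for comparison.
builtin <- chisq.test(tab, correct = FALSE)
all.equal(chi_sq, unname(builtin$statistic))
```

With very large counts, such as millions of licence holders, even a small difference in rates yields a tiny p-value, so it helps to read the test result alongside the rates themselves.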
## Conclusion
The number of accidents per 10 000 licence holders (16 for men, 7 for women) and the chi-square test result lead to the conclusion that, generally speaking, men have a higher car accident rate and cause more accidents than women, both proportionally and in absolute numbers.
### Other Research on This Subject
Murat Karacasu, Arzu Er; *An Analysis on Distribution of Traffic Faults in Accidents, Based on Driver's Age and Gender: Eskisehir Case*; Procedia - Social and Behavioral Sciences; Volume 20, 2011, Pages 776-785; available at https://www.sciencedirect.com/science/article/pii/S1877042811014662
Quote:
> Gender is one of the most important factors affecting human behaviors. These behavioral differences are seen in traffic as well; therefore, gender difference is reflected in different attitudes, reflexes and decisions in traffic (Lajunen, 1999).
> Most of the studies performed show that male drivers engage in more accidents than female ones. One reason, of course, is that the number of female drivers is significantly lower than the number of male drivers. However, it is an undeniable fact there is a difference between male and female behavior. Sudden reactions of males to events, their nervousness as well as their desire to “prove” themselves via their driving skills create a basis for accidents (The Social Issues Research Centre, 2004).
> It has been concluded with chi-square significance test that there are significant relationships among genders and fault types, and ages and fault types. It means that there are differences between the faults that male drivers committed and the faults that female drivers committed in traffic.