---
title: "London Gang Network Analysis"
author: "Stan Ionel Eduard"
date: "February 2019"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
We work on data about a London-based inner-city street gang, active from $2005$ to $2009$, whose members are black, operate from a social housing estate, and have co-offended together. The data is available on the [UCINET Software website](https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/londongang "London Gang") and comes from anonymised police arrest and conviction records for **all confirmed** members of the gang.
The network is **undirected** and it has $54$ nodes representing the monitored people having the following attributes:
* `Person`, representing the person's **identifier**,
* `Age`, representing the **age** of such person,
* `Birthplace`, representing the **birthplace** of such person ($1$ for West Africa, $2$ for Caribbean, $3$ for United Kingdom, and $4$ for East Africa),
* `Residence`, representing the fact that the person was **resident**,
* `Arrests`, representing the **number of arrests** of such person,
* `Convictions`, representing the **number of convictions** of such person,
* `Prison`, representing the fact that such person had already been to **prison**,
* `Music`, representing the fact that such person **listened to music**,
* `Ranking`, representing a **ranking** of such person.
The ties of such network are given as a **weighted adjacency matrix**, i.e. a $54 \times 54$ (weighted) matrix, where the weights are:
* $1$, representing the fact that the connected people have **hung out** together,
* $2$, representing the fact that the connected people have **co-offended together**,
* $3$, representing the facts that the connected people have **co-offended together** and that they have **committed serious crimes**,
* $4$, representing the facts that the connected people have **co-offended together**, have **committed serious crimes**, and that there is a **kin relationship** (i.e. a family relationship) between them.
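This coding can be captured by a small lookup table; here is a minimal sketch (the name `tie_labels` is ours, not part of the dataset), handy later for legends and captions:
```{r, eval=FALSE}
# hypothetical helper: human-readable labels for the tie strengths
tie_labels <- c("1" = "hung out",
                "2" = "co-offended",
                "3" = "co-offended + serious crimes",
                "4" = "co-offended + serious crimes + kin")
tie_labels["4"] # e.g. look up the meaning of weight 4
```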
In the subsequent analysis, we use **nodes** and **people** interchangeably. The same holds for **edges**, **ties** and **relations**. We use the well-known **tidy approach** to carry out our analysis.
Our objective is to analyze this network in terms of the **centrality and power of people**. Moreover, we aim to find patterns between subjects based on their birthplace; therefore, we focus on the **similarity between nodes** and present some **social validation facts** based on the knowledge acquired during the analysis.
## Installation
Before we begin our analysis, first we need to install the following packages:
```{r, eval=F}
install.packages("tidyverse") # core of the tidy approach
install.packages("igraph") # igraph package for graphs
install.packages("ggraph") # ggraph package for graphs
```
and then we load such packages:
```{r}
library(tidyverse)
library(igraph)
library(ggraph)
```
# Data import
We read the CSV files and store such information as `tibble`s:
```{r}
# gather the relations from the CSV file
relations <- as_tibble(read.table("data/gang.csv", header=TRUE, sep=",")) %>%
  # keep the 54 tie columns (X1..X54), dropping the row-identifier column
select(X1:X54)
# gather the people from the CSV file
people <- read_csv("data/attr.csv", col_names=TRUE) %>%
# rename the column X1 as Person
rename(Person=X1)
```
Then we perform a preliminary sanity check to verify that the information is stored as we want:
```{r}
dim(relations) # 54 x 54 (i.e. the weighted adjacency matrix)
class(relations) # print the class of the relations
head(relations) # print the first observations of the relations
dim(people) # 54 x 9 (i.e. 54 people with their own attributes)
class(people) # print the class of the people
head(people) # print the first observations of the people
```
# Graph generation
We need to manipulate the `relations` data frame in order to build a graph. We generate some containers (i.e. `from`, `to`, and `weight`) that store the information about such weighted `relations` as:
```{r}
from <- to <- weight <- NULL # containers
# for each row: the matrix is symmetric and the graph undirected, so we
# visit only the upper triangle to avoid duplicating every edge
for (i in 1:(nrow(relations)-1)) {
  # for each column above the diagonal
  for (j in (i+1):ncol(relations)) {
if (relations[i,j] != 0) {
# attach an observation for each edge: (from, to, weight)
from <- c(from, i)
to <- c(to, j)
weight <- c(weight, relations[i,j])
}
}
}
```
and create $2$ `tibble`s, one for the `ties` and one for the `nodes`, as:
```{r}
# ties
ties <- tibble(
from = from, # from
to = to, # to
weight = unlist(weight) # weight
)
# nodes
nodes <- tibble(
person = 1:nrow(people), # person
age = people$Age, # age
birthplace = factor(people$Birthplace), # birthplace
residence = factor(people$Residence), # residence
arrests = people$Arrests, # arrests
  convictions = people$Convictions, # convictions
prison = factor(people$Prison), # prison
music = factor(people$Music), # music
ranking = factor(people$Ranking) # ranking
)
```
Now, we are ready to generate our graph using the `igraph` package and to (preliminarily) plot it:
```{r}
# create an undirected graph
g <- graph_from_data_frame(ties, directed=FALSE, vertices = nodes)
# plot the graph
plot(g, layout=layout_nicely(g),
vertex.size=15, vertex.color='yellow', vertex.shape="circle",
edge.label=ties$weight, edge.curved=TRUE)
```
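As an aside, `igraph` can also build the same weighted graph directly from the adjacency matrix; the following one-liner is a sketch (`g_alt` is a hypothetical name, and we keep the loop-based construction for the rest of the analysis):
```{r, eval=FALSE}
# sketch: build the undirected weighted graph in one call;
# mode="undirected" requires a symmetric matrix, which holds for our data
g_alt <- graph_from_adjacency_matrix(unname(as.matrix(relations)),
                                     mode = "undirected", weighted = TRUE)
```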
Moreover, we can check the attributes of such graph as:
```{r}
attributes(summary(g)) # explore the attributes
```
# Exploratory data analysis (EDA)
In this section we are going to explore our dataset in order to find some preliminary patterns.
In Statistics, a **rule of thumb** when performing EDA is to accompany graphical plots with (numerical) summaries. Therefore, we present such information as:
```{r}
summary(nodes$age) # summary for age
ggplot(nodes, aes(x=age, stat(density))) + # distribution for age
geom_histogram(binwidth=1.2, alpha=2/5) + # histogram
geom_density(color="red", size=1) # density estimate
table(nodes$birthplace) # contingency table for birthplace
ggplot(nodes, aes(x=birthplace)) + # plot such contingency table
geom_bar(aes(fill=birthplace), alpha=2/5) # barplot
table(nodes$residence) # contingency table for residence
ggplot(nodes, aes(x=residence)) + # plot such contingency table
geom_bar(aes(fill=residence), alpha=2/5) # barplot
summary(nodes$arrests) # summary for arrests
ggplot(nodes, aes(x=arrests, stat(density))) + # distribution for arrests
geom_histogram(binwidth=2, alpha=2/5) + # histogram
geom_density(color="red", size=1) # density estimate
summary(nodes$convictions) # summary for convictions
ggplot(nodes, aes(x=convictions, stat(density))) + # distribution for convictions
geom_histogram(binwidth=2, alpha=2/5) + # histogram
geom_density(color="red", size=1) # density estimate
table(nodes$prison) # contingency table for prison
ggplot(nodes, aes(x=prison)) + # plot such contingency table
geom_bar(aes(fill=prison), alpha=2/5) # barplot
table(nodes$music) # contingency table for music
ggplot(nodes, aes(x=music)) + # plot such contingency table
geom_bar(aes(fill=music), alpha=2/5) # barplot
table(nodes$ranking) # contingency table for ranking
ggplot(nodes, aes(x=ranking)) + # plot such contingency table
geom_bar(aes(fill=ranking), alpha=2/5) # barplot
```
Suppose that we would like to know:
> How do arrests and convictions change as a function of birthplace and residence?
We could extract such information as:
```{r}
ggplot(nodes, aes(x=birthplace, y=arrests)) + # arrests as a function of birthplace
geom_boxplot(aes(color=birthplace)) # boxplot arrests vs birthplace
ggplot(nodes, aes(x=birthplace, y=convictions)) + # convictions as a function of birthplace
  geom_boxplot(aes(color=birthplace)) # boxplot convictions vs birthplace
ggplot(nodes, aes(x=residence, y=arrests)) + # arrests as a function of residence
geom_boxplot(aes(color=residence)) # boxplot arrests vs residence
ggplot(nodes, aes(x=residence, y=convictions)) + # convictions as a function of residence
  geom_boxplot(aes(color=residence)) # boxplot convictions vs residence
```
Note that, although we have a small dataset, we can say that **people from West Africa** (i.e. `birthplace` equal to $1$) **tend to have heavier criminal records** than the others, as witnessed by their (generally) higher numbers of arrests and convictions. Moreover, **people without a residence tend to have heavier criminal records**, by similar reasoning.
Now, suppose that we would like to know:
> How do arrests and convictions change as a function of age?
We can extract such information through a scatterplot in the following way:
```{r}
ggplot(nodes, aes(x=age, y=arrests)) + # arrests as a function of age
geom_point(aes(color=birthplace), size=3) + # scatterplot
geom_smooth() # smooth line through points (with standard error)
ggplot(nodes, aes(x=age, y=convictions)) + # convictions as a function of age
geom_point(aes(color=birthplace), size=3) + # scatterplot
geom_smooth() # smooth line through points (with standard error)
```
Note that we observe an expected behaviour: **people tend to commit fewer crimes up to a certain age**, in this case $\approx 18$ years old, after which criminal activity increases. Perhaps this is because, up to that "fragile" age, the parents of such individuals assume full responsibility for their children, whereas grown-up individuals are more likely to act as criminals.
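As a quick numerical companion to the plots, we may also look at the rank correlation between age and the two counts (a sketch):
```{r, eval=FALSE}
# sketch: Spearman rank correlation between age and arrests/convictions
cor(nodes$age, nodes$arrests, method = "spearman")
cor(nodes$age, nodes$convictions, method = "spearman")
```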
# Centrality measures
We can simply add centralities to nodes with the following commands:
```{r}
V(g)$degree <- degree(g) # degree centrality
V(g)$betweenness <- betweenness(g) # betweenness centrality
V(g)$closeness <- closeness(g) # closeness centrality
V(g)$pr <- page_rank(g)$vector # pagerank centrality
attributes(summary(g)) # investigate the (new added) attributes
```
In order to use the `dplyr` package, which is already loaded as part of the `tidyverse`, we add the same information to the `nodes` tibble:
```{r}
nodes <- nodes %>%
mutate(
degree = degree(g), # degree centrality
    betweenness = betweenness(g), # betweenness centrality
closeness = closeness(g), # closeness centrality
pr = page_rank(g)$vector) # pagerank centrality
```
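Since the ties are weighted, one might also consider the weighted degree, known as **strength**, which sums the weights of the incident edges instead of counting them; a sketch (not used in the sequel):
```{r, eval=FALSE}
# sketch: weighted degree ("strength") sums the incident edge weights
V(g)$strength <- strength(g, weights = E(g)$weight)
```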
Let's start with a **simple question**:
> Who is the most central person w.r.t. the centrality measures?
To answer this question, we sort the nodes by each centrality in decreasing order:
```{r}
nodes %>%
select(person,degree,betweenness,closeness,pr) %>%
arrange(desc(degree)) # arrange by degree centrality (decreasing order)
nodes %>%
select(person,degree,betweenness,closeness,pr) %>%
arrange(desc(betweenness)) # arrange by betweenness centrality (decreasing order)
nodes %>%
select(person,degree,betweenness,closeness,pr) %>%
arrange(desc(closeness)) # arrange by closeness centrality (decreasing order)
nodes %>%
select(person,degree,betweenness,closeness,pr) %>%
arrange(desc(pr)) # arrange by pagerank centrality (decreasing order)
```
Note that, for this network, all centralities more or less agree that node $1$ is the most central. Therefore, we can highlight such node in the network:
```{r}
V(g)$color <- "white" # set the color for all nodes to white
V(g)[1]$color <- "red" # set the color for node 1 to red
plot(g, layout=layout_with_fr(g), vertex.label=NA, vertex.size = 5)
```
We can use the `ggraph` package to plot a more polished graph, where the edges are more or less transparent based on their weights (`weight`) and the nodes are sized according to their PageRank centrality (`pr`):
```{r}
ggraph(g, layout = "with_kk") +
geom_edge_link(aes(alpha = weight)) + # edge link geom, mapping alpha to weight
geom_node_point(aes(size = pr)) + # node point geom, mapping size to pagerank centrality
theme_graph(base_family="sans")
```
or perhaps a more intuitive plot is the following:
```{r}
ggraph(g, layout = "with_kk") +
geom_edge_link(aes(color = factor(weight))) + # edge link geom, mapping color to weight
geom_node_point(aes(size = pr)) + # node point geom, mapping size to pr centrality
facet_edges(~factor(weight)) # facet geom as function of weight
```
where we point out the relations between nodes. For instance, if `weight` is $4$, then the link means that the connected nodes have co-offended together, committed serious crimes, and have a kin relationship (bottom-right panel); this brings us to the next question:
> What is the proportion of weak ties in the network?
Weak ties are very important for several reasons: for instance, strengthening weak ties may lead to better opportunities. In our case, that means becoming more likely to commit crimes with others (**which is not so good, is it?**). We can compute such proportion in a straightforward way:
```{r}
ties %>%
group_by(weight) %>% # group by weight
summarise(n = n(), proportion = n / nrow(ties)) %>% # compute summary
arrange(desc(n)) # sort in decreasing order by n
```
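Equivalently, the share of weak (hang-out) ties is a one-liner (a sketch):
```{r, eval=FALSE}
# sketch: proportion of ties with weight 1, i.e. mere hang-out relations
mean(ties$weight == 1)
```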
It seems that $\approx 58\%$ of the ties in the network are mere hang-out relations: perhaps **hanging out with the "wrong" people may lead to committing (serious) crimes**. Reasoning by complementation, $\approx 42\%$ of the ties involve some form of co-offending.
# Connection between nodes
Finding connection patterns between nodes is perhaps one of the **most interesting things** that an analyst would like to explore. We would like to answer the following question:
> What is the proportion of similar and dissimilar people?
To answer this question, we use the **Pearson correlation coefficient** to detect similarities. In order to compute such metric, we extract the adjacency matrix from our graph `g` and compute the correlation matrix as:
```{r}
A <- as_adjacency_matrix(g, attr = "weight", names = FALSE) # get adjacency matrix
S <- cor(as.matrix(A)) # compute the correlation matrix
diag(S) <- 0 # set the diagonal to 0
```
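For intuition, the similarity between two nodes is just the Pearson correlation between their connection profiles, i.e. the corresponding columns of the adjacency matrix; a minimal sketch:
```{r, eval=FALSE}
# sketch: similarity of nodes 1 and 2 = correlation of their columns of A;
# this matches S[1,2], which is off-diagonal and unaffected by diag(S) <- 0
M <- as.matrix(A)
cor(M[, 1], M[, 2])
```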
Given `S`, we can plot the distribution of the similarities in the following way:
```{r}
V <- as.vector(S) # flatten S to a vector
ggplot(tibble(value = V), aes(x=value)) +
geom_histogram(aes(alpha=2/5), bins=30, show.legend=FALSE) + # histogram
geom_freqpoly(aes(color="red"), size=1, bins=30, show.legend=FALSE) # frequency polygon
```
where $1$ means positive similarity, $-1$ means negative similarity (i.e. dissimilarity), and $0$ means no correlation at all.
Note that the above plot shows a **positive skewness** (alternatively called **right skewness**). To answer the question, we compute the proportion of similar (dissimilar) nodes, i.e. those pairs whose similarity is greater than (at most) $0$:
```{r}
prop_sim <- mean(V > 0) # proportion of pairs with positive similarity
print(prop_sim) # print such proportion
print(1 - prop_sim) # print the complementary proportion (i.e. the dissimilar pairs)
```
Not surprisingly, there is a **lot of similarity** between nodes, which is an expected behaviour in this context. Moreover, let's try to answer the following question:
> Which are the most similar people?
To answer this question, we can use the computed similarity matrix `S` directly: we generate a weighted (undirected) graph `simil_graph` from it, where the weights are the similarity scores, and then build a `tibble`, named `simil_edges`, where each row corresponds to an edge together with its similarity score:
```{r}
# create the similarity graph from the similarity matrix S
simil_graph <- graph_from_adjacency_matrix(S, mode = "undirected", weighted = TRUE)
# extract the (weighted) edges from such similarity graph
simil_edges <- as_tibble(as_data_frame(simil_graph, what = "edges")) %>%
mutate(from = as.integer(from), to = as.integer(to)) %>%
arrange(desc(weight))
```
Finally, we are ready to answer the question. In order to give a more **realistic answer**, we keep only those nodes whose degree is greater than or equal to $5$, and then print the most similar pairs:
```{r}
# compute the joint similarities for the people whose degree is greater or equal to 5
joint_simil <- simil_edges %>%
left_join(nodes, c("from" = "person")) %>%
left_join(nodes, c("to" = "person")) %>%
filter(degree.x >= 5, degree.y >= 5) %>%
select(from, to, weight)
joint_simil # print the most similar nodes
```
To get an intuitive idea of the similarity in the original network, we can keep only the most similar pairs and then plot the graph generated from them:
```{r}
# filter those pairs with similarity higher than 0.6
filtered_pairs <- joint_simil %>%
filter(weight > 0.6)
# create a similarity network with the filtered pairs
similarity_network <- graph_from_data_frame(filtered_pairs, directed=FALSE)
# get the nodes (in order to plot their id)
simil_nodes <- as_tibble(as_data_frame(similarity_network, what = "vertices"))
# plot the result
ggraph(similarity_network, layout="with_kk") +
geom_edge_link(aes(alpha=weight), edge_width=1) + # edge link, mapping alpha to weight
  geom_node_text(aes(label=name), repel=TRUE) + # node text, mapping label to the node id (i.e. the name attribute)
theme_graph(base_family="sans")
```
> Does the similarity between the most similar people change as a function of their birthplace?
In order to calculate such a score, we create a `tibble`, named `pairs_with_bp`, by joining `filtered_pairs`, which we calculated before, with `simil_nodes_with_bp`, i.e. the most similar nodes projected onto their `name` and `birthplace`. Given `pairs_with_bp`, we compute the metric of interest (i.e. the proportion of similar people sharing the same birthplace) in the following way:
```{r}
# convert the name of the similar nodes from the similarity network to an integer (i.e. its id)
simil_nodes$name <- as.integer(simil_nodes$name)
# create a tibble of (similar) nodes with their birthplace
simil_nodes_with_bp <- simil_nodes %>%
left_join(nodes, by=c("name" = "person")) %>%
select(name, birthplace)
# create a joint tibble of the most similar pairs with their birthplace
# recall that filtered_pairs are those pairs whose similarity is greater than 0.6
pairs_with_bp <- filtered_pairs %>%
left_join(simil_nodes_with_bp, c("from" = "name")) %>%
left_join(simil_nodes_with_bp, c("to" = "name"))
# compute and print the proportion of similar pairs whose endpoints
# share the same birthplace
mean(pairs_with_bp$birthplace.x == pairs_with_bp$birthplace.y)
```
This behaviour can be checked visually with the following plot:
```{r}
# generate the same similarity network augmented with the nodes' birthplace
similarity_network_with_bp <- graph_from_data_frame(filtered_pairs,
directed=FALSE,
vertices=simil_nodes_with_bp)
ggraph(similarity_network_with_bp, layout="with_kk") +
geom_node_point(aes(color=factor(birthplace)), size=3) + # geom node point, mapping color to birthplace
geom_edge_link(aes(alpha=weight), edge_width=1) + # geom edge link, mapping alpha to weight
  geom_node_text(aes(label=name), repel=TRUE) + # geom node text, mapping label to the node id
theme_graph(base_family="sans")
```
The same behaviour is also supported empirically by the following (bottom-up, i.e. agglomerative) **clustering analysis**, an unsupervised learning technique that groups people together based on their similarities:
```{r}
# distance matrix from similarity matrix
D <- 1-S
# distance object from distance matrix
d <- as.dist(D)
# perform an average-linkage clustering method
cc <- hclust(d, method="average")
# plot the dendrogram of such clustering
plot(cc)
# cut the dendrogram at 10 clusters and color such clusters
clusters.list = rect.hclust(cc, k = 10, border="red")
# cut dendrogram at 10 clusters
clusters = cutree(cc, k = 10)
# plot the graph with nodes colored based on their cluster membership
plot(g, vertex.color=clusters, layout=layout_with_fr(g))
```
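To quantify this visual impression, one could also cross-tabulate cluster membership against birthplace (a sketch):
```{r, eval=FALSE}
# sketch: contingency table of cluster membership vs birthplace
table(cluster = clusters, birthplace = nodes$birthplace)
```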
Note that similar nodes from the previous analysis are clustered together. Perhaps a more stylish plot containing the same information is the following:
```{r}
# extract the dendrogram
dendrogram = as.dendrogram(cc)
# plot a circular dendrogram
ggraph(dendrogram, layout = 'dendrogram', circular = TRUE) +
geom_edge_diagonal() +
geom_node_text(aes(filter = leaf, label = label, x = x*1.03, y=y*1.03), size = 3) +
theme_graph(base_family="sans")
```
Now, we would like to answer a **more challenging question**:
> Is it true that even people who know many others tend to commit crimes with those who share their birthplace?
Note that, in the above analysis, we filtered only those nodes with degree greater than or equal to $5$. To answer this interesting question, we essentially redo the same analysis, but filter on nodes with degree greater than or equal to $15$:
```{r}
# compute the joint similarities
joint_simil2 <- simil_edges %>%
left_join(nodes, c("from" = "person")) %>%
left_join(nodes, c("to" = "person")) %>%
filter(degree.x >= 15, degree.y >= 15) %>%
select(from, to, weight)
# filter those pairs with similarity higher than 0.6
filtered_pairs2 <- joint_simil2 %>%
filter(weight > 0.6)
# create a similarity network with the filtered pairs
similarity_network2 <- graph_from_data_frame(filtered_pairs2, directed=FALSE)
# get the nodes
simil_nodes2 <- as_tibble(as_data_frame(similarity_network2, what = "vertices"))
# convert the name to an integer (i.e. its id)
simil_nodes2$name <- as.integer(simil_nodes2$name)
simil_nodes_with_bp2 <- simil_nodes2 %>%
left_join(nodes, by=c("name" = "person")) %>%
select(name, birthplace)
# create a joint tibble of the most similar pairs with their birthplace
pairs_with_bp2 <- filtered_pairs2 %>%
left_join(simil_nodes_with_bp2, c("from" = "name")) %>%
left_join(simil_nodes_with_bp2, c("to" = "name"))
# compute and print the proportion of similar pairs whose endpoints
# share the same birthplace
mean(pairs_with_bp2$birthplace.x == pairs_with_bp2$birthplace.y)
```
Again, the behaviour can be checked by the following plot:
```{r}
similarity_network_with_bp2 <- graph_from_data_frame(filtered_pairs2,
directed=FALSE,
vertices=simil_nodes_with_bp2)
ggraph(similarity_network_with_bp2, layout="with_kk") +
geom_node_point(aes(color=factor(birthplace)), size=3) +
geom_edge_link(aes(alpha=weight), edge_width=1) +
  geom_node_text(aes(label=name), repel=TRUE) +
theme_graph(base_family="sans")
```
Hence, we claim that it is true that **people with more acquaintances tend to commit crimes together with those who are ethnically similar to them**.
# Power analysis
We are interested in comparing the (PageRank) centrality of people against their power. In particular, we would like to identify people that are both central and powerful. In order to perform such analysis, we define the following function (authored by Enrico Bozzo and Massimo Franceschet) that computes the power of people:
```{r}
# Compute the power vector x satisfying x = (1/x) A
# INPUT:
#   A = graph adjacency matrix
#   t = precision
# OUTPUT: a list with:
#   vector = power vector
#   iter   = number of iterations
power = function(A, t) {
n = dim(A)[1];
# x_2k
x0 = rep(0, n);
# x_2k+1
x1 = rep(1, n);
# x_2k+2
x2 = rep(1, n);
diff = 1
eps = 1/10^t;
iter = 0;
while (diff > eps) {
x0 = x1;
x1 = x2;
x2 = (1/x2) %*% A;
diff = sum(abs(x2 - x0));
iter = iter + 1;
}
# it holds now: alpha x2 = (1/x2) A
alpha = ((1/x2) %*% A[,1]) / x2[1];
# hence sqrt(alpha) * x2 = (1/(sqrt(alpha) * x2)) A
x2 = sqrt(alpha) %*% x2;
return(list(vector = as.vector(x2), iter = iter))
}
```
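As the comment in the code states, the fixed point computed above satisfies the power equation $x = (1/x)\,A$, where $x$ is a row vector and $1/x$ denotes its entrywise reciprocal; componentwise, since $A = (a_{i,j})$ is symmetric,
$$x_i = \sum_{j} \frac{a_{i,j}}{x_j},$$
that is, a node is powerful if it is connected to nodes with low power.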
This function computes the power of the people in an iterative fashion. Therefore, we compute such powers as:
```{r}
damping = 0.15 # damping factor
I <- diag(damping, vcount(g)) # identity matrix (with the damping factor on its diagonal)
Ad <- A + I # A + I
p <- power(Ad, 6)$vector # compute power
V(g)$power <- p # assign power to the nodes
nodes <- nodes %>% # mutate the tibble nodes adding the new attribute power
mutate(power = p)
nodes %>% # sort the nodes in decreasing order by their pagerank centrality
select(person, pr) %>%
arrange(desc(pr))
nodes %>% # sort the nodes in decreasing order by their power score
select(person, power) %>%
arrange(desc(power))
```
We can also compute the correlations between such scores as:
```{r}
cor(nodes$pr, nodes$power) # Pearson correlation coef.
cor(nodes$pr, nodes$power, method="spearman") # Spearman rank correlation coef.
cor(nodes$pr, nodes$power, method="kendall") # Kendall tau correlation coef.
```
> Who are the most central and powerful people?
To answer this question, we plot such people and create a `tibble`, named `power_central_nodes`, as:
```{r}
mc <- quantile(nodes$pr, 0.75) # compute the 0.75 pagerank centrality quantile
mp <- quantile(nodes$power, 0.75) # compute 0.75 power measure quantile
# scatterplot : pagerank centrality vs power
plot(nodes$power, nodes$pr, xlab="power", ylab="PageRank centrality", pch=NA, bty="l")
text(nodes$power, nodes$pr, labels=nodes$person, adj=c(0.5,0.5), cex=1, col = "blue")
abline(h=mc, lty=3) # add horizontal (dotted) line
abline(v=mp, lty=3) # add vertical (dotted) line
power_central_nodes <- nodes %>% # compute the most powerful and central people
filter(power >= mp & pr >= mc) %>%
select(person, age, birthplace, pr, power)
power_central_nodes # print such people
```
> Based on their similarities, power and (PageRank) centrality, what is the proportion of people that have a kin relationship?
To get the most powerful and central nodes among the similar ones, we can easily compute:
```{r}
simil_power_central_nodes <- power_central_nodes %>% # compute similar, powerful and central nodes
inner_join(simil_nodes, c("person" = "name"))
simil_power_central_nodes # print such people
```
Recall our similarity network graph, where we restricted ourselves to the people who have at least $5$ acquaintances:
```{r}
ggraph(similarity_network_with_bp, layout="with_kk") +
geom_node_point(aes(color=factor(birthplace)), size=3) +
geom_edge_link(aes(alpha=weight), edge_width=1) +
  geom_node_text(aes(label=name), repel=TRUE)
```
Note that, given the graph and the `simil_power_central_nodes` structure, person $3$ is isolated from the others; therefore, we can exclude such person from the analysis. To answer our question, we can do the following:
```{r}
# remove the node 3
simil_power_central_nodes <- simil_power_central_nodes %>%
filter(person != 3)
simil_power_central_nodes # print the most similar, powerful and central nodes (without 3)
# compute the edges between such nodes
# NOTE: since the graph is undirected, there is only one row per edge
ee <- filtered_pairs %>%
inner_join(simil_power_central_nodes, c("from" = "person")) %>%
inner_join(simil_power_central_nodes, c("to" = "person"))
ee_f <- as.vector(ee$from) # flatten "from" to a vector
ee_t <- as.vector(ee$to) # flatten "to" to a vector
# get the original weight for such pairs
# NOTE: in the similarity graph that we have computed before
# the weights of the edges are the similarity scores between nodes
filtered_pairs %>%
inner_join(ties, c("from" = "from", "to" = "to")) %>%
filter(from %in% ee_f, to %in% ee_t) %>%
select(from, to, weight.y) %>%
rename(weight = weight.y)
```
As you can see, not surprisingly, **people that are similar, powerful and central** (w.r.t. the measures computed before) **are more likely to stick with their own family**, as expressed by the weight of the relation between such nodes (i.e. $4$ in this case). This answers our question: the proportion of such subjects with a kin relationship is $100\%$.
# Model
Suppose now that we would like to make predictions about individuals. For instance, we could try to classify people based on whether they have been to prison or not (i.e. `Prison=1` or `Prison=0`). Therefore, we can formalize a **classification problem**. Briefly, the classification setting lies within the statistical learning framework, where each observation (a person, in our case) has a label (here, whether they have been to prison or not). Moreover, since we have a label for each observation, such a problem is said to be a **supervised learning problem**.
Hereby, we briefly analyze a fitted **generalized linear model**:
```{r}
# create a data frame
d <- as.data.frame(people)
# convert the Prison attribute to a factor
d$Prison <- as.factor(d$Prison)
# fit a generalized linear model
mod1 <- glm(Prison~Age+Birthplace+Arrests+Convictions+Ranking+Music+Residence,
d, family=binomial)
# analyze the fitted model
summary(mod1)
```
Note that many of the $p$-values for the $z$-test statistics are very high (i.e. greater than $0.05$), which means that not all regressors, namely the attributes, are statistically significant. In Statistics, one approach to fit a better model is backward elimination: remove, one at a time, the regressor with the highest $p$-value, and refit a (new) model with the remaining ones. We skip such (tedious) analysis and directly present the final model:
```{r}
# fit a better model
mod2 <- glm(Prison~Arrests, d, family=binomial)
# analyze the fitted model
summary(mod2)
```
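As a quick usage sketch (with a hypothetical input), the fitted model can turn a number of arrests into a predicted probability of having been to prison:
```{r, eval=FALSE}
# sketch: predicted probability of prison for a hypothetical person
# with 10 arrests; type="response" returns probabilities instead of log-odds
predict(mod2, newdata = data.frame(Arrests = 10), type = "response")
```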
The second model (`mod2`) is better since its AIC score is lower. Moreover, we can visually check this behaviour by fitting a **decision tree**, using the `rminer` package, with the following commands:
```{r}
# load rminer package
library(rminer)
# fit a decision tree with all regressors (i.e. attributes)
M <- fit(Prison~Age+Birthplace+Arrests+Convictions+Ranking+Music+Residence,
data=d,
model="rpart")
# print the model
print(M@object)
# plot the (compressed) model
plot(M@object,uniform=T,compress=T)
# put "fancy" text on the model (the nodes) with additional options for a better view
text(M@object,fancy=T,xpd=T,fwidth=0.1,fheight=0.2)
# predict on the training set (i.e. the same data)
P <- predict(M,d)
# confusion matrix
print(mmetric(d$Prison,P,"CONF")$conf)
# all performances (accuracy ACC, classification error CE, true positive rate TPR, etc.)
print(mmetric(d$Prison,P,"ALL"))
# ROC curve analysis for target class (TC) 1, i.e. arrested
mgraph(d$Prison,P,graph="ROC",TC=1,main="arrested ROC",
baseline=TRUE,leg="Arrested",Grid=10)
```
# Conclusions and discussion
In this work, we have analyzed a gang network of black people who committed crimes in London during a police observation period from $2005$ to $2009$. We downloaded the (anonymised) data and built a graph from it. Every self-respecting statistical analysis has to start with a preliminary Exploratory Data Analysis (EDA), which we have done briefly.
We then moved on to the centrality analysis of the network and pointed out some (preliminary) interesting facts. For instance, the network is very well tied: people have hung out together, and more than $40\%$ of the ties involve crimes committed together, whether or not the people involved belong to the same family.
In our opinion, the most interesting part of the presented analysis concerns the connection patterns between nodes. The criminality analysis was driven by birthplace, and some important empirical findings emerged. For instance, very connected people tend to act as criminals with their own people, that is, with those who share their birthplace. Moreover, strongly related people that are powerful and central also tend to commit crimes along with their families, which is very natural if we think, for example, of the Italian Mafia.
For the sake of completeness, we presented a short introduction on how one could fit a generalized linear model in order to classify instances. The scope of this work was not centered on such type of analysis, although one could investigate such aspects further. For instance, one could fit a linear model to address a regression problem, or use `rminer`'s arsenal of models to dig deeper into the modelling playground.
We have achieved our goals, showing that network science is a powerful tool to investigate the peculiarities of networks. For example, based on our case study, the presented analysis could be used by the police in order to catch the bad guys.