-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path02_multivariable-plots.Rmd
289 lines (172 loc) · 10.1 KB
/
02_multivariable-plots.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
---
title: "R Notebook"
output: html_notebook
editor_options:
chunk_output_type: inline
---
# Introduction: Multivariable plots
Multivariable plots allow us to examine the relationships between two or more variables. Or we an examine aspects of interest with univariable plots but by different levels of another variable. Ultimately, we are able to start to see the relationships between several variables at the same time.
We are still interested in some of the same aspects of the data as with univariable plots:
+ Location of observations -- where do they fall; how large or small are values.
+ Dispersion or spread -- how spread out they are or how close together.
+ Distribution -- how often/frequently do some values occur.
# Packages
R packages are collections of functions and data sets developed by the open source community to build and expand R's functionalities. (http://www.sthda.com/english/wiki/installing-and-using-r-packages)
The packages that we intend to use today have been pre-installed for you, but we do need to load them in order to use them in our work.
```{r load-packages}
library(tidyverse) # Contains packages: ggplot, dplyr, readr, and the %>% operator
library(readxl) # For importing excel files into R.
```
# Import the data
For the following exercises we will be using a data set collected to determine factors associated with myopia (near-sightedness) in children.
Myopia (near-sightedness or short-sightedness) is the most common eye problem and is estimated to affect 1.5 billion people (22% of the population). It is a condition of the eye where light focuses in front, instead of on the retina. This causes distant objects to be blurry while close objects appear normal. Other symptoms may include headaches and eye strain. Severe near-sightedness increases the risk of retinal detachment, cataracts, and glaucoma.
The data consist of physiological variables (age, gender, eyeball parameters), environmental variables (time spent on near-work and outdoor activities) as well as hereditary variables (myopic mother and father).
```{r read-in-the-data}
myopia <- readxl::read_excel(
path = "/cloud/project/myopia-data.xlsx",
sheet = "myopia")
```
After importing data, it is a good idea to take an initial "glimpse" and check that the data you imported fits with what you expected to see.
```{r glimpse-the-data}
dplyr::glimpse(myopia)
```
From this command we get a sense of:
+ How many variables/columns are in our data set: 18
+ How many observations/rows there are: 618
+ The names of the variables
+ The variable types: mix of numeric (dbl) and text (chr)
+ And we get to see what some of the values are.
# Categorical variables
We will only examine one type of chart for categorical variables: the bar chart. A bar chart is one of the safest and most easily understood ways to plot categorical data.
## Bar plot of counts
In the univariable exercises, we examined the number of subjects in the data set that had or did not have myopia. We may want to know if more female chilren have mypopia than male children.
As before, we want to use `geom_bar` to make a bar plot, but we need to add to the aesthetic mapping. We define a `fill` aesthetic which tells ggplot how to group the data, but also "fills" in the color.
```{r bar-plot-counts}
ggplot(data = myopia,
mapping = aes(x = myopic,
fill = gender)) +
geom_bar(width = 0.5)
```
We can see which subjects are female and which are male but it's a little hard to get a sense of how they compare. Sometimes its better to place the bars next to each other.
To do this, we add an additional argument to `geom_bar` to change the positions of the bars: `position = position_dodge()`.
```{r dodged-bar-plot-counts}
ggplot(data = myopia,
mapping = aes(x = myopic,
fill = gender)) +
geom_bar(position = position_dodge(),
width = 0.5)
```
Now we can more easily see there are more males than females without myopia, and there are more females than males with myopia.
## Exercise 1
Let's flip the question around. What if we want to see more readily compare the number of females with and without myopia and the number of males with and without myopia.
Hint: take the example above and switch variables given to the `x` aesthetic and the `fill` aesthtic.
Challenge:
+ Show the plot using proportions.
+ Explore myopia by the other categorical variables: `mommy` or `dadmy`.
```{r exercise-1-template, eval=FALSE}
# For bar chart of number or count
ggplot(data = <data>,
mapping = aes(x = <x-variable>,
fill = <fill-variable>)) +
<geom>
```
```{r exercise-1}
```
### Saving
To save your plot, ggplot does make this easy for you to do with the command `ggplot2::ggsave`. When you give this command ggplot will default to the "last plot that you made". To learn more later explore the help with `?ggplot2::ggsave`.
Make sure to give your plot a meaningful name and a file extension which will determined the type of file to save. Examples:
+ "myopic-barplot.png"
+ "myopia-counts.tiff"
+ "my-bars.png"
```{r}
ggplot2::ggsave(filename = "multi-exercise-1-plot.png")
```
# Continous variables
We can also explore continuous variables and their relationship with an addtional variable in similar ways using ggplot. In this section we will revisit the gplts we saw in the univariable exercises:
+ Histograms (`geom_histogram`)
+ Density plots (`geom_density`)
+ Box plots (`geom_boxplot`)
And we will add two other ways to compare continuous variables
+ Jittered plot (`geom_jitter`)
+ Scatter plot (`geom_point`)
## Histograms
Similar to the bar plots, we can add an additional variable to a histogram using a `fill` aesthetic.
Let's see how Anterior chamber depth (`acd`) might differ between those with and without myopia. I use the same binwidth to start that we used in the univariable exercises.
```{r histogram-acd}
ggplot(data = myopia,
mapping = aes(x = acd,
fill = myopic)) +
geom_histogram(binwidth = 0.10)
```
It's interesting to see that the bimodal distribution remains for those without myopia, but we only see one peak for those with mypoia.
## Density plots
Let's try a density plot to compare the distribution of `acd` by `myopic` status. Here I have added an additional argument to the `geom_density` call that let's us see both curves. The `alpha` argument controls how transparent the fill color is.
```{r density-acd}
ggplot(data = myopia,
mapping = aes(x = acd,
fill = myopic)) +
geom_density(alpha = 0.5)
```
We can see similar information to what we saw with the histogram but now shown as a smooth curve. Note that a density plot can also mislead if not careful by showing a smooth continuous distribution when the data may actually be more sparse. It helps to understand the underlying data to guage when one plot might provide better conclusions than the other.
## Boxplots
Once again, let's compare `acd` by `myopic` status but this time using a box plot.
You might recall in the univariable exercises that we provided a place holder to the `x` aesthetic, here we will give a real variable `myopic` (no quotes)
```{r boxplot-acd}
ggplot(data = myopia,
mapping = aes(x = myopic,
y = acd)) +
geom_boxplot(width = 0.5)
```
## Jittered plot
A jittered plot is very similar to a box plot, but it allows you to see the inividual data points rather than the summary that a box plot shows. This can be a powerful and informative plot to use, that lends itself well to smaller data sets.
Syntax is nearly identical to the boxplot but we will use `geom_jitter`
```{r jitterplot-acd}
ggplot(data = myopia,
mapping = aes(x = myopic,
y = acd)) +
geom_jitter(width = 0.25)
```
## Scatter plot
Up to this point, we have examined bi-variable plots that visualize a contiuous variable by a categorical variable. With a scatter plot, you can compare two continous variables and see what type of relationship might exist between the variables.
Let's plot the relationship between Anterior Chamber Depth (`acd`) and Vitreous Chamber Depth (`vcd`).
```{r scatter-plot}
ggplot(data = myopia,
mapping = aes(x = acd,
y = vcd)) +
geom_point()
```
## Exercise 2
Select two variables from the data set and create one of the bivariable plots above. Use the command `dplyr::glimpse(myopia)` to see the variable that are available in our data set.
Be sure to select the appropiate plot type whether you are comparing
+ two categorical variables (bar plot),
+ two continous variables (scatter plot), or
+ One continuous and one categorical (histogram, density, boxplot, jitter)
I suggest to avoid the variables that end in "hr". The hours per week variables are intteger values and are a little funky to try to plot with these methods. In practice, they may be treated as categorical variables or grouped into meaningful categories.
Some ideas:
+ Boxplot (or jitter plot) of `acd` by `age` (note: use `factor(age)` not `age`)
+ Select a plot to compare `acd` (continuous variable) with `mommy` or `dadmy` (categorical)
+ Jitter plot of `age` on the y-axis and `gender` on the x-axis (use `geom_jitter(width = 0.15, height = 0.15)`)
Challenge: try working with the `hr` variables.
```{r exercise-2-template, eval=FALSE}
# histogram / density
ggplot(data = <data>,
mapping = aes(x = <x-variable>,
fill = <fill-variable>)) +
<geom>
# boxplot / jitter / scatter
ggplot(data = <data>,
mapping = aes(x = <x-variable>,
y = <y-variable>)) +
<geom>
```
```{r exercise-2}
```
### Saving
To save your plot, ggplot does make this easy for you to do with the command `ggplot2::ggsave`. When you give this command ggplot will default to the "last plot that you made". To learn more later explore the help with `?ggplot2::ggsave`.
Make sure to give your plot a meaningful name and a file extension which will determined the type of file to save. Examples:
+ "myopic-barplot.png"
+ "myopia-counts.tiff"
+ "my-bars.png"
```{r}
ggplot2::ggsave(filename = "multi-exercise-2-plot.png")
```