This repository has been archived by the owner on Jan 6, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 29
/
Copy pathsparrow925.Rmd
153 lines (120 loc) · 4.41 KB
/
sparrow925.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
---
title: <span style="color:magenta">**Elise's Cool Website**</span>
author: "Elise"
date: "January 18, 2056"
output: html_document
---
```{r SetExpectations, echo=F, results='hide'}
# set working directory if has child directory
dir_child = 'students'
if (dir_child %in% list.files()){
if (interactive()) {
# R Console
setwd(dir_child)
} else {
# knitting
knitr::opts_knit$set(root.dir=dir_child)
}
}
print(getwd()) # Is this in env-info/students?
suppressPackageStartupMessages({
library(readr)
library(dplyr)
library(tidyr)
})
```
### **Content**
I am interested in environmental remediation: pollution modeling and the data analysis that comes with that are my jam. I also think that the tools we are learning in this class are useful in the corporate setting -- performance metrics, etc. could all be run through tools like this.
I also have a [blog](www.elisewall.com) where I post projects and classwork I'm particularly passionate about.

***
## Assignment 3
### **Data Wrangling**
####How does using read_csv from the readr package improve data readability?
Using read.csv (the default) is pretty messy and unhelpful
```{r read.csv, eval=F}
d = read.csv('./data/sparrow925_EUSO2.csv')
head(d)
summary(d)
```
Compare that to using read_csv, which gives you usefull info!
```{r read_csv}
# Oh man the head and summary here are so much more usable
d2 = read_csv('./data/sparrow925_EUSO2.csv')
head(d2)
# summary(d2)
```
Now, lets use dplyr.
Let's ask ourselves, how many cities are measured for each country?
#### Using Multiple Variables (The Old Way)
```{r messyway, eval=F}
# read in csv
d = read_csv('./data/sparrow925_EUSO2.csv')
# limit columns to country and city
d2=d[,c('country_code','city_name')]
# get count per country
d3=aggregate(city_name ~ country_code, data=d2, FUN='length')
# write out csv
write.csv(d3, './data/sparrow925_EUSO2_citycount.csv', row.names = FALSE)
```
#### Using dplyr to Write Compact Code
```{r cleanway, eval=F}
# read in csv
sparrow925_EUSO2_dplyr = read_csv('./data/sparrow925_EUSO2.csv')
sparrow925_EUSO2_dplyr %T>%
glimpse() %>% # view
select(country_code, city_name) %>% # limit columns
group_by(city_name) %>% # get count by grouping
summarize(n=n()) %>% # then summarize
write.csv('./data/sparrow925_EUSO2_citycount.csv')
```
***
## 4. Tidying up Data / Answers and Tasks
_**Task 1**. Convert the following table [CO<sub>2</sub> emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown._
```{r Assn4, eval=F}
library(readxl) # install.packages('readxl')
url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls'
xls = '../data/co2_europa.xls'
# This download seems to corrupt on PCs... look at download.file helpfile, maybe try another method? For now we'll use a workaround I guess.
print(getwd())
if (!file.exists(xls)){
download.file(url, xls)
}
```
Quick workaround (not downloading from online)
```{r}
library(readxl) # install.packages('readxl')
library(readr)
library(dplyr)
library(tidyr)
xls = '../data/co2_europa.xls'
co2=read_excel(xls, skip=12)
#Convert to long format. Display a few rows as example.
task1 = co2 %>%
gather("year", "n", -1)
head(task1)
```
_**Task 2**. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`._
```{r}
task2=co2 %>%
gather("year", "n", -1) %>%
mutate(year = as.numeric(as.character(year))) %>%
arrange(desc(n)) %>%
filter (year==2014) %>%
filter(Country != "World") %>%
filter(Country != "EU28")
head(task2)
```
_**Task 3**. Summarize the total emissions by country (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_.
```{r}
task3=co2 %>%
gather("year", "n", -1) %>%
mutate(year = as.numeric(as.character(year))) %>%
group_by(Country) %>%
summarise(Sum = sum(n)) %>%
arrange(desc(Sum)) %>%
filter(Country != "World") %>%
filter(Country != "EU28")
colnames(task3)[colnames(task3)=="Sum"] <- "Sum of CO2 Emissions, 1970-2014 (kton CO2)"
head(task3)
```