---
title: "R Markdown Assignment, Week 2-4"
author: "Jennifer Kidson"
date: "January 15, 2016"
output:
  html_document:
    toc: true
    toc_depth: 2
---

#### Managing the working directory in R Markdown
```{r setwd students, eval=T, echo=T, include=T}
# set working directory to the child directory, if it exists
dir_child = 'students' 
if (dir_child %in% list.files()){
  if (interactive()) {  
    # R Console
    setwd(dir_child)
  } else {              
    # knitting
    knitr::opts_knit$set(root.dir=dir_child)  
  }
}
```

## 1. Reproducible Science Tools and 2. Programming Concepts
### Content
        
For this class, the group will be working to create a program to take in survey data in real time, reorganize it, and output summary statistics. For my group project (on Santa Barbara water supply) and future work, I'm most interested in learning data management strategies and ways to visualize data. 
        
### Techniques
        
1. Wrangling data
2. Cleaning up data
3. Visualizing data
    + **interactive maps**
    + *interactive time series*
        
### Data
        
A group member has survey response data on solid waste. The responses arrive as text messages with associated timestamps. Each response needs to be tied to a specific survey question and organized.
  
### Image
![](images/jkidson_age_hist.png)

<!--- hist(ugandasms$age, main="Age Distribution of Survey Respondents", xlab="Age (Years)", xlim=c(0, 80))-->

Fig. 1. Histogram of age distribution of survey respondents.

### Summary of Survey Response Data

```{r read csv traditional}
# read csv
d = read.csv('data/jkidson_ugandasms.csv')
      
# output summary
summary(d)
```

### [Citizen Monitoring Github Page](https://github.com/citizen-monitoring)

## 3. Data Wrangling

### Reading CSV with readr
```{r read csv with readr} 
library(readr)

d = read_csv('data/jkidson_ugandasms.csv')
head(d)
summary(d)
```

### dplyr Demo

#### Pseudocode
```{r pseudocode}
# read in csv
# view data
# limit columns to Zone and satis
# limit rows to just zone "CORNER"
# get count for each satisfaction level
# write out csv
```

#### Multiple Variables
```{r multiple variables}
# read in csv
surveys = read.csv('data/jkidson_ugandasms.csv') 

# view data
head(surveys)
summary(surveys)

# limit columns to Zone and satis
surveys_2 = surveys[,c('Zone', 'satis')]

# limit rows to just zone "CORNER"
surveys_3 = surveys_2[surveys_2$Zone  == 'CORNER',]

# get count for each satisfaction level
surveys_4 = aggregate(Zone ~ satis, data=surveys_3, FUN='length')

# write to csv
write.csv(surveys_4, 'data/surveys_jkidson.csv', row.names = FALSE)
```

#### Nested Functions 

(This nested version originally gave me an error because the row filter compared `satis`, rather than `Zone`, to 'CORNER'. That most likely matches no rows, and `aggregate()` stops when there are no rows to aggregate. Filtering on `Zone` fixes it.)
```{r nested function}
# read in data
surveys = read.csv('data/jkidson_ugandasms.csv') 

# view data
head(surveys)
summary(surveys)

# limit data with [], aggregate to count, write to csv
write.csv(
  aggregate(
    Zone ~ satis, 
    data = surveys[surveys$Zone == 'CORNER', c('Zone', 'satis')], 
    FUN = 'length'), 
  'data/surveys_jkidson.csv',
  row.names = FALSE)
```

#### Using dplyr
```{r load readr and dplyr}
# load libraries
library(readr)
library(dplyr)
library(magrittr) # provides the tee operator %T>% used below
```

```{r dplyr example}
# read in csv
surveys = read_csv('data/jkidson_ugandasms.csv') 

# dplyr elegance
surveys %T>%                            # note tee operator %T>% for glimpse
  glimpse() %>%                         # view data
  select(Zone, satis) %>%               # limit columns
  filter(Zone  == 'CORNER') %>%         # limit rows
  group_by(satis) %>%                   # get count by first grouping
  summarize(n = n()) %>%                # then summarize
  write_csv('data/surveys_jkidson.csv') # write out csv
```

## 4. Tidying Up Data

### EDAWR

```{r EDAWR, eval=F}
install.packages("devtools")
devtools::install_github("rstudio/EDAWR")
library(EDAWR)
help(package='EDAWR')
?storms    # wind speed data for 6 hurricanes
?cases     # subset of WHO tuberculosis
?pollution # pollution data from WHO Ambient Air Pollution, 2014
?tb        # tuberculosis data
View(storms)
View(cases)
View(pollution)
```

### slicing

```{r traditional R slicing, eval=F}
# storms
storms$storm
storms$wind
storms$pressure
storms$date

# cases
cases$country
names(cases)[-1]
unlist(cases[1:3, 2:4])

# pollution
pollution$city[c(1,3,5)]
pollution$amount[c(1,3,5)]
pollution$amount[c(2,4,6)]

# ratio
storms$pressure / storms$wind
```

## tidyr

Two main functions: gather() and spread() 

```{r tidyr, eval=F}
# install.packages("tidyr")
library(tidyr)
?gather # gather to long
?spread # spread to wide
```
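
As a minimal sketch of the round trip between the two functions, the toy table below is gathered to long format and then spread back to wide. The data frame `wide` and its values are hypothetical, made up for illustration, and are not the EDAWR data.

```r
library(tidyr)

# hypothetical toy table: one row per country, one column per year
wide = data.frame(
  country = c('A', 'B'),
  `2011`  = c(10, 20),
  `2012`  = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE)

# gather the year columns into key/value pairs (wide to long)
long = gather(wide, "year", "n", -country)

# spread the key/value pairs back into one column per year (long to wide)
wide_again = spread(long, year, n)

long
wide_again
```

The long table has one row per country and year, which is the layout the dplyr verbs later in this document expect.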

### `gather`

```{r gather, eval=F}
# equivalent call by column position: gather(cases, "year", "n", 2:4)
# a negative selection excludes columns: -2 would leave out the 2011 column
# and gather country along with the other years, which is not what we want,
# so exclude country by name instead
cases %>%
  gather("year", "n", -country) %>%
  # France and US for 2011 and 2013
  filter(
    year %in% c(2011,2013) &
    country %in% c('FR','US'))
```

### `spread`

```{r spread, eval=F}
pollution
spread(pollution, size, amount)
```

Other functions to extract and combine columns...

### `separate`

```{r separate, eval=F}
storms
storms2 = separate(storms, date, c("year", "month", "day"), sep = "-")
library(stringr)
# Splitting date strings!!!
#storms%>%
#  mutate(date_str = as.character(date)) %>%
#  separate(date_str, c("year","month","day"), sep = "-")
```

### `unite`

```{r unite, eval=F}
storms2
unite(storms2, "date", year, month, day, sep = "-")
```

**Recap: tidyr**:

- A package that reshapes the layout of data sets.

- Make observations from variables with `gather()`

- Make variables from observations with `spread()`

- Split and merge columns with `separate()` and `unite()`
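
The split and merge steps can be tried without installing EDAWR. This is a small sketch on a hypothetical data frame `d` (the storm names and dates are made up), assuming the dates are stored as text:

```r
library(tidyr)

# hypothetical toy data with dates stored as text
d = data.frame(
  storm = c('S1', 'S2'),
  date  = c('2000-08-12', '1998-07-30'),
  stringsAsFactors = FALSE)

# split date into three columns at each "-"
parts = separate(d, date, c('year', 'month', 'day'), sep = '-')

# merge the three columns back into a single date column
joined = unite(parts, 'date', year, month, day, sep = '-')

parts
joined
```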

See also the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf).

### tidy CO<sub>2</sub> emissions

_**Task**. Convert the following table [CO<sub>2</sub> emissions per country since 1970](http://edgar.jrc.ec.europa.eu/overview.php?v=CO2ts1990-2014&sort=des9) from wide to long format and output the first few rows into your Rmarkdown. I recommend consulting `?gather` and you should have 3 columns in your output._

```{r read co2}
library(readxl) # install.packages('readxl')

url = 'http://edgar.jrc.ec.europa.eu/news_docs/CO2_1970-2014_dataset_of_CO2_report_2015.xls'
xls = '../data/co2_europa.xls'

print(getwd())
if (!file.exists(xls)){
  # binary mode so the xls file is not corrupted when downloading on Windows
  download.file(url, xls, mode = 'wb')
}
co2 = read_excel(xls, skip=12)
co2
```

_**Question**. Why use `skip=12` argument in `read_excel()`?_

Everything before row 13 of the spreadsheet is separate from the actual data table, so `skip=12` makes `read_excel()` jump over the first 12 rows and start reading at row 13.
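
The same idea can be sketched without the spreadsheet. The example below uses base R's `read.csv()` on a hypothetical block of text (the contents of `txt` are made up) where two lines of notes sit above the real table:

```r
# hypothetical file contents: two lines of notes, then the real table
txt = c(
  'Emissions report',
  'Source: example only',
  'country,2013,2014',
  'A,10,11',
  'B,20,21')

# skip = 2 jumps over the notes so the third line is read as the header
d_skip = read.csv(
  text = paste(txt, collapse = '\n'),
  skip = 2,
  check.names = FALSE)

d_skip
```

Without `skip`, the notes would be read as the header row and the table would come out misaligned.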

## dplyr

A package that helps transform tabular data

```{r dplyr, eval=F}
# install.packages("dplyr")
library(dplyr)
?select
?filter
?arrange
?mutate
?group_by
?summarise
```

See sections in the [data-wrangling-cheatsheet.pdf](./refs/cheatsheets/data-wrangling-cheatsheet.pdf):

- Subset Variables (Columns), eg `select()`
- Subset Observations (Rows), eg `filter()`
- Reshaping Data - Change the layout of a data set, eg `arrange()`
- Make New Variables, eg `mutate()`
- Group Data, eg `group_by()` and `summarise()`

### `select`

```{r select, eval=F}
storms
select(storms, storm, pressure)
storms %>% select(storm, pressure)
```

### `filter`

```{r filter, eval=F}
storms
filter(storms, wind >= 50)
storms %>% filter(wind >= 50)

storms %>%
  filter(wind >= 50) %>%
  select(storm, pressure)
```

### `mutate`

```{r mutate, eval=F}
storms %>%
  mutate(ratio = pressure / wind) %>%
  select(storm, ratio)
```

### `group_by`

```{r group_by, eval=F}
pollution
pollution %>% group_by(city)
```

### `summarise`

```{r summarise, eval=F}
# by city
pollution %>% 
  group_by(city) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())

# by size
pollution %>% 
  group_by(size) %>%
  summarise(
    mean = mean(amount), 
    sum = sum(amount), 
    n = n())
```

Note that `summarize()` is a synonym for `summarise()`, so either spelling works.
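
A quick check of the two spellings on a hypothetical toy data frame (`toy` and its values are made up for illustration):

```r
library(dplyr)

# hypothetical toy data
toy = data.frame(
  g = c('a', 'a', 'b'),
  x = c(1, 2, 3),
  stringsAsFactors = FALSE)

# the same grouped sum, spelled both ways
by_s = toy %>% group_by(g) %>% summarise(total = sum(x))
by_z = toy %>% group_by(g) %>% summarize(total = sum(x))

# compare the two results as plain data frames
identical(as.data.frame(by_s), as.data.frame(by_z))
```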

### `ungroup`

```{r ungroup, eval=F}
pollution %>% 
  group_by(size)

pollution %>% 
  group_by(size) %>%
  ungroup()
```

### multiple groups

```{r multiple groups, eval=F}
tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases)) %>%  # sums within country and year, drops the year grouping
  summarise(cases = sum(cases))      # sums within country
```

**Recap: dplyr**:

- Extract columns with `select()` and rows with `filter()`

- Sort rows by column with `arrange()`

- Make new columns with `mutate()`

- Group rows by column with `group_by()` and `summarise()`
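
The verbs in this recap chain together with `%>%`. The sketch below runs them on a hypothetical data frame `readings`, shaped like the `pollution` table but with made-up values, and includes `arrange()`, which has no example of its own above:

```r
library(dplyr)

# hypothetical toy data shaped like the pollution table
readings = data.frame(
  city   = c('X', 'X', 'Y', 'Y'),
  size   = c('large', 'small', 'large', 'small'),
  amount = c(23, 14, 22, 16),
  stringsAsFactors = FALSE)

result = readings %>%
  select(city, amount) %>%              # keep two columns
  filter(amount > 15) %>%               # keep rows with amount above 15
  mutate(doubled = amount * 2) %>%      # make a new column
  group_by(city) %>%                    # group rows by city
  summarise(total = sum(doubled)) %>%   # one row per city
  arrange(desc(total))                  # sort, largest total first

result
```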


### summarize CO<sub>2</sub> emissions

 
## Assignment 4

### Task 1

_**Task**. Report the top 5 emitting countries (not World or EU28) for 2014 using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_.

```{r Assignment 4 Task 1}
library(dplyr)
library(tidyr)

# rename columns to get rid of extra decimal places in the year names
names(co2) = c('country', 1970:2014)
co2

co2 %>%
  gather("year", "n", 2:46) %>%                  # wide to long: country, year, n
  filter(year == 2014) %>%                       # keep 2014 only
  filter(!country %in% c("World", "EU28")) %>%   # drop the aggregate rows
  arrange(desc(n)) %>%                           # largest emitters first
  head(n = 5)
```

### Task 2

_**Task**. Summarize the total emissions by country  (not World or EU28) across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_.

```{r Assignment 4 Task 2}
co2 %>%
  gather("year", "n", 2:46) %>%                  # wide to long: country, year, n
  filter(!country %in% c("World", "EU28")) %>%   # drop the aggregate rows
  group_by(country) %>%
  summarise(sum = sum(n)) %>%                    # total emissions across all years
  arrange(desc(sum)) %>%
  head(n = 5)                                    # top 5 (head alone returns 6 rows)
```
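
Both tasks need to exclude the `World` and `EU28` rows. Below is a minimal base R sketch, with a hypothetical `country` vector, of why `%in%` is the safer test for this than `!=` against a vector of names:

```r
# hypothetical country values, with 'World' appearing twice
country = c('World', 'EU28', 'China', 'World')
drop    = c('World', 'EU28')

# != recycles drop along country and compares position by position:
# 'World' vs 'World', 'EU28' vs 'EU28', 'China' vs 'World', 'World' vs 'EU28'
keep_ne = country != drop

# %in% tests each country against the whole set of names
keep_in = !(country %in% drop)

country[keep_ne]  # the second 'World' slips through
country[keep_in]  # only 'China' remains
```

With `!=`, whether a row is dropped depends on its position in the table, so aggregate rows can survive the filter unnoticed.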

## joining data

### `bind_cols`

```{r bind_cols, eval=F}
y = data.frame(
  x1 = c('A','B','C'), 
  x2 = c( 1 , 2 , 3), 
  stringsAsFactors=F)
z = data.frame(
  x1 = c('B','C','D'), 
  x2 = c( 2 , 3 , 4), 
  stringsAsFactors=F)
y
z
bind_cols(y, z)
```

### `bind_rows`

```{r bind_rows, eval=F}
y
z
bind_rows(y, z)
```

### `union`

```{r union, eval=F}
y
z
union(y, z)
```

### `intersect`

```{r intersect, eval=F}
y
z
intersect(y, z)
```

### `setdiff`

```{r setdiff, eval=F}
y
z
setdiff(y, z)
```
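
To compare the binding and set operations side by side, this sketch redefines the same `y` and `z` as above so it runs on its own, then counts the rows each one returns. The object names `rows_bound`, `rows_union`, `rows_both` and `rows_only_y` are just labels chosen for this example.

```r
library(dplyr)

y = data.frame(
  x1 = c('A','B','C'), 
  x2 = c( 1 , 2 , 3), 
  stringsAsFactors=F)
z = data.frame(
  x1 = c('B','C','D'), 
  x2 = c( 2 , 3 , 4), 
  stringsAsFactors=F)

rows_bound  = bind_rows(y, z)   # stacks everything, duplicates included
rows_union  = union(y, z)       # rows in y or z, duplicates removed
rows_both   = intersect(y, z)   # rows in both y and z
rows_only_y = setdiff(y, z)     # rows in y but not in z

c(bind_rows = nrow(rows_bound),
  union     = nrow(rows_union),
  intersect = nrow(rows_both),
  setdiff   = nrow(rows_only_y))
```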

### `left_join`

```{r left_join, eval=F}
songs = data.frame(
  song = c('Across the Universe','Come Together', 'Hello, Goodbye', 'Peggy Sue'),
  name = c('John','John','Paul','Buddy'), 
  stringsAsFactors=F)
artists = data.frame(
  name = c('George','John','Paul','Ringo'),
  plays = c('sitar','guitar','bass','drums'), 
  stringsAsFactors=F)
left_join(songs, artists, by='name')
```

### `inner_join`

```{r inner_join, eval=F}
inner_join(songs, artists, by = "name")
```

### `semi_join`

```{r semi_join, eval=F}
semi_join(songs, artists, by = "name")
```

### `anti_join`

```{r anti_join, eval=F}
anti_join(songs, artists, by = "name")
```
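
The four joins differ in which rows and columns they keep. This is a small sketch on two hypothetical tables, `left` and `right` (names and values made up), that share a `key` column:

```r
library(dplyr)

# hypothetical tables: 'a' appears only in left, 'd' only in right
left  = data.frame(key = c('a', 'b', 'c'), x = 1:3, stringsAsFactors = FALSE)
right = data.frame(key = c('b', 'c', 'd'), y = 4:6, stringsAsFactors = FALSE)

j_left  = left_join(left, right, by = 'key')   # all rows of left, y is NA where no match
j_inner = inner_join(left, right, by = 'key')  # only rows with a match in both
j_semi  = semi_join(left, right, by = 'key')   # rows of left with a match, no columns from right
j_anti  = anti_join(left, right, by = 'key')   # rows of left without a match

j_left
j_inner
j_semi
j_anti
```

`semi_join()` and `anti_join()` only filter `left`, so they are handy for checking which keys will or will not match before doing a real join.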

### summarize per capita CO<sub>2</sub> emissions 

You'll join the [gapminder](https://github.com/jennybc/gapminder) datasets to get world population per country.

_**Task**. Report the top 5 emitting countries (not World or EU28) per capita using your long format table. (You may need to convert your year column from factor to numeric, eg `mutate(year = as.numeric(as.character(year)))`. As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 

_**Task**. Summarize the total emissions by country  (not World or EU28) per capita across years from your long format table and return the top 5 emitting countries. (As with most analyses, there are multiple ways to do this. I used the following functions: `filter`, `arrange`, `desc`, `head`)_. 

```{r gapminder, eval=F}
library(gapminder) # install.packages('gapminder')
```
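
One possible shape for the per capita calculation is sketched below, not as a finished answer to the tasks. It uses two hypothetical tables with made-up values: `emissions` stands in for the long format CO<sub>2</sub> table, and `population` stands in for the `country`, `year` and `pop` columns of the gapminder data. With the real data, country names may not match exactly between the two sources and gapminder only covers certain years, so the join may drop rows.

```r
library(dplyr)

# hypothetical long-format emissions (stand-in for the gathered co2 table)
emissions = data.frame(
  country = c('A', 'B', 'C'),
  year    = c(2007, 2007, 2007),
  n       = c(100, 300, 50),
  stringsAsFactors = FALSE)

# hypothetical populations (stand-in for gapminder's country, year, pop)
population = data.frame(
  country = c('A', 'B', 'C'),
  year    = c(2007, 2007, 2007),
  pop     = c(10, 100, 2),
  stringsAsFactors = FALSE)

per_capita = emissions %>%
  inner_join(population, by = c('country', 'year')) %>%  # match on country and year
  mutate(n_per_capita = n / pop) %>%                     # emissions per person
  arrange(desc(n_per_capita)) %>%                        # largest first
  head(n = 5)

per_capita
```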
