diff --git a/_freeze/materials/working-with-data/execute-results/html.json b/_freeze/materials/working-with-data/execute-results/html.json index e0b39b4..36f3ffb 100644 --- a/_freeze/materials/working-with-data/execute-results/html.json +++ b/_freeze/materials/working-with-data/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "1a366eb577b18b5cb753f6bf60387ef9", + "hash": "50428d5ac4e398670a85e2eba4d44b8e", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Working with data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Be able to import tabular data\n- Perform basic operations on data\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n:::\n:::\n\n## Purpose and aim\n\nIn this section we're covering the basics of reading in using tabular data.\n\n## Darwin's finches\n\nWe'll look at some data that come from an analysis of gene flow across two finch species [@lamichhaney2020].\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. The measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\n## Reading in data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThere are several functions to read data into R, we're going to use one from the \n`readr` package, which is part of the `tidyverse`. As such, we first need to load \nthe package into R's memory, by using the `library()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nThis command has to be run every time you start a new R session. Typically you \nwant to include the `library()` calls at the top of your script, so that a user \nknows which packages need to be installed to run the analysis.\n\nOur data is provided in CSV format (comma separated values). This format is a \nregular text file, where each value (or column of the table) is separated by a \ncomma. To read such a file, we use the `read_csv()` function, which needs at least \none input: the _path_ of the file we want to read. It is also good practice \nto explicitly define how missing data is encoded in the file with the `na` option. \nIn our case, missing data are encoded as an empty string (imagine this as an empty \ncell in a spreadsheet).\n\nHere's the command:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches <- read_csv(\"data/finches.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 180 Columns: 12\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (2): species, group\ndbl (9): weight, wing, tarsus, blength, bdepth, bwidth, pc1_body, pc1_beak, ...\nlgl (1): is_early\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\nWe see a lot of output there, but this is not an error! It's a message that `read_csv()` \nprints to inform us of what type of data it thinks each column of the data set is. \nWe'll discuss this in a while.\n\nIt's always useful to have a glimpse at the first few rows of your data set, to see how it is structured. We can do that with the `head()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(finches)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n1 G. fortis Early b… 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431\n2 G. fortis Early b… 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452\n3 G. fortis Early b… 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955\n4 G. fortis Early b… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n5 G. fortis Early b… 15.7 69 18.9 10.9 9.8 9 0.332 1.08 \n6 G. fortis Early b… 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55 \n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\n\n### The `data.frame` object\n\nA **data.frame** is the basic type of object that stores _tabular_ data. \nThe `readr` package reads data in an \"extended\" version of a data frame that it \ncalls **tibble** (`tbl` for short). The details of their differences are not very \nimportant unless you are a programmer, but _tibbles_ offer some user conveniences \nsuch as a better printing method. For the rest of the course we'll refer to \n\"data frames\" and \"tibbles\" interchangeably.\n\n:::\n\n## Subsetting data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can subset the data in our `finches` table by **column** or **row**. The `tidyverse` package has a series of useful functions that allow you to do this.\n\n### Subsetting by column\n\nWe can use the `select()` function to select certain columns, for example if we just wanted the `country` and `year` column. The first argument we give to the function is the data set, followed by the name of the columns we want:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(finches, group, wing)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 180 × 2\n group wing\n \n 1 Early blunt 67.1\n 2 Early blunt 66 \n 3 Early blunt 68 \n 4 Early blunt 70.3\n 5 Early blunt 69 \n 6 Early blunt 70.1\n 7 Early blunt 69 \n 8 Early blunt 68.5\n 9 Early blunt 66.3\n10 Early blunt 69 \n# ℹ 170 more rows\n```\n\n\n:::\n:::\n\n\n\n### Subsetting by row\n\nNow let's say we wanted to only keep certain observations - which are organised in rows. Here we can use the `filter()` function. For example, if we only wanted the data for the United Kingdom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(finches, species == \"G. fortis\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 89 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis Early … 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431\n 2 G. fortis Early … 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452\n 3 G. fortis Early … 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955\n 4 G. fortis Early … 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 5 G. fortis Early … 15.7 69 18.9 10.9 9.8 9 0.332 1.08 \n 6 G. fortis Early … 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55 \n 7 G. fortis Early … 17.2 69 20.3 11.9 9.8 9 1.86 1.67 \n 8 G. fortis Early … 17.2 68.5 19.2 11.4 9.8 8.6 0.879 1.00 \n 9 G. fortis Early … 16.5 66.3 18.7 9.04 8.42 7.98 -0.227 -1.81 \n10 G. fortis Early … 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n# ℹ 79 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\nHere we've taken the `finches` data set and we asked R to give us the rows where `species == \"G. fortis\"` is `TRUE`. It goes through all the rows, in this case checking the `species` column. If the statement `species == \"G. fortis\"` is `TRUE`, it returns the row. Otherwise it doesn't.\n\nWe could also use a different conditional statement, for example returning all the rows where the `weight` is larger than 18 grammes:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(finches, weight > 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis Early … 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 2 G. fortis Early … 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n 3 G. fortis Early … 18.0 68.2 18.8 12.3 10.6 9.5 0.826 2.88 \n 4 G. fortis Early … 19.9 67 20 11 10 8.8 2.01 1.07 \n 5 G. fortis Early … 18.4 70.9 20.1 11.4 10.8 10.1 2.57 3.02 \n 6 G. fortis Early … 18.2 68 18.4 10.9 9.7 9.03 0.487 1.03 \n 7 G. fortis Early … 18.4 70 19.7 11.8 10.3 9.4 2.06 2.29 \n 8 G. fortis Early … 19.8 75.6 19.2 12.8 9.3 8.53 3.54 1.45 \n 9 G. fortis Early … 18.8 71 19.2 11.8 9.9 8.5 2.08 1.20 \n10 G. fortis Late b… 19.0 70 19.8 12 11.2 9.9 2.32 3.40 \n# ℹ 63 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n:::\n\n## Chaining commands\n\nSometimes we need to perform many different operations before we have the right data in the correct format that we need. For example, we might want to filter for certain values and then only keep certain columns. We could perform these operations one by one and save the output of each into an object that we then use for the next operation.\n\nBut this is not very efficient. So it can be useful to chain certain operations together, performing them one by one.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can do this with the **pipe**. We'll be using the pipe operator for tidyverse (`%>%`). The pipe always starts with **data**, which it then \"pipes through\" to a function.\n\nLet's look at an example, recreating the `filter()` operation we did earlier, but this time with a pipe:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n filter(weight > 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis Early … 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 2 G. fortis Early … 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n 3 G. fortis Early … 18.0 68.2 18.8 12.3 10.6 9.5 0.826 2.88 \n 4 G. fortis Early … 19.9 67 20 11 10 8.8 2.01 1.07 \n 5 G. fortis Early … 18.4 70.9 20.1 11.4 10.8 10.1 2.57 3.02 \n 6 G. fortis Early … 18.2 68 18.4 10.9 9.7 9.03 0.487 1.03 \n 7 G. fortis Early … 18.4 70 19.7 11.8 10.3 9.4 2.06 2.29 \n 8 G. fortis Early … 19.8 75.6 19.2 12.8 9.3 8.53 3.54 1.45 \n 9 G. fortis Early … 18.8 71 19.2 11.8 9.9 8.5 2.08 1.20 \n10 G. fortis Late b… 19.0 70 19.8 12 11.2 9.9 2.32 3.40 \n# ℹ 63 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\nWhat it's done is taken the `finches` data set and then sent this to the `filter()` function. The function doesn't need the data set specified explicitly, because it knows it is coming from the pipe.\n\nWe can combine this with other functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n filter(weight > 18) %>% \n select(species, weight)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 2\n species weight\n \n 1 G. fortis 18.5\n 2 G. fortis 19.4\n 3 G. fortis 18.0\n 4 G. fortis 19.9\n 5 G. fortis 18.4\n 6 G. fortis 18.2\n 7 G. fortis 18.4\n 8 G. fortis 19.8\n 9 G. fortis 18.8\n10 G. fortis 19.0\n# ℹ 63 more rows\n```\n\n\n:::\n:::\n\n\nHere we've performed the filtering, and then selected the `species` and `weight` columns.\n\n:::\n\nChaining operations can be a very powerful tool, since it allows you to break down a complex operation into smaller steps. This often makes the analysis a lot less daunting!\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Tabular data are an excellent format for programming languages\n- Having variables in columns and observations in rows makes analysis easier\n- We can subset data across columns and rows\n\n:::\n", + "markdown": "---\ntitle: \"Working with data\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Be able to import tabular data\n- Perform basic operations on data\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### Functions\n\nFunctions below are mostly shown in the following way:\n\n`package_name::name_of_function()`\n\nThe reason why we're doing this is two-fold:\n\n1. To make it explicit that functions are often packaged together into 'umbrella' packages. Tidyverse is one of those - it contains many packages such as `tidyr`, `ggplot2`, `readr`. This way it's clear which package each particular function is coming from.\n2. Sometimes the same function name is used across different packages. We'll see that later, where there is a `filter()` function in both the `stats` and `dplyr` packages. Throughout the course the correct one should be loaded automatically, but this way you can always check!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# read in a .csv (comma-separated values) file\nreadr::read_csv()\n\n# show the first 6 rows of a table\nhead()\n\n# select columns in a table\ndplyr::select()\n\n# filter rows in a table\ndplyr::filter()\n```\n:::\n\n\n:::\n:::\n\n## Purpose and aim\n\nIn this section we're covering the basics of reading in using tabular data.\n\n## Darwin's finches\n\nWe'll look at some data that come from an analysis of gene flow across two finch species [@lamichhaney2020].\n\nThe data focus on two species, _Geospiza fortis_ and _G. scandens_. The measurements are split by a uniquely timed event: a particularly strong El Niño event in 1983. This event changed the vegetation and food supply of the finches, allowing F1 hybrids of the two species to survive, whereas before 1983 they could not. The measurements are classed as `early` (pre-1983) and `late` (1983 onwards).\n\n## Reading in data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nThere are several functions to read data into R, we're going to use one from the \n`readr` package, which is part of the `tidyverse`. As such, we first need to load \nthe package into R's memory, by using the `library()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\nThis command has to be run every time you start a new R session. Typically you \nwant to include the `library()` calls at the top of your script, so that a user \nknows which packages need to be installed to run the analysis.\n\nOur data is provided in CSV format (comma separated values). This format is a \nregular text file, where each value (or column of the table) is separated by a \ncomma. To read such a file, we use the `read_csv()` function, which needs at least \none input: the _path_ of the file we want to read. It is also good practice \nto explicitly define how missing data is encoded in the file with the `na` option. \nIn our case, missing data are encoded as an empty string (imagine this as an empty \ncell in a spreadsheet).\n\nHere's the command:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches <- read_csv(\"data/finches.csv\")\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 180 Columns: 12\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (2): species, group\ndbl (9): weight, wing, tarsus, blength, bdepth, bwidth, pc1_body, pc1_beak, ...\nlgl (1): is_early\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n:::\n\n\nWe see a lot of output there, but this is not an error! It's a message that `read_csv()` \nprints to inform us of what type of data it thinks each column of the data set is. \nWe'll discuss this in a while.\n\nIt's always useful to have a glimpse at the first few rows of your data set, to see how it is structured. We can do that with the `head()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(finches)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n1 G. fortis early_b… 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431\n2 G. fortis early_b… 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452\n3 G. fortis early_b… 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955\n4 G. fortis early_b… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n5 G. fortis early_b… 15.7 69 18.9 10.9 9.8 9 0.332 1.08 \n6 G. fortis early_b… 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55 \n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\n\n### The `data.frame` object\n\nA **data.frame** is the basic type of object that stores _tabular_ data. \nThe `readr` package reads data in an \"extended\" version of a data frame that it \ncalls **tibble** (`tbl` for short). The details of their differences are not very \nimportant unless you are a programmer, but _tibbles_ offer some user conveniences \nsuch as a better printing method. For the rest of the course we'll refer to \n\"data frames\" and \"tibbles\" interchangeably.\n\n:::\n\n## Subsetting data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can subset the data in our `finches` table by **column** or **row**. The `tidyverse` package has a series of useful functions that allow you to do this.\n\n### Subsetting by column\n\nWe can use the `select()` function to select certain columns, for example if we just wanted the `country` and `year` column. The first argument we give to the function is the data set, followed by the name of the columns we want:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect(finches, group, wing)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 180 × 2\n group wing\n \n 1 early_blunt 67.1\n 2 early_blunt 66 \n 3 early_blunt 68 \n 4 early_blunt 70.3\n 5 early_blunt 69 \n 6 early_blunt 70.1\n 7 early_blunt 69 \n 8 early_blunt 68.5\n 9 early_blunt 66.3\n10 early_blunt 69 \n# ℹ 170 more rows\n```\n\n\n:::\n:::\n\n\n\n### Subsetting by row\n\nNow let's say we wanted to only keep certain observations - which are organised in rows. Here we can use the `filter()` function. For example, if we only wanted the data for the United Kingdom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(finches, species == \"G. fortis\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 89 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis early_… 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431\n 2 G. fortis early_… 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452\n 3 G. fortis early_… 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955\n 4 G. fortis early_… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 5 G. fortis early_… 15.7 69 18.9 10.9 9.8 9 0.332 1.08 \n 6 G. fortis early_… 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55 \n 7 G. fortis early_… 17.2 69 20.3 11.9 9.8 9 1.86 1.67 \n 8 G. fortis early_… 17.2 68.5 19.2 11.4 9.8 8.6 0.879 1.00 \n 9 G. fortis early_… 16.5 66.3 18.7 9.04 8.42 7.98 -0.227 -1.81 \n10 G. fortis early_… 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n# ℹ 79 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\nHere we've taken the `finches` data set and we asked R to give us the rows where `species == \"G. fortis\"` is `TRUE`. It goes through all the rows, in this case checking the `species` column. If the statement `species == \"G. fortis\"` is `TRUE`, it returns the row. Otherwise it doesn't.\n\nWe could also use a different conditional statement, for example returning all the rows where the `weight` is larger than 18 grammes:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfilter(finches, weight > 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis early_… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 2 G. fortis early_… 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n 3 G. fortis early_… 18.0 68.2 18.8 12.3 10.6 9.5 0.826 2.88 \n 4 G. fortis early_… 19.9 67 20 11 10 8.8 2.01 1.07 \n 5 G. fortis early_… 18.4 70.9 20.1 11.4 10.8 10.1 2.57 3.02 \n 6 G. fortis early_… 18.2 68 18.4 10.9 9.7 9.03 0.487 1.03 \n 7 G. fortis early_… 18.4 70 19.7 11.8 10.3 9.4 2.06 2.29 \n 8 G. fortis early_… 19.8 75.6 19.2 12.8 9.3 8.53 3.54 1.45 \n 9 G. fortis early_… 18.8 71 19.2 11.8 9.9 8.5 2.08 1.20 \n10 G. fortis late_b… 19.0 70 19.8 12 11.2 9.9 2.32 3.40 \n# ℹ 63 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n:::\n\n## Chaining commands\n\nSometimes we need to perform many different operations before we have the right data in the correct format that we need. For example, we might want to filter for certain values and then only keep certain columns. We could perform these operations one by one and save the output of each into an object that we then use for the next operation.\n\nBut this is not very efficient. So it can be useful to chain certain operations together, performing them one by one.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nIn R we can do this with the **pipe**. We'll be using the pipe operator for tidyverse (`%>%`). The pipe always starts with **data**, which it then \"pipes through\" to a function.\n\nLet's look at an example, recreating the `filter()` operation we did earlier, but this time with a pipe:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n filter(weight > 18)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 12\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis early_… 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 2 G. fortis early_… 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n 3 G. fortis early_… 18.0 68.2 18.8 12.3 10.6 9.5 0.826 2.88 \n 4 G. fortis early_… 19.9 67 20 11 10 8.8 2.01 1.07 \n 5 G. fortis early_… 18.4 70.9 20.1 11.4 10.8 10.1 2.57 3.02 \n 6 G. fortis early_… 18.2 68 18.4 10.9 9.7 9.03 0.487 1.03 \n 7 G. fortis early_… 18.4 70 19.7 11.8 10.3 9.4 2.06 2.29 \n 8 G. fortis early_… 19.8 75.6 19.2 12.8 9.3 8.53 3.54 1.45 \n 9 G. fortis early_… 18.8 71 19.2 11.8 9.9 8.5 2.08 1.20 \n10 G. fortis late_b… 19.0 70 19.8 12 11.2 9.9 2.32 3.40 \n# ℹ 63 more rows\n# ℹ 2 more variables: pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\nWhat it's done is taken the `finches` data set and then sent this to the `filter()` function. The function doesn't need the data set specified explicitly, because it knows it is coming from the pipe.\n\nWe can combine this with other functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n filter(weight > 18) %>% \n select(species, weight)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 73 × 2\n species weight\n \n 1 G. fortis 18.5\n 2 G. fortis 19.4\n 3 G. fortis 18.0\n 4 G. fortis 19.9\n 5 G. fortis 18.4\n 6 G. fortis 18.2\n 7 G. fortis 18.4\n 8 G. fortis 19.8\n 9 G. fortis 18.8\n10 G. fortis 19.0\n# ℹ 63 more rows\n```\n\n\n:::\n:::\n\n\nHere we've performed the filtering, and then selected the `species` and `weight` columns.\n\n:::\n\nChaining operations can be a very powerful tool, since it allows you to break down a complex operation into smaller steps. This often makes the analysis a lot less daunting!\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- Tabular data are an excellent format for programming languages\n- Having variables in columns and observations in rows makes analysis easier\n- We can subset data across columns and rows\n\n:::\n", "supporting": [ "working-with-data_files" ], diff --git a/materials/working-with-data.qmd b/materials/working-with-data.qmd index e518cae..2cb57d1 100644 --- a/materials/working-with-data.qmd +++ b/materials/working-with-data.qmd @@ -33,8 +33,38 @@ exec(open('setup-files/setup.py').read()) ## R ### Libraries + +```{r} +#| eval: false +library(tidyverse) +``` + ### Functions +Functions below are mostly shown in the following way: + +`package_name::name_of_function()` + +The reason why we're doing this is two-fold: + +1. To make it explicit that functions are often packaged together into 'umbrella' packages. Tidyverse is one of those - it contains many packages such as `tidyr`, `ggplot2`, `readr`. This way it's clear which package each particular function is coming from. +2. Sometimes the same function name is used across different packages. We'll see that later, where there is a `filter()` function in both the `stats` and `dplyr` packages. Throughout the course the correct one should be loaded automatically, but this way you can always check! + +```{r} +#| eval: false +# read in a .csv (comma-separated values) file +readr::read_csv() + +# show the first 6 rows of a table +head() + +# select columns in a table +dplyr::select() + +# filter rows in a table +dplyr::filter() +``` + ::: :::