diff --git a/.DS_Store b/.DS_Store index b72cfa4..2c86139 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/.github/.DS_Store b/.github/.DS_Store index 135860d..0270abd 100644 Binary files a/.github/.DS_Store and b/.github/.DS_Store differ diff --git a/_extensions/.DS_Store b/_extensions/.DS_Store index a9373d8..53d5d6a 100644 Binary files a/_extensions/.DS_Store and b/_extensions/.DS_Store differ diff --git a/_extensions/cambiotraining/.DS_Store b/_extensions/cambiotraining/.DS_Store index 8329a97..4603f08 100644 Binary files a/_extensions/cambiotraining/.DS_Store and b/_extensions/cambiotraining/.DS_Store differ diff --git a/_freeze/materials/data-wrangling/execute-results/html.json b/_freeze/materials/data-wrangling/execute-results/html.json index 01151a8..8469d9a 100644 --- a/_freeze/materials/data-wrangling/execute-results/html.json +++ b/_freeze/materials/data-wrangling/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "89c9a5d144fae02a389f64f2a1e39a54", + "hash": "3c8219735b022554c2bf01b336cf5f44", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Data wrangling\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Be able to make changes to variables (columns)\n- Be able to make changes to observations (rows)\n- Implement changes on a grouped basis\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n### Functions\n\n:::\n:::\n\n## Purpose and aim\n\nOften, there is not one single data format that allows you to do all of your analysis. Getting comfortable with making changes to the way your data are organised is an important skill. This is sometimes referred to as 'data wrangling'. 
In this section we'll learn how we can change the organisation of columns, how to add new columns, manipulate rows and perform these operations on subgroups of the data.\n\n## Reading in data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll keep using our data set on Darwin's finches. If you haven't read these data in, please do so with the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches <- read_csv(\"data/finches.csv\")\n```\n:::\n\n:::\n\nCONTENT COMING SOON\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n\n:::\n", + "markdown": "---\ntitle: \"Data wrangling\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n::: {.callout-tip}\n## Learning outcomes\n\n- Be able to make changes to variables (columns)\n- Be able to make changes to observations (rows)\n- Implement changes on a grouped basis\n\n:::\n\n## Libraries and functions\n\n::: {.callout-note collapse=\"true\"}\n## Click to expand\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n### Libraries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### Functions\n\n:::\n:::\n\n## Purpose and aim\n\nOften, there is not one single data format that allows you to do all of your analysis. Getting comfortable with making changes to the way your data are organised is an important skill. This is sometimes referred to as 'data wrangling'. In this section we'll learn how we can change the organisation of columns, how to add new columns, manipulate rows and perform these operations on subgroups of the data.\n\n## Reading in data\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll keep using our data set on Darwin's finches. If you haven't read these data in, please do so with the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches <- read_csv(\"data/finches.csv\")\n```\n:::\n\n:::\n\n## Creating new columns\n\nSometimes you'll have to create new columns in your data set. 
For example, you might have a column that records something in kilograms, but you need it in milligrams. You'd then have to either convert the original column or create a new one with the new data.\n\nLet's see how to do this using the `weight` column from the `finches` data.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe'll use pipes to do this, so we can see what R is doing without immediately updating the data. This is generally a useful technique: check each step one-by-one and after you're happy with the changes, *then* update the table.\n\nTo add a column, we use the `mutate()` function. We first define the name of the *new column*, then tell it what needs to go in it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n mutate(weight_kg = weight / 1000)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 180 × 13\n species group weight wing tarsus blength bdepth bwidth pc1_body pc1_beak\n \n 1 G. fortis Early … 15.8 67.1 19.6 10.3 8.95 8.32 0.382 -0.431\n 2 G. fortis Early … 15.2 66 18.3 10.4 8.7 8.4 -1.06 -0.452\n 3 G. fortis Early … 18.0 68 18.9 11.2 9.6 8.83 0.839 0.955\n 4 G. fortis Early … 18.5 70.3 19.7 11 9.7 8.73 2.16 0.824\n 5 G. fortis Early … 15.7 69 18.9 10.9 9.8 9 0.332 1.08 \n 6 G. fortis Early … 17.8 70.1 19.2 12.7 10.9 9.79 1.50 3.55 \n 7 G. fortis Early … 17.2 69 20.3 11.9 9.8 9 1.86 1.67 \n 8 G. fortis Early … 17.2 68.5 19.2 11.4 9.8 8.6 0.879 1.00 \n 9 G. fortis Early … 16.5 66.3 18.7 9.04 8.42 7.98 -0.227 -1.81 \n10 G. fortis Early … 19.4 69 18.7 11.3 9.6 8.8 1.39 1.00 \n# ℹ 170 more rows\n# ℹ 3 more variables: pc2_beak , is_early , weight_kg \n```\n\n\n:::\n:::\n\n\nYou'll probably notice that our new column isn't visible on screen. This is because we have quite a few columns in our table. We can move the new column to directly after the `weight` column. 
We use the `relocate()` function for this.\n\nWe tell `relocate()` which column we want to move, then use the `.after =` argument to specify where we want to insert the column.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n mutate(weight_kg = weight / 1000) %>% \n relocate(weight_kg, .after = weight)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 180 × 13\n species group weight weight_kg wing tarsus blength bdepth bwidth pc1_body\n \n 1 G. fortis Early… 15.8 0.0158 67.1 19.6 10.3 8.95 8.32 0.382\n 2 G. fortis Early… 15.2 0.0152 66 18.3 10.4 8.7 8.4 -1.06 \n 3 G. fortis Early… 18.0 0.0180 68 18.9 11.2 9.6 8.83 0.839\n 4 G. fortis Early… 18.5 0.0185 70.3 19.7 11 9.7 8.73 2.16 \n 5 G. fortis Early… 15.7 0.0157 69 18.9 10.9 9.8 9 0.332\n 6 G. fortis Early… 17.8 0.0178 70.1 19.2 12.7 10.9 9.79 1.50 \n 7 G. fortis Early… 17.2 0.0172 69 20.3 11.9 9.8 9 1.86 \n 8 G. fortis Early… 17.2 0.0172 68.5 19.2 11.4 9.8 8.6 0.879\n 9 G. fortis Early… 16.5 0.0165 66.3 18.7 9.04 8.42 7.98 -0.227\n10 G. fortis Early… 19.4 0.0194 69 18.7 11.3 9.6 8.8 1.39 \n# ℹ 170 more rows\n# ℹ 3 more variables: pc1_beak , pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\n:::\n\nWe can see that the new column indeed contains the new weight measurements, composed of the original `weight` values divided by 1,000.\n\nNow that we know this gives us the result we want, we can update the original table:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches <- finches %>% \n mutate(weight_kg = weight / 1000) %>% \n relocate(weight_kg, .after = weight)\n```\n:::\n\n\n:::\n\n## Grouping and summarising\n\nA very common technique used in data analysis is the \"split-apply-combine\". This is a three-step process, where we:\n\n1. Split the data into subgroups.\n2. Apply a set of transformations / calculations / ... to each subgroup.\n3. 
Combine the result into a single table.\n\n### Groups\n\nI happen to know that there are two distinct species in this data set. Let's say we're interested in finding out how many observations we have for each species.\n\nThere are two steps to this process:\n\n1. We need to split the data by `species`.\n2. We need to count the number of rows (= observations) in each subgroup.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `group_by()` function to group data by a given variable. Here, we will group the data by `species`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n group_by(species)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 180 × 13\n# Groups: species [2]\n species group weight weight_kg wing tarsus blength bdepth bwidth pc1_body\n \n 1 G. fortis Early… 15.8 0.0158 67.1 19.6 10.3 8.95 8.32 0.382\n 2 G. fortis Early… 15.2 0.0152 66 18.3 10.4 8.7 8.4 -1.06 \n 3 G. fortis Early… 18.0 0.0180 68 18.9 11.2 9.6 8.83 0.839\n 4 G. fortis Early… 18.5 0.0185 70.3 19.7 11 9.7 8.73 2.16 \n 5 G. fortis Early… 15.7 0.0157 69 18.9 10.9 9.8 9 0.332\n 6 G. fortis Early… 17.8 0.0178 70.1 19.2 12.7 10.9 9.79 1.50 \n 7 G. fortis Early… 17.2 0.0172 69 20.3 11.9 9.8 9 1.86 \n 8 G. fortis Early… 17.2 0.0172 68.5 19.2 11.4 9.8 8.6 0.879\n 9 G. fortis Early… 16.5 0.0165 66.3 18.7 9.04 8.42 7.98 -0.227\n10 G. fortis Early… 19.4 0.0194 69 18.7 11.3 9.6 8.8 1.39 \n# ℹ 170 more rows\n# ℹ 3 more variables: pc1_beak , pc2_beak , is_early \n```\n\n\n:::\n:::\n\n\nThis doesn't seem to make much difference, since it's still outputting all of the data. However, if you look closely, you will notice that next to the `A tibble: 180 x 13` text in the top-left corner there is now a `Groups: species [2]` designation. 
What this means is that, behind the scenes, the table is now also split by the `species` variable and that there are two distinct groups in there.\n\nSo, if we want to see how many observations we have in each group we can use the very useful `count()` function. We don't have to specify anything - in this case it just counts the number of rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n group_by(species) %>% \n count()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n# Groups: species [2]\n species n\n \n1 G. fortis 89\n2 G. scandens 91\n```\n\n\n:::\n:::\n\n:::\n\nThere we are, we have two distinct species of finch in these data and they more or less have an equal number of observations.\n\n### Summarising data\n\nQuite often you might find yourself in a situation where you want to get some summary statistics, based on subgroups within the data. Let's see how that works with our data.\n\nWe now know there are two species in our data. Let's imagine we wanted to know the average `weight` for each species.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can use the `summarise()` function to, well, *summarise* data. The first bit indicates the name of the new column that will contain the summarised values. The part after it determines what goes into this column.\n\nHere we want the average weight, so we use `mean(weight)` to calculate this. Let's store this in a column called `avg_weight`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n group_by(species) %>% \n summarise(avg_weight = mean(weight))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n species avg_weight\n \n1 G. fortis 15.8\n2 G. scandens 19.5\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives us a table where we have the average weight for each species. 
We can simply expand this for any other variables, for example:\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# calculate mean, median, minimum and maximum weight per group\nfinches %>% \n group_by(species) %>% \n summarise(avg_weight = mean(weight),\n median_weight = median(weight),\n min_weight = min(weight),\n max_weight = max(weight))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n species avg_weight median_weight min_weight max_weight\n \n1 G. fortis 15.8 15.5 11.6 19.9\n2 G. scandens 19.5 19 15.4 24.4\n```\n\n\n:::\n:::\n\n\n:::\n\n## Reshaping data\n\nWhen you're analysing your data, you'll often find that you will need to structure your data in different ways, for different purposes.\n\nIdeally, you always have the same starting point where:\n\n1. Each column contains a single variable (something you're measuring).\n2. Each row is a single observation (all the measurements belonging to a single unit/person/plant etc).\n\nEven though you might still need to have your data in a different shape, having it like this as a starting point means you can always rework your data.\n\nLet's illustrate this with the following example:\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n species group n\n \n1 G. fortis Early blunt 30\n2 G. fortis Late blunt 30\n3 G. fortis Late pointed 29\n4 G. scandens Early pointed 31\n5 G. scandens Late blunt 30\n6 G. scandens Late pointed 30\n```\n\n\n:::\n:::\n\n\nHere we have count data (number of observations) for each species and group. It's quite a list and you can imagine that if you had many more species then it would become tricky to interpret. So, instead we're going to reshape the this table and have a column for each unique `group` and a row for each `species`.\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\nWe can obtain the data set above by using the `count()` function. 
Here we are counting by two variables: `species` and `group`.\n\nIf we want to reshape the data, we can use the `pivot_*` functions. There are two main ones:\n\n1. `pivot_longer()` creates a 'long' format data set; here each observation is a single row and data is repeated in the first column.\n2. `pivot_wider()` creates a 'wide' format data set; here data is not repeated in the first column.\n\nSo, here we are using the `pivot_wider()` function. We need to tell it where the new column names are going to come from (`names_from =`). We also need to specify where the values are coming from that are going to be used to populate the new table (`values_from =`):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinches %>% \n count(species, group) %>% \n pivot_wider(names_from = group, values_from = n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n species `Early blunt` `Late blunt` `Late pointed` `Early pointed`\n \n1 G. fortis 30 30 29 NA\n2 G. scandens NA 30 30 31\n```\n\n\n:::\n:::\n\n\n:::\n\nThis gives us a 'wide' table, where the original data are split by the type of `group`. We have 4 distinct groups, so we end up with one column for each group plus the original one for `species`.\n\n::: {.callout-note}\n## Long or wide?\n\nDeciding which format to use can sometimes feel a bit tricky. Relating it to plotting can be helpful. Ask yourself the question: \"what is going on the x and y axis?\". 
Each variable that you want to plot on either the x or y axis needs to be in its own column.\n\n:::\n## Exporting data\n\n## Summary\n\n::: {.callout-tip}\n#### Key points\n\n- \n\n:::\n", "supporting": [ "data-wrangling_files" ], diff --git a/materials/data-wrangling.qmd b/materials/data-wrangling.qmd index f1f0887..eebdf14 100644 --- a/materials/data-wrangling.qmd +++ b/materials/data-wrangling.qmd @@ -34,6 +34,12 @@ exec(open('setup-files/setup.py').read()) ## R ### Libraries + +```{r} +#| eval: false +library(tidyverse) +``` + ### Functions ::: @@ -56,7 +62,180 @@ finches <- read_csv("data/finches.csv") ``` ::: -CONTENT COMING SOON +## Creating new columns + +Sometimes you'll have to create new columns in your data set. For example, you might have a column that records something in kilograms, but you need it in milligrams. You'd then have to either convert the original column or create a new one with the new data. + +Let's see how to do this using the `weight` column from the `finches` data. + +::: {.panel-tabset group="language"} +## R + +We'll use pipes to do this, so we can see what R is doing without immediately updating the data. This is generally a useful technique: check each step one-by-one and after you're happy with the changes, *then* update the table. + +To add a column, we use the `mutate()` function. We first define the name of the *new column*, then tell it what needs to go in it. + +```{r} +finches %>% + mutate(weight_kg = weight / 1000) +``` + +You'll probably notice that our new column isn't visible on screen. This is because we have quite a few columns in our table. We can move the new column to directly after the `weight` column. We use the `relocate()` function for this. + +We tell `relocate()` which column we want to move, then use the `.after =` argument to specify where we want to insert the column. 
+ +```{r} +finches %>% + mutate(weight_kg = weight / 1000) %>% + relocate(weight_kg, .after = weight) +``` + +::: + +We can see that the new column indeed contains the new weight measurements, composed of the original `weight` values divided by 1,000. + +Now that we know this gives us the result we want, we can update the original table: + +::: {.panel-tabset group="language"} +## R + +```{r} +finches <- finches %>% + mutate(weight_kg = weight / 1000) %>% + relocate(weight_kg, .after = weight) +``` + +::: + +## Grouping and summarising + +A very common technique used in data analysis is the "split-apply-combine" approach. This is a three-step process, where we: + +1. Split the data into subgroups. +2. Apply a set of transformations / calculations / ... to each subgroup. +3. Combine the result into a single table. + +### Groups + +I happen to know that there are two distinct species in this data set. Let's say we're interested in finding out how many observations we have for each species. + +There are two steps to this process: + +1. We need to split the data by `species`. +2. We need to count the number of rows (= observations) in each subgroup. + +::: {.panel-tabset group="language"} +## R + +We can use the `group_by()` function to group data by a given variable. Here, we will group the data by `species`: + +```{r} +finches %>% + group_by(species) +``` + +This doesn't seem to make much difference, since it's still outputting all of the data. However, if you look closely, you will notice that just below the `A tibble: 180 x 13` text in the top-left corner there is now a `Groups: species [2]` designation. What this means is that, behind the scenes, the table is now also split by the `species` variable and that there are two distinct groups in there. + +So, if we want to see how many observations we have in each group we can use the very useful `count()` function. We don't have to specify anything - in this case it just counts the number of rows. 
+ +```{r} +finches %>% + group_by(species) %>% + count() +``` +::: + +There we are: we have two distinct species of finch in these data, and they have a roughly equal number of observations. + +### Summarising data + +Quite often you might find yourself in a situation where you want to get some summary statistics, based on subgroups within the data. Let's see how that works with our data. + +We now know there are two species in our data. Let's imagine we wanted to know the average `weight` for each species. + +::: {.panel-tabset group="language"} +## R + +We can use the `summarise()` function to, well, *summarise* data. The first bit indicates the name of the new column that will contain the summarised values. The part after it determines what goes into this column. + +Here we want the average weight, so we use `mean(weight)` to calculate this. Let's store this in a column called `avg_weight`. + +```{r} +finches %>% + group_by(species) %>% + summarise(avg_weight = mean(weight)) +``` + +::: + +This gives us a table where we have the average weight for each species. We can easily extend this to calculate other summary statistics, for example: + +::: {.panel-tabset group="language"} +## R + +```{r} +# calculate mean, median, minimum and maximum weight per group +finches %>% + group_by(species) %>% + summarise(avg_weight = mean(weight), + median_weight = median(weight), + min_weight = min(weight), + max_weight = max(weight)) +``` + +::: + +## Reshaping data + +When you're analysing your data, you'll often find that you will need to structure your data in different ways, for different purposes. + +Ideally, you always have the same starting point where: + +1. Each column contains a single variable (something you're measuring). +2. Each row is a single observation (all the measurements belonging to a single unit/person/plant etc). 
+ +Even though you might still need to have your data in a different shape, having it like this as a starting point means you can always rework your data. + +Let's illustrate this with the following example: + +```{r} +#| echo: false +finches %>% + count(species, group) +``` + +Here we have count data (number of observations) for each species and group. It's quite a list and you can imagine that if you had many more species then it would become tricky to interpret. So, instead we're going to reshape this table and have a column for each unique `group` and a row for each `species`. + + +::: {.panel-tabset group="language"} +## R + +We can obtain the data set above by using the `count()` function. Here we are counting by two variables: `species` and `group`. + +If we want to reshape the data, we can use the `pivot_*` functions. There are two main ones: + +1. `pivot_longer()` creates a 'long' format data set; here each observation is a single row and data is repeated in the first column. +2. `pivot_wider()` creates a 'wide' format data set; here data is not repeated in the first column. + +So, here we are using the `pivot_wider()` function. We need to tell it where the new column names are going to come from (`names_from =`). We also need to specify where the values are coming from that are going to be used to populate the new table (`values_from =`): + +```{r} +finches %>% + count(species, group) %>% + pivot_wider(names_from = group, values_from = n) +``` + +::: + +This gives us a 'wide' table, where the original data are split by the type of `group`. We have 4 distinct groups, so we end up with one column for each group plus the original one for `species`. + +::: {.callout-note} +## Long or wide? + +Deciding which format to use can sometimes feel a bit tricky. Relating it to plotting can be helpful. Ask yourself the question: "what is going on the x and y axis?". Each variable that you want to plot on either the x or y axis needs to be in its own column. 
+ +::: +## Exporting data ## Summary