---
title: 'Subsetting columns'
---
```{python}
# | echo: false
# Setup
import pandas as pd
pd.options.display.max_rows = 5
```
## Introduction
Today we will begin our exploration of pandas for data manipulation!
Our first focus will be on selecting and renaming columns. Often your dataset comes with many columns that you do not need, and you would like to narrow it down to just a few. Pandas makes this easy. Let's see how.
## Learning objectives
- You can keep or drop columns from a DataFrame using square brackets `[]`, `filter()`, and `drop()`.
- You can select columns based on regex patterns with `filter()`.
- You can use `rename()` to change column names.
- You can use regex to clean column names.
## About pandas
Pandas is a popular library for data manipulation and analysis. It is designed to make it easy to work with tabular data in Python.
Install pandas with the following command in your terminal if it is not already installed:
```{python}
# | eval: false
pip install pandas
```
Then import pandas with the following command in your script:
```{python}
import pandas as pd
```
## The Yaounde COVID-19 dataset
In this lesson, we analyse results from a COVID-19 survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for antibodies.
You can find out more about this dataset here: [https://www.nature.com/articles/s41467-021-25946-0](https://www.nature.com/articles/s41467-021-25946-0)
To download the dataset, visit this link: [https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip](https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip)
Then unzip the file and place the `yaounde_data.csv` file in the `data` folder in the same directory as your notebook.
Let's load and examine the dataset:
```{python}
yao = pd.read_csv("data/yaounde_data.csv")
yao
```
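Before subsetting, it can help to see exactly which column names are available. This is an optional check, but the `columns` attribute lists them all:
```{python}
yao.columns
```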
## Selecting columns with square brackets `[]`
In pandas, the most common way to select columns is simply to use square brackets `[]` with the column names. For example, to select the `age` and `sex` columns, we type:
```{python}
yao[["age", "sex"]]
```
Note the double square brackets `[[]]`. Without them, you will get an error:
```{python}
# | eval: false
yao["age", "sex"]
```
```
KeyError: ('age', 'sex')
```
If you want to select a single column, you *may omit the double square brackets*, but the output will then be a Series rather than a DataFrame. Compare the following:
```{python}
yao["age"] # does not return a DataFrame
```
```{python}
yao[["age"]] # returns a DataFrame
```
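If you want to confirm the difference yourself, checking the types (an optional aside) makes it explicit:
```{python}
print(type(yao["age"]))  # a pandas Series
print(type(yao[["age"]]))  # a pandas DataFrame
```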
::: {.callout-note title="Key Point"}
## Storing data subsets
Note that these selections are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset with only three columns:
```{python}
yao_subset = yao[["age", "sex", "igg_result"]]
yao_subset
```
And if we want to overwrite a DataFrame, we can assign the subset back to the original DataFrame. Let's overwrite the `yao_subset` DataFrame to have only the `age` column:
```{python}
yao_subset = yao_subset[["age"]]
yao_subset
```
The `yao_subset` DataFrame has gone from having 3 columns to having 1 column.
:::
::: {.callout-tip title="Practice"}
### Practice Q: Select Columns with `[]`
- Use the `[]` operator to select the "weight_kg" and "height_cm" variables in the `yao` DataFrame. Assign the result to a new DataFrame called `yao_weight_height`. Then print this new DataFrame.
```{python}
# Your code here
```
:::
::: {.callout-note title="Pro tip"}
There are many ways to select columns in pandas. In your free time, you may choose to explore the `.loc[]` and `.take()` methods, which provide additional functionality.
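For a taste, here is a minimal sketch (not evaluated here) of how those methods could be used to pull out similar column subsets:
```{python}
# | eval: false
yao.loc[:, ["age", "sex"]]  # label-based column selection with .loc[]
yao.take([1, 2], axis=1)  # position-based selection: the second and third columns
```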
:::
## Excluding columns with `drop()`
Sometimes it is more useful to drop columns you do not need than to explicitly select the ones that you do need.
To drop columns, we can use the `drop()` method with the `columns` argument. To drop the age column, we type:
```{python}
yao.drop(columns=["age"])
```
To drop several columns:
```{python}
yao.drop(columns=["age", "sex"])
```
Again, note that this is not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset with the `age` and `sex` columns dropped:
```{python}
yao_subset = yao.drop(columns=["age", "sex"])
yao_subset
```
::: {.callout-tip title="Practice"}
### Practice Q: Drop Columns with `drop()`
- From the `yao` DataFrame, **remove** the columns `highest_education` and `consultation`. Assign the result to a new DataFrame called `yao_no_education_consultation`. Print this new DataFrame.
```{python}
# Your code here
```
:::
## Using `filter()` to select columns by regex
The `filter()` method and its `regex` argument offer a powerful way to select columns based on patterns in their names. As an example, to select columns containing the string "ig", we can write:
```{python}
yao.filter(regex="ig")
```
The argument `regex` specifies the pattern to match. Regex is short for regular expression: a sequence of characters that defines a search pattern.
To select columns **starting with** the string "ig", we write:
```{python}
yao.filter(regex="^ig")
```
The symbol `^` is a regex character that matches the beginning of the string.
To select columns **ending with** the string "result", we can write:
```{python}
yao.filter(regex="result$")
```
The character `$` is a regex symbol that matches the end of the string.
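These patterns can also be combined: in regex, `|` means "or", so the sketch below (building on the patterns above) keeps columns that either start with "ig" or end with "result":
```{python}
yao.filter(regex="^ig|result$")
```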
::: {.callout-note title="Pro Tip"}
Regex is notoriously difficult to remember, but LLMs like ChatGPT are very good at generating the right patterns. Simply ask, for example, "What is the regex for strings starting with 'ig'?"
:::
::: {.callout-tip title="Practice"}
### Practice Q: Select Columns with Regex
- Select all columns in the `yao` DataFrame that start with "is_". Assign the result to a new DataFrame called `yao_is_columns`. Then print this new DataFrame.
```{python}
# Your code here
```
:::
## Change column names with `rename()`
We can use the `rename()` method to change column names:
```{python}
yao.rename(columns={"age": "patient_age", "sex": "patient_sex"})
```
::: {.callout-tip title="Practice"}
### Practice Q: Rename Columns with `rename()`
- Rename the `age_category` column in the `yao` DataFrame to `age_cat`. Assign the result to a new DataFrame called `yao_age_cat`. Then print this new DataFrame.
```{python}
# Your code here
```
:::
## Cleaning messy column names
For cleaning column names, you can use regular expressions with the `str.replace()` method in pandas.
Here's how you can do it on a test DataFrame with messy column names, that is, names containing spaces, special characters, or other non-alphanumeric characters.
```{python}
test_df = pd.DataFrame(
{"good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
)
test_df
```
Such column names are not ideal because, for example, we cannot select them with the dot operator the way we can for clean names:
```{python}
test_df.good_name # this works
```
But this does not work:
```{python}
# | eval: false
test_df.bad name
```
```
test_df.bad name
^
SyntaxError: invalid syntax
```
We can automatically clean such names using the `str.replace()` method along with regular expressions.
```{python}
clean_names = test_df.columns.str.replace(r'[^a-zA-Z0-9]', '_', regex=True)
```
The regular expression `r'[^a-zA-Z0-9]'` matches any character that is not a letter (uppercase or lowercase) or a digit. The `str.replace()` method replaces each such character with an underscore ('\_'), making the column names legible and usable in dot notation. Because we call it on `test_df.columns`, the replacement is applied to every column name.
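A quick look at the result (an optional check) shows the transformed names:
```{python}
clean_names
```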
Now we can replace the column names in the DataFrame with the cleaned names:
```{python}
test_df.columns = clean_names
test_df
```
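With the cleaned names in place, the dot access that failed earlier now works:
```{python}
test_df.bad_name  # previously "bad name", now accessible with dot notation
```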
::: {.callout-tip title="Practice"}
### Practice Q: Clean Column Names with Regex
- Consider the DataFrame defined below with messy column names. Use the `str.replace()` method to clean the column names.
```{python}
cleaning_practice = pd.DataFrame(
{"Aloha": range(3), "Bell Chart": range(3), "Animals@the zoo": range(3)}
)
cleaning_practice
```
:::
---
## Wrap up
Hopefully this lesson has shown you how intuitive and useful pandas is for data manipulation!
This is the first in a series of lessons on basic data wrangling techniques; see you in the next lesson to learn more.