Commit c12b0cf

added first pass of 5th section of 1st tutorial
1 parent 85f90e4 commit c12b0cf

1 file changed: 276 additions, 0 deletions

@@ -0,0 +1,276 @@
# Wildcards and Pipes

:::note[Overview]
Questions
- How can I run a command on multiple files at once?
- Is there an easy way of saving a command’s output?

Objectives
- Redirect a command’s output to a file.
- Process a file instead of keyboard input using redirection.
- Construct command pipelines with two or more stages.
- Explain what usually happens if a program or pipeline isn’t given any input to process.
:::

:::warning[Required files]
If you didn’t get them in the last lesson, make sure to download the example files used in the next few sections:
- Using wget: `wget https://nyuhpc.github.io/hpc-shell/files/bash-lesson.tar.gz`
- Using a web browser: https://nyuhpc.github.io/hpc-shell/files/bash-lesson.tar.gz
:::

Now that we know some of the basic UNIX commands, we are going to explore some more advanced features. The first of these features is the wildcard `*`. In our examples so far, we’ve worked on files one at a time and had to name each one explicitly. The `*` character lets us speed things up by acting on many files at once.

Ever wanted to move, delete, or just do “something” to all files of a certain type in a directory? `*` lets you do that by standing in for zero or more characters in a file name. So `*.txt` matches every `.txt` file in a directory, for instance, and `*` by itself matches every file (except hidden files, whose names start with `.`). Let’s use our example data to see this in action.
```bash
$ tar xvf bash-lesson.tar.gz
x dmel-all-r6.19.gtf
x dmel_unique_protein_isoforms_fb_2016_01.tsv
x gene_association.fb
x SRR307023_1.fastq
x SRR307023_2.fastq
x SRR307024_1.fastq
x SRR307024_2.fastq
x SRR307025_1.fastq
x SRR307025_2.fastq
x SRR307026_1.fastq
x SRR307026_2.fastq
x SRR307027_1.fastq
x SRR307027_2.fastq
x SRR307028_1.fastq
x SRR307028_2.fastq
x SRR307029_1.fastq
x SRR307029_2.fastq
x SRR307030_1.fastq
x SRR307030_2.fastq
$ ls
bash-lesson.tar.gz                           SRR307024_2.fastq  SRR307028_1.fastq
dmel_unique_protein_isoforms_fb_2016_01.tsv  SRR307025_1.fastq  SRR307028_2.fastq
dmel-all-r6.19.gtf                           SRR307025_2.fastq  SRR307029_1.fastq
gene_association.fb                          SRR307026_1.fastq  SRR307029_2.fastq
SRR307023_1.fastq                            SRR307026_2.fastq  SRR307030_1.fastq
SRR307023_2.fastq                            SRR307027_1.fastq  SRR307030_2.fastq
SRR307024_1.fastq                            SRR307027_2.fastq
```

Now we have a whole bunch of example files in our directory. For this example we are going to learn a new command that tells us how long a file is: `wc`. `wc -l file` tells us the length of a file in lines.
```bash
$ wc -l dmel-all-r6.19.gtf
542048 dmel-all-r6.19.gtf
```
Interesting, there are over 540,000 lines in our `dmel-all-r6.19.gtf` file. What if we wanted to run `wc -l` on every `.fastq` file? This is where `*` comes in really handy! `*.fastq` matches every file ending in `.fastq`.
```bash
$ wc -l *.fastq
20000 SRR307023_1.fastq
20000 SRR307023_2.fastq
20000 SRR307024_1.fastq
20000 SRR307024_2.fastq
20000 SRR307025_1.fastq
20000 SRR307025_2.fastq
20000 SRR307026_1.fastq
20000 SRR307026_2.fastq
20000 SRR307027_1.fastq
20000 SRR307027_2.fastq
20000 SRR307028_1.fastq
20000 SRR307028_2.fastq
20000 SRR307029_1.fastq
20000 SRR307029_2.fastq
20000 SRR307030_1.fastq
20000 SRR307030_2.fastq
320000 total
```

That was easy. What if we wanted to run the same command on every file in the directory? A nice trick to keep in mind is that `*` by itself matches *every* file (hidden files aside).
```bash
$ wc -l *
53037 bash-lesson.tar.gz
542048 dmel-all-r6.19.gtf
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
106290 gene_association.fb
20000 SRR307023_1.fastq
20000 SRR307023_2.fastq
20000 SRR307024_1.fastq
20000 SRR307024_2.fastq
20000 SRR307025_1.fastq
20000 SRR307025_2.fastq
20000 SRR307026_1.fastq
20000 SRR307026_2.fastq
20000 SRR307027_1.fastq
20000 SRR307027_2.fastq
20000 SRR307028_1.fastq
20000 SRR307028_2.fastq
20000 SRR307029_1.fastq
20000 SRR307029_2.fastq
20000 SRR307030_1.fastq
20000 SRR307030_2.fastq
1043504 total
```
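
A handy way to see what a wildcard will match is to hand the pattern to `echo`, since the shell expands `*` into file names before the command ever runs. The sketch below assumes a scratch directory and made-up file names (`sample_1.fastq`, `notes.txt`, and so on) rather than the lesson data.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"
touch sample_1.fastq sample_2.fastq notes.txt .hidden.fastq   # hypothetical files

# echo just prints its arguments, i.e. whatever the pattern expanded to.
echo *.fastq    # sample_1.fastq sample_2.fastq
echo *_1*       # sample_1.fastq
echo *          # notes.txt sample_1.fastq sample_2.fastq (no .hidden.fastq)
```

Previewing a pattern this way is cheap insurance before using it with a command that changes or deletes files.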

<details>
<summary>
:::info[Multiple wildcards]
You can even use multiple `*`s at a time. How would you run `wc -l` on every file with “fb” in its name?
<br />**[Click for Solution]**
:::
</summary>
:::tip[Solution]
```bash
wc -l *fb*
```
i.e. *anything or nothing*, then `fb`, then *anything or nothing*.
:::
</details>

<details>
<summary>
:::info[Using other commands]
Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our `.fastq` files there with a single `mv` command.
<br />**[Click for Solution]**
:::
</summary>
:::tip[Solution]
```bash
mkdir fastq
mv *.fastq fastq/
```
:::
</details>
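
Because `mv` changes things on disk, it can be worth previewing the expanded command first. The sketch below is one way to do that, assuming a scratch directory and hypothetical file names (`r1.fastq`, `r2.fastq`, `notes.txt`): putting `echo` in front prints the command the shell would run without running it.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"
touch r1.fastq r2.fastq notes.txt   # hypothetical files
mkdir fastq

echo mv *.fastq fastq/   # preview only: prints "mv r1.fastq r2.fastq fastq/"
mv *.fastq fastq/        # the real move
ls fastq                 # lists r1.fastq and r2.fastq
```

Note that `notes.txt` stays where it was, since it does not match `*.fastq`.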

## Redirecting output
Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!

For our first foray into redirecting output, we are going to use the `>` operator to write output to a file. When using `>`, the output of the command on the left of the `>` is written to the file named on the right. The actual syntax looks like `command > filename`.

Let’s try several basic usages of `>`. `echo` simply prints back, or echoes, whatever you type after it.
```bash
$ echo "this is a test"
this is a test
$ echo "this is a test" > test.txt
$ ls
bash-lesson.tar.gz                           fastq
dmel-all-r6.19.gtf                           gene_association.fb
dmel_unique_protein_isoforms_fb_2016_01.tsv  test.txt
$ cat test.txt
this is a test
```

Awesome, let’s try that with a more complicated command, like `wc -l`.

```bash
$ wc -l * > word_counts.txt
wc: fastq: Is a directory
$ cat word_counts.txt
53037 bash-lesson.tar.gz
542048 dmel-all-r6.19.gtf
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
106290 gene_association.fb
1 test.txt
723505 total
```

Notice how we still got some output on the console even though we redirected the output to a file? Our expected output still went to the file, but why did the error message skip the file and appear on the screen instead?

This behaviour is a consequence of how UNIX systems are built. Every UNIX program you run has three standard input/output streams: `stdin`, `stdout`, and `stderr`.

Let’s dissect these three streams of input/output in the command we just ran: `wc -l * > word_counts.txt`
- `stdin` is the input to a program, which by default comes from the keyboard. The command we just ran does not use it: the shell expands `*` into a list of file names and passes them to `wc` as arguments, so `wc` reads those files directly. A program can be fed a file on `stdin` with `<`, as in `wc -l < test.txt`.
- `stdout` contains the actual, expected output. In this case, `>` redirected `stdout` to the file `word_counts.txt`.
- `stderr` typically contains error messages and other information that doesn’t quite fit into the category of “output”. Because `>` redirects only `stdout`, the error message still reached the console. To redirect both `stdout` and `stderr` to the same file in bash, we would use `&>` instead of `>`. (We can redirect just `stderr` using `2>`.)
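
The difference between the two output streams is easier to see with a command that writes to both. The sketch below uses a made-up shell function, `report`, and hypothetical file names in a scratch directory. It also shows `2>&1`, a more portable way of sending `stderr` to wherever `stdout` is going.

```bash
# Hypothetical helper that writes one line to each output stream.
report() {
    echo "normal output"               # written to stdout
    echo "something went wrong" >&2    # written to stderr
}

# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"

report > out.txt 2> err.txt    # stdout and stderr go to separate files
report > both.txt 2>&1         # both streams go to the same file

cat out.txt     # normal output
cat err.txt     # something went wrong
cat both.txt    # both lines
```

Keeping errors in their own file makes it easy to check whether anything went wrong without wading through the normal output.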

Knowing what we know now, let’s try re-running the command and sending all of the output (including the error message) to the same `word_counts.txt` file as before.
```bash
$ wc -l * &> word_counts.txt
```
Notice how there was no output to the console that time. Let’s check that the error message went to the file like we specified.
```bash
$ cat word_counts.txt
53037 bash-lesson.tar.gz
542048 dmel-all-r6.19.gtf
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
106290 gene_association.fb
1 test.txt
7 word_counts.txt
723512 total
```
Success! The `wc: fastq: Is a directory` error message was written to the file. Also, note how the file was silently overwritten by directing output to the same place as before. Sometimes this is not the behaviour we want. How do we append (add) to a file instead of overwriting it?

Appending to a file is done the same way as redirecting output. However, instead of `>`, we use `>>`.
```bash
$ echo "We want to add this sentence to the end of our file" >> word_counts.txt
$ cat word_counts.txt
53037 bash-lesson.tar.gz
542048 dmel-all-r6.19.gtf
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
106290 gene_association.fb
1 test.txt
7 word_counts.txt
723512 total
We want to add this sentence to the end of our file
```
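
To see overwriting and appending side by side, here is a minimal sketch that assumes a scratch directory and a hypothetical file called `log.txt`.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"

echo "first"  > log.txt    # creates log.txt
echo "second" > log.txt    # overwrites it: "first" is gone
echo "third" >> log.txt    # appends a new line at the end

cat log.txt
# second
# third
```

Only the last `>` and everything appended after it survive, which is why `>>` is the safer choice when a file already holds results you want to keep.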

## Chaining commands together
We now know how to redirect `stdout` and `stderr` to files. We can take this a step further and redirect the output (`stdout`) of one command to serve as the input (`stdin`) of the next. To do this, we use the `|` (pipe) operator.

`grep` is an extremely useful command. It finds things for us within files. Basic usage follows the syntax `grep whatToFind fileToSearch` (there are many options for more clever things; see the `man` page). Let’s use `grep` to find all of the entries pertaining to the `Act5C` gene in *Drosophila melanogaster*.
```bash
$ grep Act5C dmel-all-r6.19.gtf
```
The output is nearly unintelligible since there is so much of it. Let’s send the output of that `grep` command to `head` so we can just take a peek at the first line. The `|` operator lets us send output from one command to the next:
```bash
$ grep Act5C dmel-all-r6.19.gtf | head -n 1
X FlyBase gene 5900861 5905399 . + . gene_id "FBgn0000042"; gene_symbol "Act5C";
```
Nice work, we sent the output of `grep` to `head`. Let’s try counting the number of entries for Act5C with `wc -l`. We can do the same trick to send `grep`’s output to `wc -l`:
```bash
$ grep Act5C dmel-all-r6.19.gtf | wc -l
46
```
:::note
This gives the same result as redirecting `grep`’s output to a file and then counting the lines in that file, but without creating the intermediate file.
:::
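
The two approaches can be compared directly. The sketch below assumes a scratch directory and a small made-up file, `genes.txt`, standing in for the real annotation file. The first version also uses `<` to feed a file to `wc` on `stdin`.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"
printf 'Act5C exon\nAct42A gene\nAct5C gene\n' > genes.txt   # hypothetical data

# Two steps: write matches to an intermediate file, then read it back on stdin.
grep Act5C genes.txt > matches.txt
wc -l < matches.txt

# One step: a pipe connects grep's stdout to wc's stdin, with no extra file.
grep Act5C genes.txt | wc -l
```

Both commands report 2 matching lines; the pipe simply skips the leftover `matches.txt`.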

<details>
<summary>
:::info[Writing commands using pipes]
How many files are there in the “fastq” directory we made earlier? (Use the shell to do this.)
<br />**[Click for Solution]**
:::
</summary>
:::tip[Solution]
```bash
ls fastq/ | wc -l
```
When its output goes to a pipe rather than to the terminal, `ls` prints one item per line, so counting lines gives the number of files.
:::
</details>
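
Pipelines are not limited to two commands. As a rough sketch, assuming a scratch directory and hypothetical file names, the three-stage pipeline below counts only the files whose names end in `_1.fastq`.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"
mkdir fastq
touch fastq/a_1.fastq fastq/a_2.fastq fastq/b_1.fastq fastq/b_2.fastq   # hypothetical files

# ls lists the names, grep keeps those ending in _1.fastq, wc -l counts them.
ls fastq/ | grep '_1\.fastq$' | wc -l
```

With the four made-up files above, the pipeline reports 2. Each stage only sees what the previous stage wrote to `stdout`.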

<details>
<summary>
:::info[Reading from compressed files]
Let’s compress one of our files using `gzip`.
```bash
$ gzip gene_association.fb
```
`zcat` acts like `cat`, except that it can read information from `.gz` (compressed) files. Using `zcat`, can you write a command to take a look at the top few lines of the `gene_association.fb.gz` file (without decompressing the file itself)? <br />
**[Click for Solution]**
:::
</summary>
:::tip[Solution]
```bash
zcat gene_association.fb.gz | head
```
or on a Mac:
```bash
zcat < gene_association.fb.gz | head
```
`zcat` works a little differently on Macs, where you need to use `<` to feed the file to `zcat` on `stdin`.<br />
The `head` command without any options shows the first 10 lines of its input.
:::
</details>
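
If you would rather use one spelling on both Linux and macOS, `gzip -dc` (decompress to `stdout`, leaving the file alone) is a possible alternative to `zcat`. The sketch below assumes a scratch directory and a hypothetical file, `numbers.txt`, generated with `seq`.

```bash
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d)
cd "$demo"
seq 1 100 > numbers.txt   # hypothetical data: the numbers 1 to 100, one per line
gzip numbers.txt          # replaces numbers.txt with numbers.txt.gz

gzip -dc numbers.txt.gz | head -n 3   # prints 1, 2 and 3 on separate lines
ls                                    # numbers.txt.gz is still compressed
```

The compressed file is left untouched; only the decompressed stream travels down the pipe.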


:::tip[Key Points]
- The `*` wildcard matches zero or more characters, so a pattern such as `*.fastq` selects every file name that fits it.
- Redirect a command’s output to a file with `>` (overwrite) or `>>` (append).
- Commands can be chained with `|`, which sends one command’s output to the next command’s input.
:::
