layout | title |
---|---|
presentation |
Week 2, session 2: more string handling, basic debugging |
class: title
5CCYB041
We continue working on our DNA shotgun sequencing project
You can find the most up to date version in the project's solution/
folder
.explain-bottom[ Make sure your code is up to date now! ]
layout: true
Let's imagine that this is the current estimate of the sequence:
CTGAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGCAGACG
and this is a candidate fragment:
CCCTCATCACAACTGAATGCTTGGGCTGAA
We could compute the overlap by comparing the overlapping region, starting from the smallest overlap:
`C`TGAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGGCTGA`A`
`CT`GAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGGCTG`AA`
`CTG`AATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGGCT`GAA`
`CTGA`ATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGGC`TGAA`
`CTGAA`TGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGG`CTGAA`
We have a match with an overlap of 5 bases!
- should we stop now?
`CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAA`CTGAATGCTTGGGCTGAA`
No! If we carry on, we'll find a larger overlap
layout: true
A better strategy is to start with the biggest overlap first, and gradually reduce it
- that way, if we do find a match, we can stop immediately
`CTGAATGCTTGGGCTGAAAGGGCGCGAGAC`GTATTCCCCGGTTGC
`CCCTCATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCGCGAGA`CGTATTCCCCGGTTGC
C`CCTCATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCGCGAG`ACGTATTCCCCGGTTGC
CC`CTCATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCGCGA`GACGTATTCCCCGGTTGC
CCC`TCATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCGCG`AGACGTATTCCCCGGTTGC
CCCT`CATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCGC`GAGACGTATTCCCCGGTTGC
CCCTC`ATCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGCG`CGAGACGTATTCCCCGGTTGC
CCCTCA`TCACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGGC`GCGAGACGTATTCCCCGGTTGC
CCCTCAT`CACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGGG`CGCGAGACGTATTCCCCGGTTGC
CCCTCATC`ACAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAGG`GCGCGAGACGTATTCCCCGGTTGC
CCCTCATCA`CAACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAAG`GGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCAC`AACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAAA`GGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACA`ACTGAATGCTTGGGCTGAA`
`CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAA`CTGAATGCTTGGGCTGAA`
Now as soon as we find an overlap, it is guaranteed to be the largest
layout: false
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
--
.explain-bottom[ Exercise: implement code to do this ]
We create a header file overlap.h
to hold our function declaration
#pragma once
#include <string>
int compute_overlap (const std::string& sequence, const std::string& fragment);
We create a header file overlap.h
to hold our function declaration
#pragma once
#include <string>
`int` compute_overlap (const std::string& sequence, const std::string& fragment);
- we return an integer representing the number of bases in the overlap
We create a header file overlap.h
to hold our function declaration
#pragma once
#include <string>
int `compute_overlap` (const std::string& sequence, const std::string& fragment);
- we return an integer representing the number of bases in the overlap
- we chose a suitable name for our function that clearly states its purpose
We create a header file overlap.h
to hold our function declaration
#pragma once
#include <string>
int compute_overlap (`const std::string& sequence, const std::string& fragment`);
- we return an integer representing the number of bases in the overlap
- we chose a suitable name for our function that clearly states its purpose
- we pass both the sequence and candidate fragment as const references to strings
We create a header file overlap.h
to hold our function declaration
#pragma once
*#include <string>
int compute_overlap (const std::string& sequence, const std::string& fragment);
- we return an integer representing the number of bases in the overlap
- we chose a suitable name for our function that clearly states its purpose
- we pass both the sequence and candidate fragment as const references to strings
- we need to
#include
the<string>
header since we are usingstd::string
We create a header file overlap.h
to hold our function declaration
*#pragma once
#include <string>
int compute_overlap (const std::string& sequence, const std::string& fragment);
- we return an integer representing the number of bases in the overlap
- we chose a suitable name for our function that clearly states its purpose
- we pass both the sequence and candidate fragment as const references to strings
- we need to
#include
the<string>
header since we are usingstd::string
- we use the
#pragma once
directive to avoid double-inclusion
Next, we create the overlap.cpp
file to hold our function definition
#include <iostream>
#include <stdexcept>
#include <string>
#include "overlap.h"
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
// this is where our code will go
}
Next, we create the overlap.cpp
file to hold our function definition
#include <iostream>
#include <stdexcept>
#include <string>
#include "overlap.h"
*int compute_overlap (const std::string& sequence, const std::string& fragment)
{
// this is where our code will go
}
- we need to replicate the function declaration, making sure it matches exactly
Next, we create the overlap.cpp
file to hold our function definition
#include <iostream>
#include <stdexcept>
#include <string>
*#include "overlap.h"
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
// this is where our code will go
}
- we need to replicate the function declaration, making sure it matches exactly
- we need to
#include
the matching header file
Next, we create the overlap.cpp
file to hold our function definition
*#include <iostream>
*#include <stdexcept>
*#include <string>
#include "overlap.h"
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
// this is where our code will go
}
- we need to replicate the function declaration, making sure it matches exactly
- we need to
#include
the matching header file - we need to
#include
the system headers for the functionality we will be using
Next, we create the overlap.cpp
file to hold our function definition
#include <iostream>
#include <stdexcept>
#include <string>
#include "overlap.h"
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
* // this is where our code will go
}
- we need to replicate the function declaration, making sure it matches exactly
- we need to
#include
the matching header file - we need to
#include
the system headers for the functionality we will be using - now we can worry about what will go in the body of our function
Now we fill out the body of our compute_overlap()
function
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
...
return 0;
}
Now we fill out the body of our compute_overlap()
function
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
* if (fragment.size() > sequence.size())
* throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
...
return 0;
}
- always check any assumptions and expectations upfront – this can save you a lot of frustration later...
Now we fill out the body of our compute_overlap()
function
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
* std::cerr << "sequence is \"" << sequence << "\"\n";
* std::cerr << "trying fragment \"" << fragment << "\"\n";
...
return 0;
}
- always check any assumptions and expectations upfront – this can save you a lot of frustration later...
- during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.
Now we fill out the body of our compute_overlap()
function
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
...
* return 0;
}
- always check any assumptions and expectations upfront – this can save you a lot of frustration later...
- during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.
- we need to return an integer, so let's return a value of zero (no overlap) by default
Now we fill out the body of our compute_overlap()
function
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
* ...
return 0;
}
- always check any assumptions and expectations upfront – this can save you a lot of frustration later...
- during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.
- we need to return an integer, so let's return a value of zero (no overlap) by default
.explain-bottom[ Now we can focus on what happens here... ]
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
}
* for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
* }
We iterate over the range of valid overlaps:
- starting from the largest possible overlap (the size of the candidate fragment)
- reducing the size of the overlap at each iteration
- until the overlap is zero
for (int overlap = fragment.size(); overlap > 0; --overlap) {
* const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
}
We iterate over the range of valid overlaps:
- starting from the largest possible overlap (the size of the candidate fragment)
- reducing the size of the overlap at each iteration
- until the overlap is zero
We extract the part of the current sequence that corresponds to that overlap
- to do this, we use the string
substr()
method
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
* const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
}
We iterate over the range of valid overlaps:
- starting from the largest possible overlap (the size of the candidate fragment)
- reducing the size of the overlap at each iteration
- until the overlap is zero
We extract the part of the current sequence that corresponds to that overlap
- to do this, we use the string
substr()
method - and likewise for the candidate sequence – though it is now the end of the string
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
* if (seq_start == frag_end)
* return overlap;
}
We iterate over the range of valid overlaps:
- starting from the largest possible overlap (the size of the candidate fragment)
- reducing the size of the overlap at each iteration
- until the overlap is zero
We extract the part of the current sequence that corresponds to that overlap
- to do this, we use the string
substr()
method - and likewise for the candidate sequence – though it is now the end of the string
Finally, we check whether both substrings match
- If so, we can immediately return the size of the current overlap
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = `fragment.substr(fragment.size()-overlap)`;
if (seq_start == frag_end)
return overlap;
}
We iterate over the range of valid overlaps:
- starting from the largest possible overlap (the size of the candidate fragment)
- reducing the size of the overlap at each iteration
- until the overlap is zero
We extract the part of the current sequence that corresponds to that overlap
- to do this, we use the string
substr()
method - and likewise for the candidate sequence – though it is now the end of the string
Finally, we check whether both substrings match
- If so, we can immediately return the size of the current overlap
.explain-bottom[
Note that in this call to substr()
, we only provided one argument, whereas we
had provided two arguments in the previous call (position and length of the
substring).
This is because the arguments for this call have been given *default values*: - if unspecified, the position is zero, and the length is the maximum value possible. - we will cover [default arguments in C++](https://www.geeksforgeeks.org/default-arguments-c/) later in the course. ]
Here is our function definition in full:
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
}
return 0;
}
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
return overlap;
}
return 0;
}
.explain-bottom[ We need to modify our code to also check for overlap at the other end of the sequence ]
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end)
* return overlap;
}
return 0;
}
.explain-bottom[ We can no longer return the overlap here, since there may be a larger overlap at the other end.
We need to check both ends, and return the larger of the two. ]
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
* int largest_overlap = 0;
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
.explain-top[ To do this, we declare an integer variable to hold the value of the largest overlap detected so far ]
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
if (fragment.size() > sequence.size())
throw std::runtime_error ("fragment size larger than current sequence!");
std::cerr << "sequence is \"" << sequence << "\"\n";
std::cerr << "trying fragment \"" << fragment << "\"\n";
int largest_overlap = 0;
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
* if (seq_start == frag_end) {
* largest_overlap = overlap;
* break;
* }
}
...
.explain-top[
Now, instead of returning the value of the current overlap, we record this
value, and use the break
statement to stop the enclosing loop and
proceed with the following code.
]
...
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
const auto seq_end = sequence.substr(sequence.size() - overlap);
const auto frag_start = fragment.substr(0, overlap);
if (seq_end == frag_start) {
largest_overlap = -overlap;
break;
}
}
return largest_overlap;
}
--
.explain-top[ We now check the other end of the string, using the same idea as before – with a few differences ]
...
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
* const auto seq_end = sequence.substr(sequence.size() - overlap);
* const auto frag_start = fragment.substr(0, overlap);
if (seq_end == frag_start) {
largest_overlap = -overlap;
break;
}
}
return largest_overlap;
}
.explain-top[ This time, we extract the substrings from the opposite ends of both the current sequence and the candidate fragment ]
...
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
for (int overlap = fragment.size(); overlap > `largest_overlap`; --overlap) {
const auto seq_end = sequence.substr(sequence.size() - overlap);
const auto frag_start = fragment.substr(0, overlap);
if (seq_end == frag_start) {
largest_overlap = -overlap;
break;
}
}
return largest_overlap;
}
.explain-top[ There is no point iterating all the way to zero, since we already know the size of the largest overlap from the previous loop
- there is no point checking smaller overlaps than that! ]
...
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
const auto seq_end = sequence.substr(sequence.size() - overlap);
const auto frag_start = fragment.substr(0, overlap);
if (seq_end == frag_start) {
* largest_overlap = -overlap;
break;
}
}
return largest_overlap;
}
.explain-top[ ... and we record the overlap as a negative value.
- this will be our way of representing an overlap at the tail end of the sequence
- this is guaranteed to be the largest overlap so far, since in this
for
loop, we are only considering overlaps larger than encountered at the other end of the sequence ]
...
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
const auto seq_end = sequence.substr(sequence.size() - overlap);
const auto frag_start = fragment.substr(0, overlap);
if (seq_end == frag_start) {
largest_overlap = -overlap;
break;
}
}
* return largest_overlap;
}
.explain-top[ Finally, we end by returning the largest overlap recorded in both loops ]
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
...
int biggest_overlap = 0;
int fragment_with_biggest_overlap = -1;
for (int n = 0; n < fragments.size(); ++n) {
const auto overlap = compute_overlap (sequence, fragments[n]);
if (std::abs (biggest_overlap) < std::abs (overlap)) {
biggest_overlap = overlap;
fragment_with_biggest_overlap = n;
}
}
if (fragment_with_biggest_overlap >= 0)
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
write_sequence (args[2], sequence);
}
...
int biggest_overlap = 0;
int fragment_with_biggest_overlap = -1;
* for (int n = 0; n < fragments.size(); ++n) {
const auto overlap = compute_overlap (sequence, fragments[n]);
if (std::abs (biggest_overlap) < std::abs (overlap)) {
biggest_overlap = overlap;
fragment_with_biggest_overlap = n;
}
* }
if (fragment_with_biggest_overlap >= 0)
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
write_sequence (args[2], sequence);
}
.explain-bottom[ We iterate over all candidate fragments
- we use a regular
for
loop (rather than a range-basedfor
loop) since we want to keep track of the index of the fragment ]
...
int biggest_overlap = 0;
int fragment_with_biggest_overlap = -1;
for (int n = 0; n < fragments.size(); ++n) {
* const auto overlap = compute_overlap (sequence, fragments[n]);
if (std::abs (biggest_overlap) < std::abs (overlap)) {
biggest_overlap = overlap;
fragment_with_biggest_overlap = n;
}
}
if (fragment_with_biggest_overlap >= 0)
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
write_sequence (args[2], sequence);
}
.explain-bottom[ We use our new function to compute the overlap for the current candidate fragment
- we store the result in a block scope variable
- we declare this variable
const
since we have no intention of modifying it any further- this is good practice as we can rely on the compiler to check that we do not trip ourselves up!
- we allow the compiler to deduce the type using the
auto
keyword- the compiler will assume we want to use the type returned by
compute_overlap()
- we could just easily have declared it as
int
... ]
- the compiler will assume we want to use the type returned by
...
* int biggest_overlap = 0;
int fragment_with_biggest_overlap = -1;
for (int n = 0; n < fragments.size(); ++n) {
const auto overlap = compute_overlap (sequence, fragments[n]);
* if (std::abs (biggest_overlap) < std::abs (overlap)) {
biggest_overlap = overlap;
fragment_with_biggest_overlap = n;
}
}
if (fragment_with_biggest_overlap >= 0)
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
write_sequence (args[2], sequence);
}
.explain-bottom[
We can then compare the extracted portions using the ==
comparison operator
If they match, we can immediately return the value of the overlap ]
.explain-bottom[ We check whether the overlap for the current fragment exceeds the biggest overlap we have detected so far
- since we have allowed overlaps to be negative, we need to take the absolute value of the overlaps prior to comparison
- we also need to make sure we have declared a variable to hold the value of the biggest overlap detected so far, and initialise it with the smallest value possible ]
...
int biggest_overlap = 0;
* int fragment_with_biggest_overlap = -1;
for (int n = 0; n < fragments.size(); ++n) {
const auto overlap = compute_overlap (sequence, fragments[n]);
if (std::abs (biggest_overlap) < std::abs (overlap)) {
* biggest_overlap = overlap;
* fragment_with_biggest_overlap = n;
}
}
if (fragment_with_biggest_overlap >= 0)
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
write_sequence (args[2], sequence);
}
.explain-bottom[
If we make through to the last iteration, there is no match, so we simply
return 0;
]
.explain-bottom[ If the overlap is the biggest so far, record its value, and the index of the corresponding fragment
- to do this, we also need to declare a variable to hold the index of the corresponding fragment
- we also need to initialise the value of this index. In this case, it's best to assign it an invalid value (for example, a negative index) ]
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
Once we've identified the candidate fragment with the largest overlap, we need to merge it with the current sequence to produce a longer, updated sequence:
--
`CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAA`CTGAATGCTTGGGCTGAA`
--
.center[ ⇓ ]
CCCTCATCACAA`CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
--
Let's implement a function to do this. This is what our function *declaration* should look like. We can add it to `overlap.h`: ``` void merge (std::string& sequence, const std::string& fragment, const int overlap); ```
Next, add our function definition to overlap.cpp
:
void merge (std::string& sequence, const std::string& fragment, const int overlap)
{
if (overlap < 0) {
sequence += fragment.substr (-overlap);
}
else if (overlap > 0) {
sequence = fragment.substr (0, fragment.size()-overlap) + sequence;
}
}
-- .explain-bottom[
- we concatenate the non-overlapping portion of the fragment with the sequence
- the sign of overlap indicates which end the overlap corresponds to
- hence whether to append or prepend the fragment to the sequence ]
Use merge()
function in main()
:
...
if (fragment_with_biggest_overlap >= 0) {
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
* merge (sequence, fragments[fragment_with_biggest_overlap], biggest_overlap);
}
std::cerr << "final sequence has length " << sequence.size() << "\n";
write_sequence (args[2], sequence);
}
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
* while (fragments.size()) {
* std::cerr << "---------------------------------------------------\n";
* std::cerr << fragments.size() << " fragments left\n";
int biggest_overlap = 0;
int fragment_with_biggest_overlap = -1;
for (int n = 0; n < static_cast<int> (fragments.size()); ++n) {
...
}
* if (fragment_with_biggest_overlap < 0)
* break;
* if (std::abs (biggest_overlap) < 10)
* break;
std::cerr << "fragment with biggest overlap is at index "
<< fragment_with_biggest_overlap
<< ", overlap = " << biggest_overlap << "\n";
merge (sequence, fragments[fragment_with_biggest_overlap], biggest_overlap);
* fragments.erase (fragments.begin() + fragment_with_biggest_overlap);
* }
-
Add a function to compute the overlap between the current sequence and a candidate fragment
- make sure it works for both ends of the string
-
Use this function to identify the candidate fragment with the largest overlap with current sequence
-
Add a function to merge this candidate fragment with the current sequence, given the computed overlap
-
Use these functions to iteratively merge candidates fragments until no overlapping fragments remain
-
Check that all unmerged fragments are already contained within the sequence
...
fragments.erase (fragments.begin() + fragment_with_biggest_overlap);
}
* int num_unmatched = 0;
* for (const auto& frag : fragments) {
* if (sequence.find (frag) == std::string::npos)
* ++num_unmatched;
* }
* if (num_unmatched)
* std::cerr << "WARNING: " << num_unmatched << " fragments remain unmatched!\n";
std::cerr << "final sequence has length " << sequence.size() << "\n";
write_sequence (args[2], sequence);
}
name: cmdline_options
class: section
We've already seen examples of command-line options in other commands:
$ ls `-l`
$ g++ `-std=c++20` shotgun.cpp `-o` shotgun
--
The purpose of a command-line option is typically to alter the behaviour of the command in some way
--
Can we use that to request additional debugging information?
- it is a common practice for commands to provide an option to enable verbose mode
- this is really useful to figure out exactly what is going wrong when running your program!
--
A typical option to enable verbose mode is -v
- let's try to implement this in our project
There are many ways to handle command-line options
- indeed, there are many libraries dedicated to command-line parsing...
--
Have a go at implementing this simple approach to detect the -v
option:
- start by checking whether the option has been provided in the arguments list
- if it has, then:
- take whatever action is appropriate (e.g. set a
verbose
variable totrue
) - remove that option from the argument list
- take whatever action is appropriate (e.g. set a
- process your arguments as previously
--
.explain-bottom[
Hint: there are many ways of doing this, but you may find the new C++20
std::erase()
function very useful here (if you can work out how to use it from the
documentation...)
]
Possible solution:
...
void run (std::vector<std::string>& args)
{
* bool verbose = std::erase (args, "-v");
if (args.size() < 3)
throw std::runtime_error ("expected 2 arguments: input_fragments output_sequence");
...
}
--
std::erase()
will erase any occurence of the value ("-v"
) from the
container (args
), and return the number of occurences removed.
- if no
-v
in the argument list, the return value is zero, which equates tofalse
when assigned to abool
- if there were any occurences of
-v
, a non-zero value will be returned, which equates totrue
One problem with the previous solution is that verbose
is a local variable
- it is only accessible within the
run()
function
--
If we call other functions, the code in these functions will have no knowledge
or access to our verbose
variable
--
We could pass our verbose
variable as an argument to all relevant functions
- but this very rapidly becomes cumbersome and impractical
--
A better option in this particular case is to use a global variable
A global variable is declared outside of any function and is accessible from anywhere in the code
- provided its declaration is encountered before the point of use within each translation unit
--
In general, global variables are strongly discouraged
- they introduce scope for state changes that may affect how different part of the program operate, and make it difficult to ensure predictable behaviour (see e.g. this article for details)
--
There are however cases where they make sense:
- for constants (in which case they would also be declared
const
) -- - when they represent a truly unique entity
- examples include
std::cout
,std::cerr
: there can only be one of each in any program - in our case, there can only ever be one setting representing verbose mode!
- examples include
We could declare our global variable like this:
...
*bool verbose = false;
...
void run (std::vector<std::string>& args)
{
* verbose = std::erase (args, "-v");
...
But there are still issues with this:
- this is declared within our
shotgun.cpp
file – it won't be accessible in our othercpp
files- they won't be in the same translation unit --
- what happens if another function declares their own local variable called
verbose
?
name: scope
The scope of a variable determines its lifetime and where it can be accessed from.
Global scope
- Variables are instantiated (created) before
main()
starts, and destroyed aftermain()
returns - they are accessible from any part of the code
- provided their declaration occurs before their point of use in the current translation unit
--
Function scope
- Variables are instantiated when declared, and destroyed when the function returns
- They are only accessible within their enclosing function, after their declaration
- Each invocation of a given function will have its own version of its local variables
--
Block scope
- Block scope means within a compound statement (e.g. within a
for
loop orif
statement) - Variables are instantiated when declared, and destroyed at the end of the block
- They are only accessible within their block, after their declaration
name: shadowing
Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes
int count = 0; // <= global scope
...
void compute ()
{
int count = 10; // <= function scope
...
for (int count = 0; count < 20; ++count) { // <= block scope
std::cout << count << "\n";
...
}
std::cout << count << "\n";
}
--
This is perfectly legal in C++!
Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes
*int count = 0; // <= global scope
...
void compute ()
{
* int count = 10; // <= function scope
...
for (int count = 0; count < 20; ++count) { // <= block scope
std::cout << count << "\n";
...
}
std::cout << count << "\n";
}
This is perfectly legal in C++!
.explain-bottom[ The variable at function scope hides the variable of the same name at global scope
- at this point, any reference to
count
is assumed to refer to the function scope variable - the global scope version of
count
still exists, but is no longer accessible by that name ]
Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes
int count = 0; // <= global scope
...
void compute ()
{
* int count = 10; // <= function scope
...
* for (int count = 0; count < 20; ++count) { // <= block scope
std::cout << count << "\n";
...
}
std::cout << count << "\n";
}
This is perfectly legal in C++!
.explain-top[ Similarly, the variable at block scope hides the variable of the same name at function scope
- at this point, any reference to
count
is assumed to refer to the block scope variable - the function and global scope versions of
count
still exist, but are no longer accessible by that name ]
Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes
int count = 0; // <= global scope
...
void compute ()
{
int count = 10; // <= function scope
...
for (int count = 0; count < 20; ++count) { // <= block scope
* std::cout << count << "\n";
...
}
std::cout << count << "\n";
}
This is perfectly legal in C++!
.explain-top[
This line will print the block scope version of count
]
Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes
int count = 0; // <= global scope
...
void compute ()
{
int count = 10; // <= function scope
...
for (int count = 0; count < 20; ++count) { // <= block scope
std::cout << count << "\n";
...
}
* std::cout << count << "\n";
}
This is perfectly legal in C++!
.explain-top[
... but this line will print the function scope version of count
, since it
is outside the block
- the block scope version has been destroyed by that point ]
Variable shadowing allows a function to use local variables without worrying about whether some other global variable exists elsewhere with the same name
--
... but it can also lead to subtle errors!
--
Name collisions are a general problem in computing
- many projects will rely on functionality provided by other projects
- difficult to avoid using the same names across different projects!
--
Most of these problems can be avoided by:
- not using global variables in the first place! --
- using a dedicated C++ namespace
name: namespace
class: section
A namespace is essentially a named scope for your function and variable declarations
--
We have already been using a namespace from the start: the std
namespace
- this namespace is where all the functionality provided by the C++ standard is declared
--
.columns[ .col[
std::vector
- the
vector
class (a type) declared within thestd
namespace ] .col[
std::cout
- a variable (actually of type
std::ostream
) calledcout
declared within thestd
namespace ] ]
A namespace is essentially a named scope for your function and variable declarations
We have been using a namespace from the start: the std
namespace
- this namespace is where all the functionality provided by the C++ standard is declared
.columns[ .col[
`std`::vector
- the vector class (a type) declared within the
std
namespace ] .col[
`std`::cout
-
a variable (actually of type
std::ostream
) calledcout
declared within thestd
namespace ] ] -
these are both declared within the
std
namespace
A namespace is essentially a named scope for your function and variable declarations
We have been using a namespace from the start: the std
namespace
- this namespace is where all the functionality provided by the C++ standard is declared
.columns[ .col[
std`::`vector
- the vector class (a type) declared within the
std
namespace ] .col[
std`::`cout
-
a variable (actually of type
std::ostream
) calledcout
declared within thestd
namespace ] ] -
these are both declared within the
std
namespace -
accessing members of a namespace is done with the scope resolution operator (
::
)
The main purpose of C++ namespace is:
- to organise your code into logical units
- to avoid name collisions
--
For example, we may very well want to declare a variable or function called
count
- but there is already a C++ STL
count()
algorithm!
Thankfully, it is declared within the std
namespace
- we can declare our variable or function
count
without the risk of name collisions withstd::count()
!
Declaring variables or functions within a namespace is simply a matter of declaring them within the scope of our namespace.
--
For example:
namespace debug {
bool verbose = false;
}
This declares our variable verbose
to be within the debug
namespace
- this also implicitly declares the
debug
namespace if it hadn't already been encountered
Any code also declared within our debug
namespace can already access our
verbose
variable directly.
--
When accessing our variable outside of the debug
namespace, we can refer to
it using its fully qualified name:
if (debug::verbose)
...
--
Alternatively, we can selectively bring one identifier (variable or function)
into the current scope with the using
declaration:
using debug::verbose;
...
if (verbose)
...
We can also bring all identifiers declared in a namespace into the current scope with the using namespace
directive:
using namespace debug;
...
if (verbose)
...
--
The using namespace
directive can be problematic
- the general recommendation is to avoid its use
- especially at global scope
--
It is much better to use the fully qualified name, or if it really helps code
readability, use the selective using
declaration for the specific identifiers you
need
- and only use it at function or block scope
In spite of the general recommendation to avoid blanket using namespace
directives, you will find that many (if not most) C++ tutorials make use of
this directive from the very beginning, starting at 'hello world':
#include <iostream>
*using namespace std;
int main() {
cout << "Hello World!\n";
return 0;
}
--
This brings everything declared within the std
namespace into the global scope
- this helps to keep the code cleaner by avoiding the need for
std::
as a prefix forstd::cerr
,std::vector
,std::string
, ... - it makes it a bit easier when getting started with C++
So why do we not follow this convention on this course?
- because its use is considered bad practice --
- on this course, we try to teach best practice from the outset to avoid picking up bad habits
--
In practice, the using namespace
directive is strongly discouraged
- it bring everything from that namespace into the current scope
- including variables or functions you might not have known about!
- it negates the name collision benefits that C++ namespaces were meant to address!
--
It is particularly bad practice to make use of using namespace
directives
at global scope within header files
- it's acceptable to use
using namespace
within your owncpp
files- it would still not be considered good practice!
- but you can manage name collisions within your own project --
- but header files are supposed to be used by other code!
- any such global declaration will pollute the namespace of any code that
#include
s your header - it will affect the entire translation unit from the point your header is
#include
d
- any such global declaration will pollute the namespace of any code that
For the reasons outlined in the previous slides, we recommend to always use the fully qualified name to access members of a namespace
--
Coming back to our project, we can declare our verbose
variable within a new
debug
namespace to avoid name collisions
namespace debug {
bool verbose = false;
}
and refer to this variable by its fully qualified name: debug::verbose
--
But we still need to work out where to declare this variable
As previously mentioned, our debug::verbose
variable must be declared prior
to its point of use in any translation unit where it is needed
- this means it needs to be declared in a header file
--
Let's create a header file debug.h
to hold this variable:
#pragma once
namespace debug {
bool verbose = false;
}
--
We can now #include
it in shotgun.cpp
, fragments.h
and overlap.cpp
--
.explain-bottom[ Exercise: implement these changes in your code ]
When trying to do this, you will most likely have encountered errors of this nature:
/usr/bin/ld: fragments.o:(.bss+0x0): multiple definition of `debug::verbose';
shotgun.o:(.bss+0x0): first defined here
/usr/bin/ld: overlap.o:(.bss+0x0): multiple definition of `debug::verbose';
shotgun.o:(.bss+0x0): first defined here
collect2: error: ld returned 1 exit status
--
The issue is that we have declared and implicitly defined our
debug::verbose
variable once per translation unit
- we now have 3 versions of the same global variable, each independently listed in the
object files
shotgun.o
,fragments.o
andoverlap.o
-- - this causes an error at the linker stage, when the contents of all the compiled object files are brought together to form the final executable: which version is the right one to use?
--
There are 2 ways to get around this:
- ensure the variable is only declared in the header, and provide a single
definition in a separate
cpp
file - tell the compiler to allow multiple definitions of this variable
class: info
First approach:
Use the extern
keyword in
debug.h
to make it clear that we are only declaring our variable at this
point:
#pragma once
namespace debug {
`extern` bool verbose;
}
--
and place the full definition and initialisation in a new file debug.cpp
:
#include "debug.h"
namespace debug {
bool verbose = false;
}
name: inline_variable
Second approach (much simpler):
Use the inline
keyword in our debug.h
header file to allow multiple definitions
of the same variable to be provided across multiple translation units:
#pragma once
namespace debug {
`inline` bool verbose = false;
}
We do not need an additional cpp
file in this case, and the initial value can
now be specified at the same time
--
Note that this use of the inline
keyword was introduced in the C++17 version
of the standard
--
.explain-bottom[
Let's go ahead and use the inline
keyword in our project
]
We now have all the pieces required to implement our verbose option
- we add the
debug.h
header file as per the previous slide -- - in
main()
:- we use the
std::erase()
call to detect and handle the-v
option - and set
debug::verbose
totrue
if detected - we do this before any attempt at using the command-line arguments --
- we use the
- everywhere else:
#include "debug.h"
so thedebug::verbose
variable is declared at the point of use- check if
debug::verbose
istrue
, and if so print out appropriate information
--
For example, in fragments.cpp
:
...
std::vector<std::string> load_fragments (const std::string& filename)
{
* if (debug::verbose)
std::cerr << "reading fragments from file \"" << filename << "\"...\n";
...
--
.explain-bottomright[ Exercise: implement these changes in your own code ]
We can now run our code as normal:
$ ./shotgun ../data/fragments-1.txt out
initial sequence has size 1000
final sequence has length 20108
which provides a clean output, showing just as much information as we might expect when everything works correctly
... or we can run it with the verbose option to see exactly what is going on:
$ ./shotgun ../data/fragments-1.txt out `-v`
reading fragments from file "../data/fragments-1.txt"...
read 190 fragments
190 fragments, mean fragment length: 529.158, range [ 51 1000 ]
initial sequence has size 1000
---------------------------------------------------
189 fragments left
fragment with biggest overlap is at index 41, overlap = -824
---------------------------------------------------
...
fragment with biggest overlap is at index 37, overlap = 51
---------------------------------------------------
104 fragments left
104 fragments remaining unmatched - checking whether already contained in sequence...
final sequence has length 20108
writing sequence to file "out"...
The availability of a verbose option allows you and your users to quickly narrow down problems
-
the default output can be kept clean and minimal
-
when issues occur, running in verbose mode provide useful information:
- in the case of crashes, it provides a more precise indication of when the problem occurred (as indicated by the last message), and hence where to look in the code to figure out the problem
- in the case of unexpected output, the information reported may point out issues with the input data, or provide clues as to where to look for errors in the code
-
when there are performance issues, verbose mode can provide an indication of which stage is taking longer to complete than expected
-
...
--
In general, it is recommended to provide some mechanism to allow users to get detailed information about the processing, whether using a verbose option, or by way of a separate log file
Now that we have the option to provide debugging output, we can wrap up this
functionality more cleanly to avoid having lots of if (debug::verbose)
statements throughout our code
Let's implement a debug::log()
function, which will take a std::string
as
its argument, and print it to the standard error stream if verbose mode is
enabled
- it can additionally format its output to more clearly label it as debugging
information (for example, by prefixing each line with
[DEBUG]
)
--
.explain-bottom[ Exercise: implement this function and make use of it in your own code ]
In debug.h
:
#pragma once
*include <iostream>
*include <string>
namespace debug {
inline bool verbose = false;
* inline void log (const std::string& message) {
* if (verbose)
* std::cerr << "[DEBUG] " << message << "\n";
* }
}
name: inline_function
In debug.h
:
#pragma once
include <iostream>
include <string>
namespace debug {
inline bool verbose = false;
`inline` void log (const std::string& message) {
if (verbose)
std::cerr << "[DEBUG] " << message << "\n";
}
}
Note the use of the inline
keyword here, this time to allow multiple definitions of the same function
name: inline_keyword
Historically, the inline
keyword was considered more as a performance
optimisation feature
- it was used for functions only, to indicate that a given function was suitable for inlining
- this would suggest to the compiler that the required code could be substituted directly at the point of use, rather than calling that function explicitly
- There is always a small overhead to calling a function, and for functions that perform small operations and are called very often, this can impact performance
- declaring such functions as
inline
can help avoid the overhead of dedicated function calls, and so improve performance- this is only possible if the function definition is available, which is
why the
inline
mechanism was introduced to allow definitions to be included in header files without causing the multiple definition problem
- this is only possible if the function definition is available, which is
why the
- For further details, refer to other online sources (e.g. GeeksForGeeks)
--
In modern C++ (particularly since C++17), the inline
keyword is used to allow multiple definitions of
functions or variables across multiple translation units
- although it does allow the compiler to perform the kinds of optimisation mentioned above, the compiler is free to treat this merely as a hint
We can now use our convenience function in our code, for example in fragments.cpp
:
...
std::vector<std::string> load_fragments (const std::string& filename)
{
* if (debug::verbose)
* std::cerr << "reading fragments from file \"" << filename << "\"...\n";
...
becomes simply:
...
std::vector<std::string> load_fragments (const std::string& filename)
{
* debug::log ("reading fragments from file \"" + filename + "\"...");
...
--
.explain-top[
Note the use of the +
string concatenation operator instead of the stream
insertion operator (<<
)
This is because our `debug::log()` function expects a single `std::string` argument ]
How do we handle this case?
if (debug::verbose)
std::cerr << "read " << fragments.size() << " fragments\n";
--
We can't write:
debug::log ("read " + fragments.size() + " fragments");
since fragment.size()
returns an int
, and we can't concatenate a
std::string
with an int
!
--
Use std::to_string()
, which returns a std::string
representation of a
variable:
debug::log ("read " + std::to_string (fragments.size()) + " fragments");
-- .explain-bottom[ Exercise: implement these changes in your own code ]
class: section name: exercises