Skip to content

Latest commit

 

History

History
2476 lines (1812 loc) · 58.5 KB

week2B.md

File metadata and controls

2476 lines (1812 loc) · 58.5 KB
layout title
presentation
Week 2, session 2: more string handling, basic debugging

class: title

5CCYB041

OBJECT-ORIENTED PROGRAMMING

Week 2, session 2

More string handling
Basic debugging


Picking up where we left off

We continue working on our DNA shotgun sequencing project

You can find the most up to date version in the project's solution/ folder

.explain-bottom[ Make sure your code is up to date now! ]


layout: true

Computing the overlap

Let's imagine that this is the current estimate of the sequence:

CTGAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGCAGACG

and this is a candidate fragment:

CCCTCATCACAACTGAATGCTTGGGCTGAA

We could compute the overlap by comparing the overlapping region, starting from the smallest overlap:



                             `C`TGAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
CCCTCATCACAACTGAATGCTTGGGCTGA`A`

                             `CT`GAATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
 CCCTCATCACAACTGAATGCTTGGGCTG`AA`

                             `CTG`AATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
  CCCTCATCACAACTGAATGCTTGGGCT`GAA`

                             `CTGA`ATGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
   CCCTCATCACAACTGAATGCTTGGGC`TGAA`

                             `CTGAA`TGCTTGGGCTGAAAGGGCGCGAGACGTATTCCCCGGTTGC
    CCCTCATCACAACTGAATGCTTGGG`CTGAA`

We have a match with an overlap of 5 bases!

  • should we stop now?

                             `CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
                 CCCTCATCACAA`CTGAATGCTTGGGCTGAA`

No! If we carry on, we'll find a larger overlap


layout: true

Computing the overlap

A better strategy is to start with the biggest overlap first, and gradually reduce it

  • that way, if we do find a match, we can stop immediately


                             `CTGAATGCTTGGGCTGAAAGGGCGCGAGAC`GTATTCCCCGGTTGC
                             `CCCTCATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCGCGAGA`CGTATTCCCCGGTTGC
                            C`CCTCATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCGCGAG`ACGTATTCCCCGGTTGC
                           CC`CTCATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCGCGA`GACGTATTCCCCGGTTGC
                          CCC`TCATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCGCG`AGACGTATTCCCCGGTTGC
                         CCCT`CATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCGC`GAGACGTATTCCCCGGTTGC
                        CCCTC`ATCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGCG`CGAGACGTATTCCCCGGTTGC
                       CCCTCA`TCACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGGC`GCGAGACGTATTCCCCGGTTGC
                      CCCTCAT`CACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGGG`CGCGAGACGTATTCCCCGGTTGC
                     CCCTCATC`ACAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAGG`GCGCGAGACGTATTCCCCGGTTGC
                    CCCTCATCA`CAACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAAG`GGCGCGAGACGTATTCCCCGGTTGC
                   CCCTCATCAC`AACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAAA`GGGCGCGAGACGTATTCCCCGGTTGC
                  CCCTCATCACA`ACTGAATGCTTGGGCTGAA`

                             `CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
                 CCCTCATCACAA`CTGAATGCTTGGGCTGAA`

Now as soon as we find an overlap, it is guaranteed to be the largest


layout: false

Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence

--

.explain-bottom[ Exercise: implement code to do this ]


Computing the overlap

We create a header file overlap.h to hold our function declaration

#pragma once

#include <string>

int compute_overlap (const std::string& sequence, const std::string& fragment);

Computing the overlap

We create a header file overlap.h to hold our function declaration

#pragma once

#include <string>

`int` compute_overlap (const std::string& sequence, const std::string& fragment);
  • we return an integer representing the number of bases in the overlap

Computing the overlap

We create a header file overlap.h to hold our function declaration

#pragma once

#include <string>

int `compute_overlap` (const std::string& sequence, const std::string& fragment);
  • we return an integer representing the number of bases in the overlap
  • we chose a suitable name for our function that clearly states its purpose

Computing the overlap

We create a header file overlap.h to hold our function declaration

#pragma once

#include <string>

int compute_overlap (`const std::string& sequence, const std::string& fragment`);
  • we return an integer representing the number of bases in the overlap
  • we chose a suitable name for our function that clearly states its purpose
  • we pass both the sequence and candidate fragment as const references to strings

Computing the overlap

We create a header file overlap.h to hold our function declaration

#pragma once

*#include <string>

int compute_overlap (const std::string& sequence, const std::string& fragment);
  • we return an integer representing the number of bases in the overlap
  • we chose a suitable name for our function that clearly states its purpose
  • we pass both the sequence and candidate fragment as const references to strings
  • we need to #include the <string> header since we are using std::string

Computing the overlap

We create a header file overlap.h to hold our function declaration

*#pragma once

#include <string>

int compute_overlap (const std::string& sequence, const std::string& fragment);
  • we return an integer representing the number of bases in the overlap
  • we chose a suitable name for our function that clearly states its purpose
  • we pass both the sequence and candidate fragment as const references to strings
  • we need to #include the <string> header since we are using std::string
  • we use the #pragma once directive to avoid double-inclusion

Computing the overlap

Next, we create the overlap.cpp file to hold our function definition

#include <iostream>
#include <stdexcept>
#include <string>

#include "overlap.h"

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  // this is where our code will go
}

Computing the overlap

Next, we create the overlap.cpp file to hold our function definition

#include <iostream>
#include <stdexcept>
#include <string>

#include "overlap.h"

*int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  // this is where our code will go
}
  • we need to replicate the function declaration, making sure it matches exactly

Computing the overlap

Next, we create the overlap.cpp file to hold our function definition

#include <iostream>
#include <stdexcept>
#include <string>

*#include "overlap.h"

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  // this is where our code will go
}
  • we need to replicate the function declaration, making sure it matches exactly
  • we need to #include the matching header file

Computing the overlap

Next, we create the overlap.cpp file to hold our function definition

*#include <iostream>
*#include <stdexcept>
*#include <string>

#include "overlap.h"

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  // this is where our code will go
}
  • we need to replicate the function declaration, making sure it matches exactly
  • we need to #include the matching header file
  • we need to #include the system headers for the functionality we will be using

Computing the overlap

Next, we create the overlap.cpp file to hold our function definition

#include <iostream>
#include <stdexcept>
#include <string>

#include "overlap.h"

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
* // this is where our code will go
}
  • we need to replicate the function declaration, making sure it matches exactly
  • we need to #include the matching header file
  • we need to #include the system headers for the functionality we will be using
  • now we can worry about what will go in the body of our function

Computing the overlap

Now we fill out the body of our compute_overlap() function

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  ...

  return 0;
}

Computing the overlap

Now we fill out the body of our compute_overlap() function

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
* if (fragment.size() > sequence.size())
*   throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  ...

  return 0;
}
  • always check any assumptions and expectations upfront – this can save you a lot of frustration later...

Computing the overlap

Now we fill out the body of our compute_overlap() function

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

* std::cerr << "sequence is \"" << sequence << "\"\n";
* std::cerr << "trying fragment \"" << fragment << "\"\n";

  ...

  return 0;
}
  • always check any assumptions and expectations upfront – this can save you a lot of frustration later...
  • during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.

Computing the overlap

Now we fill out the body of our compute_overlap() function

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  ...

* return 0;
}
  • always check any assumptions and expectations upfront – this can save you a lot of frustration later...
  • during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.
  • we need to return an integer, so let's return a value of zero (no overlap) by default

Computing the overlap

Now we fill out the body of our compute_overlap() function

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

* ...

  return 0;
}
  • always check any assumptions and expectations upfront – this can save you a lot of frustration later...
  • during the development phase, don't hesitate to display lots of information, it will often alert you to where things might be going wrong.
  • we need to return an integer, so let's return a value of zero (no overlap) by default

.explain-bottom[ Now we can focus on what happens here... ]


Computing the overlap

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
  }

Computing the overlap

* for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
* }

We iterate over the range of valid overlaps:

  • starting from the largest possible overlap (the size of the candidate fragment)
  • reducing the size of the overlap at each iteration
  • until the overlap is zero

Computing the overlap

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
*   const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
  }

We iterate over the range of valid overlaps:

  • starting from the largest possible overlap (the size of the candidate fragment)
  • reducing the size of the overlap at each iteration
  • until the overlap is zero

We extract the part of the current sequence that corresponds to that overlap


Computing the overlap

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
*   const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
  }

We iterate over the range of valid overlaps:

  • starting from the largest possible overlap (the size of the candidate fragment)
  • reducing the size of the overlap at each iteration
  • until the overlap is zero

We extract the part of the current sequence that corresponds to that overlap

  • to do this, we use the string substr() method
  • and likewise for the candidate sequence – though it is now the end of the string

Computing the overlap

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
*   if (seq_start == frag_end)
*     return overlap;
  }

We iterate over the range of valid overlaps:

  • starting from the largest possible overlap (the size of the candidate fragment)
  • reducing the size of the overlap at each iteration
  • until the overlap is zero

We extract the part of the current sequence that corresponds to that overlap

  • to do this, we use the string substr() method
  • and likewise for the candidate sequence – though it is now the end of the string

Finally, we check whether both substrings match

  • If so, we can immediately return the size of the current overlap

Computing the overlap

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = `fragment.substr(fragment.size()-overlap)`;
    if (seq_start == frag_end)
      return overlap;
  }

We iterate over the range of valid overlaps:

  • starting from the largest possible overlap (the size of the candidate fragment)
  • reducing the size of the overlap at each iteration
  • until the overlap is zero

We extract the part of the current sequence that corresponds to that overlap

  • to do this, we use the string substr() method
  • and likewise for the candidate sequence – though it is now the end of the string

Finally, we check whether both substrings match

  • If so, we can immediately return the size of the current overlap

.explain-bottom[ Note that in this call to substr(), we only provided one argument, whereas we had provided two arguments in the previous call (position and length of the substring).


This is because the arguments for this call have been given *default values*: - if unspecified, the position is zero, and the length is the maximum value possible. - we will cover [default arguments in C++](https://www.geeksforgeeks.org/default-arguments-c/) later in the course. ]

Computing the overlap

Here is our function definition in full:

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
  }

  return 0;
}

Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence


Computing the overlap

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
      return overlap;
  }

  return 0;
}

.explain-bottom[ We need to modify our code to also check for overlap at the other end of the sequence ]


Computing the overlap

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end)
*     return overlap;
  }

  return 0;
}

.explain-bottom[ We can no longer return the overlap here, since there may be a larger overlap at the other end.


We need to check both ends, and return the larger of the two. ]

Computing the overlap

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

* int largest_overlap = 0;
  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  ...

.explain-top[ To do this, we declare an integer variable to hold the value of the largest overlap detected so far ]


Computing the overlap

int compute_overlap (const std::string& sequence, const std::string& fragment)
{
  if (fragment.size() > sequence.size())
    throw std::runtime_error ("fragment size larger than current sequence!");

  std::cerr << "sequence is \"" << sequence << "\"\n";
  std::cerr << "trying fragment \"" << fragment << "\"\n";

  int largest_overlap = 0;
  for (int overlap = fragment.size(); overlap > 0; --overlap) {
    const auto seq_start = sequence.substr(0, overlap);
    const auto frag_end = fragment.substr(fragment.size()-overlap);
*   if (seq_start == frag_end) {
*     largest_overlap = overlap;
*     break;
*   }
  }

  ...

.explain-top[ Now, instead of returning the value of the current overlap, we record this value, and use the break statement to stop the enclosing loop and proceed with the following code. ]


Computing the overlap

    ...
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
    const auto seq_end = sequence.substr(sequence.size() - overlap);
    const auto frag_start = fragment.substr(0, overlap);
    if (seq_end == frag_start) {
      largest_overlap = -overlap;
      break;
    }
  }

  return largest_overlap;
}

--

.explain-top[ We now check the other end of the string, using the same idea as before – with a few differences ]


Computing the overlap

    ...
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
*   const auto seq_end = sequence.substr(sequence.size() - overlap);
*   const auto frag_start = fragment.substr(0, overlap);
    if (seq_end == frag_start) {
      largest_overlap = -overlap;
      break;
    }
  }

  return largest_overlap;
}

.explain-top[ This time, we extract the substrings from the opposite ends of both the current sequence and the candidate fragment ]


Computing the overlap

    ...
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  for (int overlap = fragment.size(); overlap > `largest_overlap`; --overlap) {
    const auto seq_end = sequence.substr(sequence.size() - overlap);
    const auto frag_start = fragment.substr(0, overlap);
    if (seq_end == frag_start) {
      largest_overlap = -overlap;
      break;
    }
  }

  return largest_overlap;
}

.explain-top[ There is no point iterating all the way to zero, since we already know the size of the largest overlap from the previous loop

  • there is no point checking smaller overlaps than that! ]

Computing the overlap

    ...
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
    const auto seq_end = sequence.substr(sequence.size() - overlap);
    const auto frag_start = fragment.substr(0, overlap);
    if (seq_end == frag_start) {
*     largest_overlap = -overlap;
      break;
    }
  }

  return largest_overlap;
}

.explain-top[ ... and we record the overlap as a negative value.

  • this will be our way of representing an overlap at the tail end of the sequence
  • this is guaranteed to be the largest overlap so far, since in this for loop, we are only considering overlaps larger than encountered at the other end of the sequence ]

Computing the overlap

    ...
    if (seq_start == frag_end) {
      largest_overlap = overlap;
      break;
    }
  }

  for (int overlap = fragment.size(); overlap > largest_overlap; --overlap) {
    const auto seq_end = sequence.substr(sequence.size() - overlap);
    const auto frag_start = fragment.substr(0, overlap);
    if (seq_end == frag_start) {
      largest_overlap = -overlap;
      break;
    }
  }

* return largest_overlap;
}

.explain-top[ Finally, we end by returning the largest overlap recorded in both loops ]


Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence


Identifying fragment with the largest overlap

  ...

  int biggest_overlap = 0;
  int fragment_with_biggest_overlap = -1;
  for (int n = 0; n < fragments.size(); ++n) {
    const auto overlap = compute_overlap (sequence, fragments[n]);
    if (std::abs (biggest_overlap) < std::abs (overlap)) {
      biggest_overlap = overlap;
      fragment_with_biggest_overlap = n;
    }
  }

  if (fragment_with_biggest_overlap >= 0)
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

  write_sequence (args[2], sequence);
}


Identifying fragment with the largest overlap

  ...

  int biggest_overlap = 0;
  int fragment_with_biggest_overlap = -1;
* for (int n = 0; n < fragments.size(); ++n) {
    const auto overlap = compute_overlap (sequence, fragments[n]);
    if (std::abs (biggest_overlap) < std::abs (overlap)) {
      biggest_overlap = overlap;
      fragment_with_biggest_overlap = n;
    }
* }

  if (fragment_with_biggest_overlap >= 0)
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

  write_sequence (args[2], sequence);
}

.explain-bottom[ We iterate over all candidate fragments

  • we use a regular for loop (rather than a range-based for loop) since we want to keep track of the index of the fragment ]

Identifying fragment with the largest overlap

  ...

  int biggest_overlap = 0;
  int fragment_with_biggest_overlap = -1;
  for (int n = 0; n < fragments.size(); ++n) {
*   const auto overlap = compute_overlap (sequence, fragments[n]);
    if (std::abs (biggest_overlap) < std::abs (overlap)) {
      biggest_overlap = overlap;
      fragment_with_biggest_overlap = n;
    }
  }

  if (fragment_with_biggest_overlap >= 0)
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

  write_sequence (args[2], sequence);
}

.explain-bottom[ We use our new function to compute the overlap for the current candidate fragment

  • we store the result in a block scope variable
  • we declare this variable const since we have no intention of modifying it any further
    • this is good practice as we can rely on the compiler to check that we do not trip ourselves up!
  • we allow the compiler to deduce the type using the auto keyword
    • the compiler will assume we want to use the type returned by compute_overlap()
    • we could just easily have declared it as int... ]

Identifying fragment with the largest overlap

  ...

* int biggest_overlap = 0;
  int fragment_with_biggest_overlap = -1;
  for (int n = 0; n < fragments.size(); ++n) {
    const auto overlap = compute_overlap (sequence, fragments[n]);
*   if (std::abs (biggest_overlap) < std::abs (overlap)) {
      biggest_overlap = overlap;
      fragment_with_biggest_overlap = n;
    }
  }

  if (fragment_with_biggest_overlap >= 0)
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

  write_sequence (args[2], sequence);
}

.explain-bottom[ We can then compare the extracted portions using the == comparison operator


If they match, we can immediately return the value of the overlap ]

.explain-bottom[ We check whether the overlap for the current fragment exceeds the biggest overlap we have detected so far

  • since we have allowed overlaps to be negative, we need to take the absolute value of the overlaps prior to comparison
  • we also need to make sure we have declared a variable to hold the value of the biggest overlap detected so far, and initialise it with the smallest value possible ]

Identifying fragment with the largest overlap

  ...

  int biggest_overlap = 0;
* int fragment_with_biggest_overlap = -1;
  for (int n = 0; n < fragments.size(); ++n) {
    const auto overlap = compute_overlap (sequence, fragments[n]);
    if (std::abs (biggest_overlap) < std::abs (overlap)) {
*     biggest_overlap = overlap;
*     fragment_with_biggest_overlap = n;
    }
  }

  if (fragment_with_biggest_overlap >= 0)
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

  write_sequence (args[2], sequence);
}

.explain-bottom[ If we make through to the last iteration, there is no match, so we simply return 0; ]

.explain-bottom[ If the overlap is the biggest so far, record its value, and the index of the corresponding fragment

  • to do this, we also need to declare a variable to hold the index of the corresponding fragment
  • we also need to initialise the value of this index. In this case, it's best to assign it an invalid value (for example, a negative index) ]

Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence


Merge fragment with current sequence

Once we've identified the candidate fragment with the largest overlap, we need to merge it with the current sequence to produce a longer, updated sequence:

--

                             `CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC
                 CCCTCATCACAA`CTGAATGCTTGGGCTGAA`

--

.center[ ⇓ ]

                 CCCTCATCACAA`CTGAATGCTTGGGCTGAA`AGGGCGCGAGACGTATTCCCCGGTTGC

--


Let's implement a function to do this. This is what our function *declaration* should look like. We can add it to `overlap.h`: ``` void merge (std::string& sequence, const std::string& fragment, const int overlap); ```

Merge fragment with current sequence

Next, add our function definition to overlap.cpp:

void merge (std::string& sequence, const std::string& fragment, const int overlap)
{
  if (overlap < 0) {
    sequence += fragment.substr (-overlap);
  }
  else if (overlap > 0) {
    sequence = fragment.substr (0, fragment.size()-overlap) + sequence;
  }
}

-- .explain-bottom[

  • we concatenate the non-overlapping portion of the fragment with the sequence
  • the sign of overlap indicates which end the overlap corresponds to
    • hence whether to append or prepend the fragment to the sequence ]

Merge fragment with current sequence

Use merge() function in main():

  ...

  if (fragment_with_biggest_overlap >= 0) {
    std::cerr << "fragment with biggest overlap is at index " 
      << fragment_with_biggest_overlap
      << ", overlap = " << biggest_overlap << "\n";

*   merge (sequence, fragments[fragment_with_biggest_overlap], biggest_overlap);
  }

  std::cerr << "final sequence has length " << sequence.size() << "\n";
  write_sequence (args[2], sequence);
}

Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence


Iteratively merge fragments into sequence

* while (fragments.size()) {
*   std::cerr << "---------------------------------------------------\n";
*   std::cerr << fragments.size() << " fragments left\n";
    int biggest_overlap = 0;
    int fragment_with_biggest_overlap = -1;
    for (int n = 0; n < static_cast<int> (fragments.size()); ++n) {
      ...
    }
*    if (fragment_with_biggest_overlap < 0)
*      break;
*    if (std::abs (biggest_overlap) < 10)
*      break;

    std::cerr << "fragment with biggest overlap is at index " 
              << fragment_with_biggest_overlap
              << ", overlap = " << biggest_overlap << "\n";

    merge (sequence, fragments[fragment_with_biggest_overlap], biggest_overlap);
*   fragments.erase (fragments.begin() + fragment_with_biggest_overlap);
* }

Exercises

  • Add a function to compute the overlap between the current sequence and a candidate fragment

    • make sure it works for both ends of the string
  • Use this function to identify the candidate fragment with the largest overlap with current sequence

  • Add a function to merge this candidate fragment with the current sequence, given the computed overlap

  • Use these functions to iteratively merge candidates fragments until no overlapping fragments remain

  • Check that all unmerged fragments are already contained within the sequence


Check all unmerged fragments are already in sequence

    ...
    fragments.erase (fragments.begin() + fragment_with_biggest_overlap);
  }

* int num_unmatched = 0;
* for (const auto& frag : fragments) {
*   if (sequence.find (frag) == std::string::npos)
*     ++num_unmatched;
* }

* if (num_unmatched)
*   std::cerr << "WARNING: " << num_unmatched << " fragments remain unmatched!\n";

  std::cerr << "final sequence has length " << sequence.size() << "\n";
  write_sequence (args[2], sequence);
}

name: cmdline_options

class: section

Command-line options

Requesting additional debugging information


Command-line options

We've already seen examples of command-line options in other commands:

$ ls `-l`
$ g++ `-std=c++20` shotgun.cpp `-o` shotgun

--

The purpose of a command-line option is typically to alter the behaviour of the command in some way

--

Can we use that to request additional debugging information?

  • it is a common practice for commands to provide an option to enable verbose mode
  • this is really useful to figure out exactly what is going wrong when running your program!

--

A typical option to enable verbose mode is -v

  • let's try to implement this in our project

Command-line options

There are many ways to handle command-line options

--

Have a go at implementing this simple approach to detect the -v option:

  • start by checking whether the option has been provided in the arguments list
  • if it has, then:
    • take whatever action is appropriate (e.g. set a verbose variable to true)
    • remove that option from the argument list
  • process your arguments as previously

--

.explain-bottom[ Hint: there are many ways of doing this, but you may find the new C++20 std::erase() function very useful here (if you can work out how to use it from the documentation...) ]


Detecting the -v option

Possible solution:

...

void run (std::vector<std::string>& args)
{
* bool verbose = std::erase (args, "-v");

  if (args.size() < 3)
    throw std::runtime_error ("expected 2 arguments: input_fragments output_sequence");

  ...
}

--

std::erase() will erase any occurence of the value ("-v") from the container (args), and return the number of occurences removed.

  • if no -v in the argument list, the return value is zero, which equates to false when assigned to a bool
  • if there were any occurences of -v, a non-zero value will be returned, which equates to true

Global variables

One problem with the previous solution is that verbose is a local variable

  • it is only accessible within the run() function

--

If we call other functions, the code in these functions will have no knowledge or access to our verbose variable

--

We could pass our verbose variable as an argument to all relevant functions

  • but this very rapidly becomes cumbersome and impractical

--

A better option in this particular case is to use a global variable


Global variables

A global variable is declared outside of any function and is accessible from anywhere in the code

  • provided its declaration is encountered before the point of use within each translation unit

--

In general, global variables are strongly discouraged

  • they introduce scope for state changes that may affect how different part of the program operate, and make it difficult to ensure predictable behaviour (see e.g. this article for details)

--

There are however cases where they make sense:

  • for constants (in which case they would also be declared const) --
  • when they represent a truly unique entity
    • examples include std::cout, std::cerr: there can only be one of each in any program
    • in our case, there can only ever be one setting representing verbose mode!

Global variables

We could declare our global variable like this:

...

*bool verbose = false;

...

void run (std::vector<std::string>& args)
{
* verbose = std::erase (args, "-v");

  ...

But there are still issues with this:

  • this is declared within our shotgun.cpp file – it won't be accessible in our other cpp files
    • they won't be in the same translation unit --
  • what happens if another function declares their own local variable called verbose?

name: scope

Variable scope

The scope of a variable determines its lifetime and where it can be accessed from.

Global scope

  • Variables are instantiated (created) before main() starts, and destroyed after main() returns
  • they are accessible from any part of the code
    • provided their declaration occurs before their point of use in the current translation unit

--

Function scope

  • Variables are instantiated when declared, and destroyed when the function returns
  • They are only accessible within their enclosing function, after their declaration
  • Each invocation of a given function will have its own version of its local variables

--

Block scope

  • Block scope means within a compound statement (e.g. within a for loop or if statement)
  • Variables are instantiated when declared, and destroyed at the end of the block
  • They are only accessible within their block, after their declaration

name: shadowing

Variable shadowing

Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes

int count = 0;   // <= global scope

...

void compute ()
{
  int count = 10;   // <= function scope
  ...

  for (int count = 0; count < 20; ++count) {    // <= block scope
    std::cout << count << "\n";
    ...
  }
  std::cout << count << "\n";
}

--

This is perfectly legal in C++!


Variable shadowing

Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes

*int count = 0;   // <= global scope

...

void compute ()
{
* int count = 10;   // <= function scope
  ...

  for (int count = 0; count < 20; ++count) {    // <= block scope
    std::cout << count << "\n";
    ...
  }
  std::cout << count << "\n";
}

This is perfectly legal in C++!

.explain-bottom[ The variable at function scope hides the variable of the same name at global scope

  • at this point, any reference to count is assumed to refer to the function scope variable
  • the global scope version of count still exists, but is no longer accessible by that name ]

Variable shadowing

Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes

int count = 0;   // <= global scope

...

void compute ()
{
* int count = 10;   // <= function scope
  ...

* for (int count = 0; count < 20; ++count) {    // <= block scope
    std::cout << count << "\n";
    ...
  }
  std::cout << count << "\n";
}

This is perfectly legal in C++!

.explain-top[ Similarly, the variable at block scope hides the variable of the same name at function scope

  • at this point, any reference to count is assumed to refer to the block scope variable
  • the function and global scope versions of count still exist, but are no longer accessible by that name ]

Variable shadowing

Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes

int count = 0;   // <= global scope

...

void compute ()
{
  int count = 10;   // <= function scope
  ...

  for (int count = 0; count < 20; ++count) {    // <= block scope
*   std::cout << count << "\n";
    ...
  }
  std::cout << count << "\n";
}

This is perfectly legal in C++!

.explain-top[ This line will print the block scope version of count ]


Variable shadowing

Variable shadowing (also called name hiding) occurs when two variables of the same name co-exist in different scopes

int count = 0;   // <= global scope

...

void compute ()
{
  int count = 10;   // <= function scope
  ...

  for (int count = 0; count < 20; ++count) {    // <= block scope
    std::cout << count << "\n";
    ...
  }
* std::cout << count << "\n";
}

This is perfectly legal in C++!

.explain-top[ ... but this line will print the function scope version of count, since it is outside the block

  • the block scope version has been destroyed by that point ]

Variable shadowing

Variable shadowing allows a function to use local variables without worrying about whether some other global variable exists elsewhere with the same name

--

... but it can also lead to subtle errors!

--

Name collisions are a general problem in computing

  • many projects will rely on functionality provided by other projects
  • difficult to avoid using the same names across different projects!

--

Most of these problems can be avoided by:

  • not using global variables in the first place! --
  • using a dedicated C++ namespace

name: namespace

class: section

C++ namespaces


What is a namespace?

A namespace is essentially a named scope for your function and variable declarations

--

We have already been using a namespace from the start: the std namespace

  • this namespace is where all the functionality provided by the C++ standard is declared

--

.columns[ .col[

std::vector
  • the vector class (a type) declared within the std namespace ] .col[
std::cout
  • a variable (actually of type std::ostream) called cout declared within the std namespace ] ]

What is a namespace?

A namespace is essentially a named scope for your function and variable declarations

We have been using a namespace from the start: the std namespace

  • this namespace is where all the functionality provided by the C++ standard is declared

.columns[ .col[

`std`::vector
  • the vector class (a type) declared within the std namespace ] .col[
`std`::cout
  • a variable (actually of type std::ostream) called cout declared within the std namespace ] ]

  • these are both declared within the std namespace


What is a namespace?

A namespace is essentially a named scope for your function and variable declarations

We have been using a namespace from the start: the std namespace

  • this namespace is where all the functionality provided by the C++ standard is declared

.columns[ .col[

std`::`vector
  • the vector class (a type) declared within the std namespace ] .col[
std`::`cout
  • a variable (actually of type std::ostream) called cout declared within the std namespace ] ]

  • these are both declared within the std namespace

  • accessing members of a namespace is done with the scope resolution operator (::)


Why would we want to use namespaces?

The main purpose of C++ namespace is:

  • to organise your code into logical units
  • to avoid name collisions

--

For example, we may very well want to declare a variable or function called count

Thankfully, it is declared within the std namespace

  • we can declare our variable or function count without the risk of name collisions with std::count()!

Declaring namespaces

Declaring variables or functions within a namespace is simply a matter of declaring them within the scope of our namespace.

--

For example:

namespace debug {

  bool verbose = false;

}

This declares our variable verbose to be within the debug namespace

  • this also implicitly declares the debug namespace if it hadn't already been encountered

Working with namespaces

Any code also declared within our debug namespace can already access our verbose variable directly.

--

When accessing our variable outside of the debug namespace, we can refer to it using its fully qualified name:

if (debug::verbose)
  ...

--

Alternatively, we can selectively bring one identifier (variable or function) into the current scope with the using declaration:

using debug::verbose;

...

if (verbose)
  ...

Working with namespaces

We can also bring all identifiers declared in a namespace into the current scope with the using namespace directive:

using namespace debug;

...

if (verbose)
  ...

--

The using namespace directive can be problematic

  • the general recommendation is to avoid its use
  • especially at global scope

--

It is much better to use the fully qualified name, or if it really helps code readability, use the selective using declaration for the specific identifiers you need

  • and only use it at function or block scope

The problem with using namespace std;

In spite of the general recommendation to avoid blanket using namespace directives, you will find that many (if not most) C++ tutorials make use of this directive from the very beginning, starting at 'hello world':

#include <iostream>

*using namespace std;

int main() {
  cout << "Hello World!\n";
  return 0;
}

--

This brings everything declared within the std namespace into the global scope

  • this helps to keep the code cleaner by avoiding the need for std:: as a prefix for std::cerr, std::vector, std::string, ...
  • it makes it a bit easier when getting started with C++

The problem with using namespace std;

So why do we not follow this convention on this course?

--

In practice, the using namespace directive is strongly discouraged

  • it bring everything from that namespace into the current scope
  • including variables or functions you might not have known about!
  • it negates the name collision benefits that C++ namespaces were meant to address!

--

It is particularly bad practice to make use of using namespace directives at global scope within header files

  • it's acceptable to use using namespace within your own cpp files
    • it would still not be considered good practice!
    • but you can manage name collisions within your own project --
  • but header files are supposed to be used by other code!
    • any such global declaration will pollute the namespace of any code that #includes your header
    • it will affect the entire translation unit from the point your header is #included

Working with namespaces

For the reasons outlined in the previous slides, we recommend to always use the fully qualified name to access members of a namespace

--

Coming back to our project, we can declare our verbose variable within a new debug namespace to avoid name collisions

namespace debug {
  bool verbose = false;
}

and refer to this variable by its fully qualified name: debug::verbose

--

But we still need to work out where to declare this variable


Working with global variables

As previously mentioned, our debug::verbose variable must be declared prior to its point of use in any translation unit where it is needed

  • this means it needs to be declared in a header file

--

Let's create a header file debug.h to hold this variable:

#pragma once

namespace debug {
  bool verbose = false;
}

--

We can now #include it in shotgun.cpp, fragments.h and overlap.cpp

--

.explain-bottom[ Exercise: implement these changes in your code ]


Working with global variables

When trying to do this, you will most likely have encountered errors of this nature:

/usr/bin/ld: fragments.o:(.bss+0x0): multiple definition of `debug::verbose'; 
    shotgun.o:(.bss+0x0): first defined here
/usr/bin/ld: overlap.o:(.bss+0x0): multiple definition of `debug::verbose'; 
    shotgun.o:(.bss+0x0): first defined here
collect2: error: ld returned 1 exit status

--

The issue is that we have declared and implicitly defined our debug::verbose variable once per translation unit

  • we now have 3 versions of the same global variable, each independently listed in the object files shotgun.o, fragments.o and overlap.o --
  • this causes an error at the linker stage, when the contents of all the compiled object files are brought together to form the final executable: which version is the right one to use?

--

There are 2 ways to get around this:

  • ensure the variable is only declared in the header, and provide a single definition in a separate cpp file
  • tell the compiler to allow multiple definitions of this variable

class: info

Working with global variables

First approach:

Use the extern keyword in debug.h to make it clear that we are only declaring our variable at this point:

#pragma once

namespace debug {
  `extern` bool verbose;
}

--

and place the full definition and initialisation in a new file debug.cpp:

#include "debug.h"

namespace debug {
  bool verbose = false;
}

name: inline_variable

Working with global variables

Second approach (much simpler):

Use the inline keyword in our debug.h header file to allow multiple definitions of the same variable to be provided across multiple translation units:

#pragma once

namespace debug {
  `inline` bool verbose = false;
}

We do not need an additional cpp file in this case, and the initial value can now be specified at the same time

--

Note that this use of the inline keyword was introduced in the C++17 version of the standard

--

.explain-bottom[ Let's go ahead and use the inline keyword in our project ]


Enabling the verbose option

We now have all the pieces required to implement our verbose option

  • we add the debug.h header file as per the previous slide --
  • in main():
    • we use the std::erase() call to detect and handle the -v option
    • and set debug::verbose to true if detected
    • we do this before any attempt at using the command-line arguments --
  • everywhere else:
    • #include "debug.h" so the debug::verbose variable is declared at the point of use
    • check if debug::verbose is true, and if so print out appropriate information

--

For example, in fragments.cpp:

...
std::vector<std::string> load_fragments (const std::string& filename)
{
* if (debug::verbose)
    std::cerr << "reading fragments from file \"" << filename << "\"...\n";
  ...

--

.explain-bottomright[ Exercise: implement these changes in your own code ]


Using the verbose option

We can now run our code as normal:

$ ./shotgun ../data/fragments-1.txt out
initial sequence has size 1000
final sequence has length 20108

which provides a clean output, showing just as much information as we might expect when everything works correctly


Using the verbose option

... or we can run it with the verbose option to see exactly what is going on:

$ ./shotgun ../data/fragments-1.txt out `-v`
reading fragments from file "../data/fragments-1.txt"...
read 190 fragments
190 fragments, mean fragment length: 529.158, range [ 51 1000 ]
initial sequence has size 1000
---------------------------------------------------
189 fragments left
fragment with biggest overlap is at index 41, overlap = -824
---------------------------------------------------

...

fragment with biggest overlap is at index 37, overlap = 51
---------------------------------------------------
104 fragments left
104 fragments remaining unmatched - checking whether already contained in sequence...
final sequence has length 20108
writing sequence to file "out"...

Using the verbose option

The availability of a verbose option allows you and your users to quickly narrow down problems

  • the default output can be kept clean and minimal

  • when issues occur, running in verbose mode provide useful information:

    • in the case of crashes, it provides a more precise indication of when the problem occurred (as indicated by the last message), and hence where to look in the code to figure out the problem
    • in the case of unexpected output, the information reported may point out issues with the input data, or provide clues as to where to look for errors in the code
  • when there are performance issues, verbose mode can provide an indication of which stage is taking longer to complete than expected

  • ...

--

In general, it is recommended to provide some mechanism to allow users to get detailed information about the processing, whether using a verbose option, or by way of a separate log file


Adding convenience functions for debugging

Now that we have the option to provide debugging output, we can wrap up this functionality more cleanly to avoid having lots of if (debug::verbose) statements throughout our code

Let's implement a debug::log() function, which will take a std::string as its argument, and print it to the standard error stream if verbose mode is enabled

  • it can additionally format its output to more clearly label it as debugging information (for example, by prefixing each line with [DEBUG])

--

.explain-bottom[ Exercise: implement this function and make use of it in your own code ]


Adding convenience functions for debugging

In debug.h:

#pragma once

*include <iostream>
*include <string>

namespace debug {
  inline bool verbose = false;

* inline void log (const std::string& message) {
*   if (verbose)
*     std::cerr << "[DEBUG] " << message << "\n";
* }
}

name: inline_function

Adding convenience functions for debugging

In debug.h:

#pragma once

include <iostream>
include <string>

namespace debug {
  inline bool verbose = false;

  `inline` void log (const std::string& message) {
    if (verbose)
      std::cerr << "[DEBUG] " << message << "\n";
  }
}

Note the use of the inline keyword here, this time to allow multiple definitions of the same function


name: inline_keyword

A word about inline

Historically, the inline keyword was considered more as a performance optimisation feature

  • it was used for functions only, to indicate that a given function was suitable for inlining
  • this would suggest to the compiler that the required code could be substituted directly at the point of use, rather than calling that function explicitly
  • There is always a small overhead to calling a function, and for functions that perform small operations and are called very often, this can impact performance
  • declaring such functions as inline can help avoid the overhead of dedicated function calls, and so improve performance
    • this is only possible if the function definition is available, which is why the inline mechanism was introduced to allow definitions to be included in header files without causing the multiple definition problem
  • For further details, refer to other online sources (e.g. GeeksForGeeks)

--

In modern C++ (particularly since C++17), the inline keyword is used to allow multiple definitions of functions or variables across multiple translation units

  • although it does allow the compiler to perform the kinds of optimisation mentioned above, the compiler is free to treat this merely as a hint

Adding convenience functions for debugging

We can now use our convenience function in our code, for example in fragments.cpp:

...
std::vector<std::string> load_fragments (const std::string& filename)
{
* if (debug::verbose)
*   std::cerr << "reading fragments from file \"" << filename << "\"...\n";
  ...

becomes simply:

...
std::vector<std::string> load_fragments (const std::string& filename)
{
* debug::log ("reading fragments from file \"" + filename + "\"...");
  ...

-- .explain-top[ Note the use of the + string concatenation operator instead of the stream insertion operator (<<)


This is because our `debug::log()` function expects a single `std::string` argument ]

Adding convenience functions for debugging

How do we handle this case?

  if (debug::verbose)
    std::cerr << "read " << fragments.size() << " fragments\n";

--

We can't write:

  debug::log ("read " + fragments.size() + " fragments");

since fragment.size() returns an int, and we can't concatenate a std::string with an int!

--

Use std::to_string(), which returns a std::string representation of a variable:

  debug::log ("read " + std::to_string (fragments.size()) + " fragments");

-- .explain-bottom[ Exercise: implement these changes in your own code ]


class: section name: exercises

Exercises


Exercise 1