#Use this code chunk to run your generate_k_mers function
Finding the Replication Origin, Part II
Preliminary Thoughts
Welcome back! We’ve done a lot of work and made quite a bit of headway so far. As a reminder, our goal through this first chapter is to identify the location of the replication origin within a genome. Up to this point we’ve had a crash course in computing, encountering looping mechanisms (for
loops), programmatic flow control (if
/else if
/else
statements), and developed an ability to write reusable code blocks in the form of functions. We’ve applied all of these techniques to help us in analysing genomes (mostly fake/random genomes that we’ve generated, but we’ll get to real bacterial genomes soon). Our work thus far has culminated in a pair of functions generate_k_mers()
, which will generate all strings of k
consecutive nucleotides within a genome, and count_patterns()
which will count the number of occurrences of a particular pattern
within a genome
. Let’s pick back up from here.
Open RStudio and open the project that is managing your group repository. Open the Quarto Notebook you were working on last time and re-run the code cells which build the generate_k_mers()
and count_pattern()
functions – you can simply run all of the code chunks if you’d like. You’ll be adding on to that notebook today.
Towards Identification of the Replication Origin
Remember from General Biology that DNA replication begins at a certain position in the genome called the origin. The origin consists of a specific set of DNA sequences that signal to the cell’s machinery the correct position to start replication. In bacteria, one of the proteins that helps with replication is called DnaA. This protein binds to sequences in the genome called DnaA boxes, which are strings of 9 nucleotides that are repeated more frequently around the origin of replication.
Armed with the knowledge that these DnaA boxes appear more frequently than one would expect a random sequence of 9 nucleotides to appear, we know that we are searching for frequently occurring 9-mers. Can you combine your generate_k_mers()
and count_patterns()
functions into a new function called generate_frequent_k_mers()
that will build a list of the most frequent k
-mers? Your function should take two parameters, genomeString
and k
.
Challenge 1: Find the most frequent words in a string
Input: A DNA string (
genomeString
) and an integerk
Output: A list containing the most frequentk
-mers ingenomeString
. Note that your list should contain eachk
-mer at most once.
Hint: Use theunique()
function on your list ofk
-mers before returning the object.
To begin this challenge, let’s practice with a small genome . Copy over your generate_k_mers
function and run it on the sequence ACACAGACATCCCACCCC
and consider 3-mers.
Your function should return a list of several 3-mers, but you should notice that some 3-mers (ACA
and CCC
) are repeated. These are the 3-mers that we want to count!
Now, with your group think about how you will combine your generate_k_mers
function with your count_ pattern
function so that it will count and return the most frequent k
-mers. Before you get coding, talk with a group and generate a list of what you will need in this new function, generate_frequent_kmers
. Think about:
How can you use your existing
generate_k_mers
function to generate a list ofk
-mers of lengthk
from agenomeString
? (Hint: you may need to store this as anobject
for yourgenerate_frequent_kmers
function.)How can you use your
count_pattern
function to count this list ofk
-mers that you generated? (Hint: it may involve afor
loop!).How will you tell your function to return only the most frequent patterns? (Hint: think about the objects that you will need to define in your function, or what the function will be counting and returning. This is where the functions
unique()
andmax()
may be useful.)
Think you’ve got it? Use the code chunk below to complete Challenge 1.
#Complete the challenge here
Once you’re confident that your function works, head back over to Rosalind and try solving the Frequent Words Problem. Test your code on the sample data set first – if it works, click the link to download the challenge data set and see if you can generate the correct output. As a reminder, the code to read the data into R is below.
<- scan("PATH_TO_FILE", what = character())
data
#The genomeString
<- data[1]
genomeString
genomeString
#The value of k
<- as.integer(data[2])
k k
To automatically generate the output in the format Rosalind wants it (all k-mers on one line, each separated by a space), you can use the following.
<- generate_frequent_k_mers(genomeString, k)
freq_k_mers
noquote(freq_k_mers)
The code stores the list of most frequent k-mers into the object freq_k_mers and then says that we should print each of these k-mers out with no quotation marks around them.
Great work. Now that we are feeling good about ourselves, things are going to get messy. Unfortunately the nucleotides in these DnaA boxes we are looking for, don’t always appear in the same way. For example, just like humans make mistakes while typing, DNA polymerase can make mistakes while it replicates the genome – what if one of the base pairs wasn’t copied correctly during the previous round of replication? Additionally, the replication signal we are looking for may appear as both the k-mer signaling that replication should begin, or as its reverse complement. Because of this, we’ll need to build code that is flexible enough to count reverse complements as well as near k-mers (with some pre-defined number of discrepancies allowed). Let’s get on it.
We’ll start with the reverse complement problem. Recall from gen-bio that A and T are complementary base pairs, as well as G and C. This means that in double-stranded DNA, we should see across from each Adenine (A) a Thymine (T), across from each Cytosine (C) a Guanine (G), and vice-versa. This means that if we look across from a segment of DNA, we can see the reverse complement of that segment. The “reverse” refers to the fact that the complementary sequence is also anti-parallel, meaning 5’ and 3’ ends are oriented in opposite directions on each side of the double strand. Thus, we read the complementary sequence in reverse order. For example, the reverse complement of ACCTGA
is TCAGGT
.
Let’s write a function to compute the reverse_complement()
of a genomeSubString
. While you’re at it, solve your third Rosalind challenge.
Challenge 2: The Reverse-Complement Problem
Input: A pattern of nucleotides called
genomeSubString
Output: The reverse-complement ofgenomeSubString
Like you did with Challenge 1, think about what your function will need to do and what existing tools you have to use. Talk through this with your group before you begin. Here are some hints:
You might need the function
rev()
. Pull up yourrandgenome
function and test out howrev
() works on a genome that you generate. How is this useful to the reverse complement problem? How can you build this into your function?To get the complement of your genome, you may need to use
if/else
statements. Review what you know about these and think about how you need to use them in yourreverse_complement
function.
Think you got it? Test out your code below.
#Complete the challenge here
Summary and Debrief
We’ve made quite a bit of headway! We’re able to search a genome for frequent “near”-k-mers and their reverse-complements. These are the genetic substrings which will signal replication to the DNA polymerase. Next time, we’ll look at narrowing down the area we should be looking in for the replication origin. We’ll exploit properties of genetic transcription and mutation to solve this.