Finding the Replication Origin, Part I

Author

Dr. Gilbert and Dr. Duryea

Preliminary Thoughts

Welcome back! Today we begin to tackle our first real bioinformatics application. We’ll be seeking out the section of the genome which signals for replication of the genetic code. We’ll continue using short, randomly generated strings of nucleotides to help us build and test code but then we’ll apply it to the Vibrio Cholerae genome (and other bacterial genomes) to obtain real results.

Now, open up RStudio, open the project that you built and connected to your group repository, and open up your Replication_<YOURNAME>.qmd file. Like last time, you’ll be completing the tasks outlined here in that notebook. Be sure to add your own discussion throughout, rather than simply using your notebook as a “code-container”. Your groupmates and your future self will thank you.

Throughout the activity you’ll be making progress on the Challenge tasks outlined in this ReplicationOrigin_PartI notebook. You are encouraged to Push your changes out to the “origin” repo, and to Pull in the changes made by your group-mates often. Your code will help them make progress, and their code will help you. When looking at your group-mates notebooks, please open their *.html files – do not open and edit their *.qmd file. This will lead to badness that we don’t want to encounter yet.

Notebook Objectives

We’ll be building on our previous work in the following ways:

  • Learn to eliminate “single-use code” by encapsulating code in functions which can be easily applied to different genomes.
  • Think about, and implement, techniques which will lead to efficient code.
  • Expand your ability to count the frequency of each individual nucleotide within a genome to an ability to count the appearance of patterns in the genome.
  • Build and execute functions to find “hidden messages” within the genome by identifying patterns appearing much more often than they would be expected to if a genome included nucleotides at random.
  • Exploit transcription errors to narrow a search region for the replication origin.

The Search for the Replication Origin

Being able to count frequencies of individual nucleotides is great, but it gives us a really limited view of the genome at a truly global scale. We’re likely to obtain more valuable information if we can count frequencies of sequences of nucleotides within a genome rather than the individual nucleotides. That will be our objective through this section.

Remember, during the process of DNA replication the proteins that help with the process must locate the origin of replication, or the specific part of the genome where replication begins. Thus, if we develop a program that can locate specific patterns in the genome, this could help us find the origin in the same way the proteins would locate it to begin replication.

Before we get to writing a program that can do this, it is worth discussing reusable code. Up to this point, all the code we’ve written has been used once and then discarded. It’s time for some recycling – no more single-use code! Let’s talk about building functions in R that can be used over and over again.

I’ll build a simple function below that takes in two arguments, a string (genomeString) and a nucleotide (nucleotide). The result of the function will be the frequency of the nucleotide within the genomeString. Then we’ll use the function to count the frequency of Guanine in a particular string of nucleotides.

nucleotide_frequency <- function(genomeString, nucleotide = "A"){
  count <- 0
  for(i in 1:nchar(genomeString)){
    if(str_sub(genomeString, start = i, end = i) == nucleotide){
      count <- count + 1
    }
  }
  return(count)
}

nucleotide_frequency("ACTTGCGGGTATCGAG", "G")
[1] 6

Notice that we define a function with the function keyword and provide the arguments/parameters it expects inside of parentheses. Functions can have any number of arguments and we can even supply default values if we wish (for example, I set the default for the nucleotide parameter to be A). Similar to the for loop and if statement, the body of the function appears between curly braces {} – the indentation is not necessary in R, but it does improve readability. The body of the function contains the instructions to be run when the function is called, and the body ends with a return() statement, containing the object that is to be returned after the function has been applied to its inputs. The function is stored using the arrow operator (<-) and the object containing the function defines the function’s name. The best part about functions is that we can use them over and over again!


Challenge 1: Use the sample() function in conjunction with the paste(..., collapse = "") method to generate a random genome of length at least 2000. Use the nucleotide_frequency() function to count the frequency of Cytosine in the random genome you constructed. Are you surprised by the result? As a follow-up, what happens if you don’t provide a second argument to the nucleotide_frequency() function? What is being counted in this case?


#Complete the challenge here.

Okay, good work! We’ve seen how useful it can be to generate random genomes of various lengths. Let’s make a function that we can reuse over and over again to generate a random genome of whatever size we would like.


Challenge 2: Build a function rand_genome() which takes a single parameter k, denoting the number of nucleotides in the genome we wish to generate. Your function should return a single genome string of length k.


#Complete the challenge here

Now that we’ve seen how to construct and use functions, let’s get back to the task at hand – identifying the replication origin within the genome. In our earlier notebooks (and again, here) you learned how to generate your own random genome substring and you’ve built up techniques for accessing and analysing the substring one nucleotide at a time. What if we wanted to consider three-nucleotide sequences rather than just a single nucleotide?

Before we can move forward, it is useful to point out that str_sub() can be used to extract consecutive characters within a string, between its start and end parameters. Up to this point we had just been setting start and end to the same value.

As an example, we can extract the first two and last two letters in the name "Gilbert" using the code below.

myString <- "Gilbert"

str_sub(myString, start = 1, end = 2)
[1] "Gi"
str_sub(myString, start = -2, end = -1)
[1] "rt"

Using negative integers to indicate start and/or end positions tells str_sub() that it should start counting from the end of the string rather than the beginning. We could have also used str_sub(myString, start = 6, end = 7) to get the last two letters, but that requires knowing how long myString is ahead of time!

If, for some reason, we wanted to store these sets of consecutive characters into an object for use later on, we can do so with a list. Here’s how we might do that.

#Initialize an empty vector
myList <- c()

#Append (add to the end) the first set of two characters to myList
myList <- myList %>%
  append(str_sub(myString, start = 1, end = 2))
#Append the second set of two characters to myList
myList <- myList %>%
  append(str_sub(myString, start = -2, end = -1))

#Print out the contents of myList
myList
[1] "Gi" "rt"

Now, let’s combine these principles with what we know about loops to make a simple for loop that will “walk” through our string to return consecutive characters. This is useful because most genomes will be very long, unlike Dr. Gilbert’s name.

For example, let’s imagine we wanted our loop to go through Dr. Gilbert’s name and store each pair of 2 consecutive characters in a list. We can call these pairs “2-mers”. Here is how we might do that:

generate_2_mers <- function(myString) {
  list_2_mers <- c()

  for(i in 1:(nchar(myString) - 1)){
  list_2_mers <- list_2_mers %>%
  append(str_sub(myString, start = i, end = i + 1))
    }
  return(list_2_mers)
}

generate_2_mers(myString)
[1] "Gi" "il" "lb" "be" "er" "rt"

Now you are ready to take on a challenge. See if you can adapt the loop above to walk through a longer string (a random genome) and store consecutive substrings of 3 nucleotides.


Challenge 3: Build a function called generate_3_mers() to generate all of the substrings of 3 nucleotides (we’ll call these 3-mers) in a genome string. Your function should accept a single argument gemomeString and return a list containing all of the 3-mers in genomeString (including repeats). Once you’ve built your function, use the rand_genome() function you built earlier, to construct a random genome of length at least 2000 nucleotides – then use generate_3_mers() to compute a list of all the 3-mers in your random genome.


#Complete the challenge here

Did your function collect a list of 3-mers? How many 3-mers did your function produce? Did you produce only 3-mers or did you end up with a 2-mer and single nucleotide as your final elements? To answer this question, use the built-in tail() function on your list of generated 3-mers. Go back and adjust your code so that the loop inside your function terminates before it runs out of characters in your string.

Question: If you have a genome string of length n nucleotides. How many k-mers will it contain? At what position will the final k-mer start? Remember that R starts counting from position 1.

Armed with the answer to that question, update your generate_3_mers() function to a more flexible version called generate_k_mers() as described in the challenge below.


Challenge 4: Create an updated version of the generate_3_mers() function so that it can generate k-mers of any length we desire (provided that length is greater than 0 and less than the size of the genomeString we begin with). This generate_k_mers() function should take two parameters – genomeString and k, where k controls the number of nucleotides in each k-mer. Similar to generate_3_mers(), this function should return a list of all k-mers in genomeString (including repeats).


#Complete the challenge here

Finally, let’s see if we can create function to find specific patterns. This will be useful in finding the particular combination of nucleotides that are found at the origin of replication.

To model how we might do this, let’s imagine we wanted to create a function that went through Dr. Gilbert’s name and counted the number of times the pattern “be” appeared. We could combine our knowledge of loops, counter variables, and functions to do this:

count_pattern <- function(myString, pattern){
  count <- 0
  for(i in 1:nchar(myString)){
    if(str_sub(myString, start = i, end = i+1) == pattern){
      count = count + 1
    }
  }
  return(count)
}
  
count_pattern(myString, "be")
[1] 1

Now that you know about functions, you are nearly ready to tackle your second Rosalind problem. Take on the challenge below and then we’ll apply your solution to a much larger-scale problem.


Challenge 5: Write a new function, called count_pattern() that will count occurrences of a particular pattern within a larger genomeString. Your function should take two arguments, genomeString and the pattern we want to count occurrences of. Your function should return the count of occurrences of the pattern within the genomeString.


#Complete the challenge here

That was a tough challenge – did you build a function that accomplishes what you intended it to? How do you know?

Once you’ve confirmed that your count_pattern() function works as intended, proceed to the final challenge of this notebook, below.


Challenge 5: Navigate to this challenge problem on the Rosalind site. Make sure that your code solves the sample problem as suggested. Once you’ve solved the sample problem, try solving the challenge by clicking the blue Download dataset button – you’ll have 5 minutes to upload a solution.

  • The dataset will be pulled into your downloads folder.
  • Use the scan() function with the file-path as the first argument, and pass a second argument what = character() to read the dataset into R. Store the result in an object called data.
  • The result will be a list of two elements. The item in data[1] will be the genomeString, while the item in data[2] is the pattern.
  • Use your function to count the number of occurrences of the pattern in the genome string and submit the result to Rosalind!

#Complete the challenge here

Summary and Debrief

Yet again, you’ve accomplished quite a bit. You’ve learned to use functions to write reusable code. You’ve expanded your ability to work with the genome and are now able to list patterns of a specific length, and to count the number of occurrences of a particular pattern within a genome. These two tasks form the foundational building blocks for identifying the Replication Origin! See you next time.