Searching through 3 Billion Letters | Discovering the Genome

Computers can...

Digitally store all that data—there is no way we can keep physical copies of so much information.
Search through each dataset for information.
Compare the datasets to each other.
Compare new data to existing curated datasets.
Find interesting patterns and visualize them.
Use the information to design new drugs.
Use the information to trace evolutionary history.
Use the information to trace the spread of infectious diseases.
Use the information to find out the genetic bases of human diseases.

and many more things that society finds intellectually interesting and/or useful.

To understand the information contained in our genomes, scientists write computer programs that turn raw data into usable information. For example, suppose I have your genome. It comprises about 3 billion letters, each letter being either A, C, G, or T. It looks something like this:

“ACGACATCTACTTTCATCGGCGCGGCGGCATAT
ATCGAGCATCGGCGAGCGCGAGCGGCTTACAAAAA…”
and so on for another 3 billion letters.

One of the most common things to do is search through this string of letters for meaningful information. For example, we might want to ask where in this string is the “code for your left index finger.” But, this is a hard problem—like trying to answer this question:

“Where in the Harry Potter book is the passage describing Harry’s anxiety about dark forces?”

Obviously, this isn’t a problem a human could easily solve, never mind a computer. But if I ask you to find all the places where the name “Voldemort” is mentioned, that is not too hard; it is just tedious and time-consuming.

This is where computers and algorithms for solving problems come in. Computers are great at tedious tasks that require speedy solutions; we just need to instruct them on how to solve the problem—that is, program the computer with the right algorithm.
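To make the "find every mention" task concrete, here is a minimal sketch in Python of one simple way to instruct a computer to do it: slide a window across the text and report every position where the search word matches. The function name `find_all` and the short search word "CGG" are illustrative assumptions, not part of the exercise, and the DNA fragment is just the first line of the example genome shown earlier.

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    Overlapping occurrences are included. This is a naive scan, chosen
    for clarity rather than speed.
    """
    positions = []
    if not pattern:
        return positions  # an empty search word matches nothing useful
    # Slide a window of len(pattern) letters across the text, one step at a time.
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            positions.append(start)
    return positions


# Hypothetical inputs: a fragment of the example genome and a sentence of text.
genome_fragment = "ACGACATCTACTTTCATCGGCGCGGCGGCATAT"
print(find_all(genome_fragment, "CGG"))

sentence = "Voldemort returned, and no one dared say Voldemort aloud."
print(find_all(sentence, "Voldemort"))
```

The same function works on a sentence and on a DNA string because both are just sequences of letters. For a full genome of 3 billion letters, a scan this simple would be slow, which is one reason the choice of algorithm matters.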

With the above ideas in mind, in this exercise we want to cover two concepts:

What is an algorithm, and how do we describe an algorithm for some problem?
How do we come up with an algorithm to solve a bioinformatics problem?