Books with Letters

Special techniques can reduce this problem of the colors becoming harder to tell apart as the synthesized string grows longer, but those techniques have problems of their own. So, typically, we can read only about 150 nucleotides at a time.

This means that the human genome, with its 3 billion letters, has to be read in short stretches. This is like trying to read a book whose pages have all been ripped out and torn into pieces. We have to somehow put the book back together to read the story.

 

The process is something like the following: Suppose we have a string like this (pretend this is the whole genome string):

 

ACCGTCGATCGATCGATCGACGATTCGTC

 

This genome string is cut up into small pieces because the NGS machine can only accurately sequence small strings:

 

ACCGTC  GATC  GATCGAT  CGAC  GATTCGTC

 

Each of the above short strings is sequenced by the NGS machine and reported to us. But because the machine has no idea which short string follows which, the results may come back out of order:

 

GATC

ACCGTC

CGAC

GATTCGTC

GATCGAT

 

We need to paste the strings back together to recover the original string, but how do we know the order in which to put them together? As you can see, the problem is hard. If I simply paste the little strings above end to end, in the order they were reported, I get:

 

GATCACCGTCCGACGATTCGTCGATCGAT

 

This is different from the string we started with because the little strings were not pasted in the order in which they occur in the original string.
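The mismatch can be checked directly. The following is a minimal Python sketch using the toy strings from this example; the variable names are illustrative only.

```python
# The toy genome and the reads in the order the machine reported them.
genome = "ACCGTCGATCGATCGATCGACGATTCGTC"
reported = ["GATC", "ACCGTC", "CGAC", "GATTCGTC", "GATCGAT"]

# Paste the reads end to end in the reported order.
pasted = "".join(reported)

print(pasted)                            # GATCACCGTCCGACGATTCGTCGATCGAT
print(pasted == genome)                  # False: the order is wrong
print(sorted(pasted) == sorted(genome))  # True: same letters, different order
```

The pasted string contains exactly the same letters as the genome, so nothing was lost; only the order of the pieces is wrong.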

 

We can figure out the original order of the pieces if we use the following approach. Multiple copies of the original string are cut into pieces at different places:

 

Original genome string: ACCGTCGATCGATCGATCGACGATTCGTC

 

Cut 1: ACCGTC  GATC  GATCGAT  CGAC  GATTCGTC

Cut 2: ACCG  TCGA  TCGATC  GATCG  ACGATTCGTC
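The two cuts can be mimicked in a few lines of Python. This is only a sketch: the helper `cut_at` and the cut positions are assumptions chosen to reproduce the pieces shown above, not a description of how a real sample is fragmented.

```python
def cut_at(genome, cut_points):
    """Split genome into consecutive pieces at the given 0-based positions."""
    bounds = [0] + sorted(cut_points) + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]

genome = "ACCGTCGATCGATCGATCGACGATTCGTC"

# Two copies of the same string, cut at different (assumed) positions.
cut_1 = cut_at(genome, [6, 10, 17, 21])
cut_2 = cut_at(genome, [4, 8, 14, 19])

print(cut_1)  # ['ACCGTC', 'GATC', 'GATCGAT', 'CGAC', 'GATTCGTC']
print(cut_2)  # ['ACCG', 'TCGA', 'TCGATC', 'GATCG', 'ACGATTCGTC']
```

Because the cut positions differ, a piece from one cut can straddle a boundary between two pieces of the other cut.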

 

Now, say the NGS machine reports the sequences as:

 

Sequences from Cut 1:

GATC

ACCGTC

CGAC

GATCGAT

GATTCGTC

 

Sequences from Cut 2:

TCGA

ACCG

TCGATC

ACGATTCGTC

GATCG

 

Even if these little strings are out of order, we can use the information from the different cuts of the same string to glue them together. For example, we know that the strings GATC and GATCGAT (in Cut 1) can be glued together, in that order, because we also have the string TCGATC (in Cut 2), which spans the junction between the two:

 

GATCGATCGAT

--TCGATC---
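The gluing test above can be sketched in Python. The function `bridge_offset` is a hypothetical helper written for this example: it asks whether a read from one cut appears in the pasted pair of strings at a position that crosses the junction between them. Real assembly programs are far more elaborate, so treat this as an illustration of the idea only.

```python
def bridge_offset(left, right, read):
    """Return where `read` starts in left + right if it crosses the junction, else None."""
    joined = left + right
    start = joined.find(read)
    while start != -1:
        end = start + len(read)
        # The read must begin inside `left` and end inside `right`.
        if start < len(left) < end:
            return start
        start = joined.find(read, start + 1)
    return None

left, right, read = "GATC", "GATCGAT", "TCGATC"
offset = bridge_offset(left, right, read)
joined = left + right

print(joined)                                                       # GATCGATCGAT
print("-" * offset + read + "-" * (len(joined) - offset - len(read)))  # --TCGATC---
print(bridge_offset(right, left, read))                             # None
```

Swapping the two strings gives no match, which is how the read from Cut 2 tells us that GATC comes before GATCGAT and not after it.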

 

This is like putting together a jigsaw puzzle. Now imagine doing this for millions of little strings. That would be impossible for humans to do by hand, so we use computers and computer algorithms, as we discuss in the two bioinformatics modules.

 

Why sequence genomes? Once we have the full genome pieced together, we can try to identify variations in the sequence that might explain diseases or interesting traits. This can be done for humans or for any other type of organism.