After sequencing our RNA sample, we have a list of short nucleotide sequences, each 100 to 200 letters, or base pairs (bp), long, that we call sequencing reads. But we still don't know which RNA molecules or genes these reads originated from. Without this information, we cannot say anything about which genes are expressed in our sample.
This is like trying to reconstruct a library after all the books have been run through a wood chipper. You have a bunch of pieces of paper with partial sentences on them, but you don't know which books they originally came from.
Luckily, we have two useful tools to help us solve this problem:
- The entire genome sequence of the organism from which we collected our RNA samples (called the reference genome).
- Computers to do all of the hard work for us.
To return to the wood chipper analogy, this is like having a computer with the full text of every book in the library (the reference genome). This computer has a search feature that lets you type in a partial sentence, and it tells you which books contain that sentence (the locations in the genome that a sequencing read could have come from).
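To make the search idea concrete, here is a minimal Python sketch of the simplest possible version: looking up each read as an exact substring of a reference. The function name `find_read_locations`, the sequence names `chr1` and `chr2`, and all of the sequences are made up for illustration. The sketch also assumes reads match the reference perfectly and come from the forward strand; real alignment software has to tolerate mismatches and uses a prebuilt index rather than scanning the whole genome for every read.

```python
def find_read_locations(read, reference):
    """Return every (sequence_name, 0-based position) where `read` occurs exactly.

    `reference` is a dict mapping a sequence name to its nucleotide string.
    This is a toy exact-match search, not how production aligners work.
    """
    hits = []
    for name, sequence in reference.items():
        start = sequence.find(read)
        while start != -1:
            hits.append((name, start))
            # Keep searching one position further so overlapping hits are found.
            start = sequence.find(read, start + 1)
    return hits


# Hypothetical miniature "reference genome" (invented sequences).
toy_reference = {
    "chr1": "ACGTACGTTAGCCGATAGGCTA",
    "chr2": "TTGACCGATAGGAAACGT",
}

# Hypothetical reads, far shorter than real 100 to 200 bp reads.
toy_reads = ["CCGATAGG", "ACGT", "GGGGGG"]

for read in toy_reads:
    print(read, find_read_locations(read, toy_reference))
```

Note that a read can match more than one location, or none at all, just as a partial sentence might appear in several books or in no book in the library.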