
In the video you saw, the narrator mentioned the idea of looking for “sequence motifs.” A sequence motif is a particular string of letters whose pattern is repeated at least once in a long string. For example, suppose we have the following DNA string:

AACCGGTACTCCGGATTTTCAAGCCGGATAT

The string A occurs eight times. (To keep things simple, we define a string to be a sequence of one or more letters, so a single A also counts as a string.) We could therefore call A a sequence motif that is found eight times in the above string. However, this is not really an interesting motif; we expect many occurrences of the single letter A just by chance. We're interested in longer motifs that are unlikely to happen by chance. For example, consider the string CCGG. It occurs three times in the above string, as marked with square brackets below:

AA[CCGG]TACT[CCGG]ATTTTCAAG[CCGG]ATAT

This is an interesting motif because we don’t expect this to happen frequently just by chance. We would call CCGG a four-letter motif, because it occurs more than once and has four letters.
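
To make "by chance" a little more concrete, here is a minimal Python sketch that counts how often a motif occurs and compares that with a rough expected count. The function names count_occurrences and expected_by_chance are made up for this illustration, and the expected count assumes that every position is independent and that A, C, G, and T are equally likely, which real DNA only approximates.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1)
               if sequence[i:i + k] == motif)


def expected_by_chance(sequence_length, motif_length):
    """Rough expected count of one specific motif.

    Assumption (not from the lesson): bases are independent and
    A, C, G, T are equally likely, so a given motif of length k
    matches at any one position with probability 1 / 4**k.
    """
    positions = sequence_length - motif_length + 1
    return positions / 4 ** motif_length


dna = "AACCGGTACTCCGGATTTTCAAGCCGGATAT"
for motif in ("A", "CCGG"):
    observed = count_occurrences(dna, motif)
    expected = expected_by_chance(len(dna), len(motif))
    print(motif, observed, round(expected, 2))
# prints:
# A 8 7.75
# CCGG 3 0.11
```

Under this simplified model, eight occurrences of A is about what chance predicts, while three occurrences of CCGG is far more than the roughly 0.11 expected. The sketch scans position by position rather than using Python's built-in str.count, because str.count only counts non-overlapping occurrences.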

Sometimes strings have multiple motifs. The string below contains two four-letter motifs that occur twice each. Can you identify them?

TTCTTGAGATAGACCCGCTTGTCTACATTCGCT

ANSWER -- SPOILER ALERT!

CTTG: TT[CTTG]AGATAGACCCG[CTTG]TCTACATTCGCT

CGCT: TTCTTGAGATAGACC[CGCT]TGTCTACATT[CGCT]
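
If you would like to check an answer like this by computer, one possible approach is to slide a window of four letters along the string and record where each four-letter substring starts. The sketch below does that in Python; the function name find_repeated_kmers is hypothetical, and positions are reported counting from 1.

```python
from collections import defaultdict


def find_repeated_kmers(sequence, k):
    """Return {substring: [start positions]} for every length-k
    substring that occurs more than once. Positions are 1-based."""
    positions = defaultdict(list)
    # Slide a window of width k along the sequence.
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i + 1)
    # Keep only the substrings seen at two or more positions.
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) > 1}


dna = "TTCTTGAGATAGACCCGCTTGTCTACATTCGCT"
print(find_repeated_kmers(dna, 4))
# prints: {'CTTG': [3, 18], 'CGCT': [16, 30]}
```

The same function can be called with a different value of k to look for shorter or longer motifs.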


We're interested in strings that occur frequently because they often encode information for the organism. For example, many genes contain the motif “TATAAA” (the TATA box) in their promoters, and this motif helps specify where mRNA transcription of that gene starts. That is, our idea is “if some pattern is found more frequently than expected, there is likely some functional reason.”

Problem: Find all four-letter motifs in the following string.