Bioinformatics - The Reverse Complement
What is DNA?
DNA is your (and every living thing's) genetic code. It has instructions about how to produce proteins that your body will use to digest nutrients, react to external stimuli, have back acne, etc.
DNA stands for deoxyribonucleic acid, a molecule built from many copies of a smaller molecular unit, the nucleotide. A molecule
of DNA consists of two strands of nucleotides, annealed together.
We will explore later how DNA dictates which proteins are formed. For now, let's learn the
basics of DNA.
Nucleotides
Nucleotides make up a sequence of DNA. This unique sequence is what makes you different from
Jeff and Jeff different from his mother. The uniqueness of your DNA sequence (about 3 billion
nucleotides long) is enough to distinguish you from virtually any other human. While the DNA sequences of any two humans
typically differ by only a fraction of a percent, that small difference is responsible for
all sorts of diversity in the human population. For example, height, eye color, hair color, nose shape, and skin tone are all strongly influenced by genetics.
Changes in this sequence across generations are what drive evolution.
Nucleotides come in four flavors: adenine, guanine, cytosine, and thymine, or A, G, C, and T for short.
In bioinformatics problems, we would call "AGCT" our alphabet.
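Treating "AGCT" as an alphabet makes it easy to check whether a string is a plausible DNA sequence. The snippet below is a minimal sketch of that idea; the names `DNA_ALPHABET` and `is_valid_dna` are made up for this illustration, and it assumes uppercase input with no ambiguity codes.

```python
# Hypothetical names for illustration; assumes uppercase A, G, C, T only.
DNA_ALPHABET = frozenset("AGCT")


def is_valid_dna(sequence):
    """Return True if every character of `sequence` is in the DNA alphabet."""
    return all(base in DNA_ALPHABET for base in sequence)


print(is_valid_dna("ATGCTAGTA"))  # True
print(is_valid_dna("ATGXTA"))     # False, because X is not a nucleotide
```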
Complementary Strands
When biologists record the DNA sequence of an organism (done with next-generation sequencing technology),
it is stored as a string of letters: ATGCTAGTA. But this neglects that the molecule is double-stranded.
How could they leave out the information on the other strand? The answer is that the sequence of one strand can be inferred from the other.
DNA strands only anneal together if they are complementary, because only certain nucleotides fit together based
on their molecular structure. Specifically, A pairs only with T, and G pairs only with C.
This property of DNA allows your cells to make copies of their DNA and pass it on to a newly generated cell.
For example, consider the DNA molecule:
ATGGCATGA
TACCGTACT
Notice that A's are always across from T's, and G's are always across from C's.
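The pairing rule can be written down as a small lookup table. This is only a sketch: `PAIRS` and `complement` are names invented for this example, and the function assumes its input contains only uppercase A, T, G, and C.

```python
# Hypothetical helper for illustration: maps each base to its pairing partner.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


# The top strand from the example gives the bottom strand.
print(complement("ATGGCATGA"))  # TACCGTACT
```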
DNA is antiparallel
In addition to being double-stranded and having complementary strands, DNA also has directionality.
This is because DNA has to be read in a particular order to be understood, much like human languages.
For example, we read and write English from left to right. DNA works similarly, but it is read from 5' to 3' (5 prime to 3 prime).
These directions come from the structure of the nucleotides, and we can leave that detail out. But know that the two strands are read in opposite directions.
Another way to say this is that the strands are 'antiparallel'. Here is the DNA example above rewritten with directions:
5'-ATGGCATGA-3'
3'-TACCGTACT-5'
Notice how the top strand starts at 5' and goes to 3'. If we were to read information on the bottom strand, we would actually be reading it backward from how it appears above, like this:
5'-TCATGCCAT-3'
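In code, reading the bottom strand in its own 5' to 3' direction amounts to reversing the string as it was written above. A minimal sketch, where `read_5_to_3` is a name made up for this example:

```python
def read_5_to_3(strand_written_3_to_5):
    """Reverse a strand written 3'->5' so that it reads 5'->3'."""
    return strand_written_3_to_5[::-1]


bottom_strand = "TACCGTACT"  # written 3' -> 5', as in the diagram above
print(read_5_to_3(bottom_strand))  # TCATGCCAT
```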
Problem: The reverse complement
Can you write a function that takes in a strand of DNA and outputs its complementary strand in the 5' --> 3' direction?
This is what biologists call the 'reverse complement'.
For example:
5'-ATGGCATGA-3' --> 5'-TCATGCCAT-3'
import unittest


def reverse_compliment(dna):
    # Your code here: return the reverse complement of dna, read 5' --> 3'.
    pass


class MyTestCase(unittest.TestCase):
    def test_rev_comp_is_correct(self):
        self.assertEqual(reverse_compliment("A"), "T")
        self.assertEqual(reverse_compliment("ATG"), "CAT")
        self.assertEqual(reverse_compliment("GATCCCTATTGGATATCTAGCATCATA"), "TATGATGCTAGATATCCAATAGGGATC")
        self.assertEqual(reverse_compliment("ATGGCATGA"), "TCATGCCAT")
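If you get stuck, here is one possible approach, offered as a sketch rather than the intended answer. It uses Python's built-in `str.maketrans` and `str.translate` to swap each base for its partner, then reverses the result. The name `reverse_complement_sketch` is made up here so it does not clash with your own function, and the code assumes uppercase input containing only A, C, G, and T.

```python
# One possible approach (an assumption, not the only way to solve the problem).
# Translation table: A<->T and C<->G.
_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement_sketch(dna):
    """Complement every base, then reverse so the result reads 5' -> 3'."""
    return dna.translate(_COMPLEMENT_TABLE)[::-1]


print(reverse_complement_sketch("ATGGCATGA"))  # TCATGCCAT
```

Any character outside the alphabet passes through `translate` unchanged, so you may want to validate the input first.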