Question
The DNA sequence of a human genome can be encoded as a string of A's, T's, G's, and C's three billion characters long. The letters represent the four DNA base nucleotides used to encode the genomes of all life on Earth. (A = Adenine, T = Thymine, G = Guanine, C = Cytosine).
i. Assuming an optimal fixed-length encoding for the nucleotide bases (A, T, G, C), how many megabytes are required to store one human genome? (1 byte = 8 bits, 1 megabyte = 10^6 bytes.)
ii. In the human genome, the base nucleotides are not equally probable. G's and C's together make up only about 40% of the genome, while A's and T's together make up about 60%. (The proportions of A's and T's are equal, as are the proportions of G's and C's.) According to Shannon information theory, what is the entropy of human DNA (i.e., the average information content per nucleotide)?
iii. You have discovered that the entropy is less than the 2 bits per nucleotide required by a fixed-length encoding (1.97 < 2). Perhaps we can do better with a variable-length encoding! Define a valid variable-length encoding that allocates fewer bits to the more frequent nucleotides. Draw the corresponding Huffman tree. Is this encoding actually more efficient for representing an entire human genome? Explain your answer by determining the expected number of bits per nucleotide resulting from your candidate encoding.
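The arithmetic behind all three parts can be checked with a short script. This is a minimal sketch, not an official solution: it assumes the per-base probabilities implied by part ii (A = T = 0.3, G = C = 0.2), and the function names and the candidate code A=0, T=10, G=110, C=111 are illustrative choices rather than anything given in the question.

```python
import heapq
import math

GENOME_LENGTH = 3_000_000_000  # nucleotides, as stated in the question
# Assumed per-base probabilities: 60% split evenly over A/T, 40% over G/C.
PROBS = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def fixed_length_megabytes(n_symbols, alphabet_size):
    """Storage in megabytes (10^6 bytes) for a fixed-length code."""
    bits_per_symbol = math.ceil(math.log2(alphabet_size))
    return n_symbols * bits_per_symbol / 8 / 10**6


def entropy_bits(probs):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs.values())


def huffman_code_lengths(probs):
    """Return the codeword length of each symbol in a Huffman tree."""
    # Heap entries are (probability, tie_breaker, {symbol: depth}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


def expected_length(probs, lengths):
    """Expected number of bits per symbol for the given code lengths."""
    return sum(probs[s] * lengths[s] for s in probs)


# Hypothetical candidate prefix code: A=0, T=10, G=110, C=111.
CANDIDATE_LENGTHS = {"A": 1, "T": 2, "G": 3, "C": 3}

if __name__ == "__main__":
    print("Part i   (MB):", fixed_length_megabytes(GENOME_LENGTH, 4))
    print("Part ii  (bits/base):", round(entropy_bits(PROBS), 4))
    lengths = huffman_code_lengths(PROBS)
    print("Part iii Huffman lengths:", lengths)
    print("Part iii Huffman bits/base:", expected_length(PROBS, lengths))
    print("Part iii candidate bits/base:",
          expected_length(PROBS, CANDIDATE_LENGTHS))
```

Under these assumptions, the fixed-length encoding needs 750 MB and the entropy comes out at roughly 1.97 bits per nucleotide. The Huffman construction first merges G and C (0.4), then A and T (0.6), so every base ends up at depth 2 and the result matches the fixed-length code at 2 bits per nucleotide. The skewed candidate code costs about 2.1 bits per nucleotide, so for this distribution a per-symbol variable-length code does not appear to beat the fixed-length encoding.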