Question

1 Approved Answer

Posted on Aug 27, 2024


Using Java

DNA is the fundamental encoding of the instructions that govern the operation of living cells and, by extension, biological organisms. You can think of DNA as a storage medium in which the program that executes within all of your cells is written. The "machine code" of DNA, corresponding to the byte-code of Java, consists of only four nucleotides, or bases, that are arranged in a linear sequence along the DNA molecule. These four bases are: guanine (G), adenine (A), thymine (T), and cytosine (C). So, a DNA molecule can be represented as a string made up of those four letters. The science of bioinformatics is largely concerned with computations on such genetic strings, or sequences. There are a variety of computations that one might perform on genetic sequences. We will investigate two types: basic statistics of individual sequences and pairwise alignments used to compare pairs of sequences.

Your program will first prompt the user to enter a single DNA sequence, which it should validate for legality (i.e., it contains only the four valid bases). You might do this validation by writing a function that takes a String as a parameter and returns a boolean. Re-prompt the user if the input was invalid. Once you have a valid input, compute the following statistics (each should be implemented as a separate function, called from main()):

- Count the number of occurrences of "C".
- Determine the fraction of cytosine and guanine nucleotides. For example, if half of the nucleotides in the sequence are either "C" or "G", the fraction should be 0.5.
- A DNA strand is actually made up of pairs of bases: in effect, two strands that are cross-linked together. These two strands are complementary: if you know one, you can always determine the other, or complement, because each nucleotide only pairs up with one other. In particular, "A" and "T" are complements, as are "C" and "G". So, for example, the complement of the sequence "AAGGT" would be "TTCCA". Compute the complement of the input sequence.
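The validation step and the three statistics could be sketched as below. This is only one possible layout, not a reference solution: the class name DnaStats and the method names isValid, countC, cgFraction, complement, and readValidSequence are made up for illustration. The sketch also assumes that only uppercase letters are legal, that an empty line is invalid, and that surrounding whitespace is trimmed; the assignment does not specify any of these. It uses charAt and StringBuilder rather than arrays (the String[] parameter of main is required by Java itself).

```java
import java.util.Scanner;

class DnaStats {
    // True only if seq is non-empty and every character is A, C, G, or T.
    // Assumption: lowercase letters and the empty string count as invalid.
    static boolean isValid(String seq) {
        if (seq.isEmpty()) {
            return false;
        }
        for (int i = 0; i < seq.length(); i++) {
            if ("ACGT".indexOf(seq.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    // Number of occurrences of 'C' in seq.
    static int countC(String seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) == 'C') {
                count++;
            }
        }
        return count;
    }

    // Fraction of characters that are 'C' or 'G' (seq must be non-empty).
    static double cgFraction(String seq) {
        int cg = 0;
        for (int i = 0; i < seq.length(); i++) {
            char base = seq.charAt(i);
            if (base == 'C' || base == 'G') {
                cg++;
            }
        }
        return (double) cg / seq.length();
    }

    // Complement strand: A<->T and C<->G, built one character at a time.
    static String complement(String seq) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < seq.length(); i++) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default: throw new IllegalArgumentException("not a valid base");
            }
        }
        return sb.toString();
    }

    // Keeps prompting until a valid sequence is entered.
    static String readValidSequence(Scanner in) {
        while (true) {
            System.out.print("Enter a DNA sequence: ");
            String line = in.nextLine().trim();
            if (isValid(line)) {
                return line;
            }
            System.out.println("Invalid sequence; use only A, C, G, and T.");
        }
    }

    public static void main(String[] args) {
        // Canned input keeps the demo deterministic; a real program would
        // pass System.in to the Scanner instead.
        Scanner in = new Scanner("AAXG\nAATCTATA\n");
        String seq = readValidSequence(in);
        System.out.println();
        System.out.println("C-count: " + countC(seq));
        System.out.println("CG-ratio: " + cgFraction(seq));
        System.out.println("Complement: " + complement(seq));
    }
}
```

In the demo, the first canned line ("AAXG") is rejected and the second ("AATCTATA") is accepted, which exercises the re-prompt loop without waiting on keyboard input.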

During reproduction, DNA sequences from both parents are replicated and "mixed" to form the DNA of their offspring. This process is not 100% accurate, and errors, or mutations, creep into the genome. Sometimes, these mutations have no effect, sometimes they are immediately lethal and the offspring isn't viable, and sometimes they result in changes in characteristics that may make the offspring more competitive when it comes time for it to breed (or may make it more competitive if there is an environmental change). This mutation process is one element that underlies evolution. A result of evolution is that, after the fact, you can compare two nucleotide sequences and test the hypothesis that they share an evolutionary history. Such comparison allows us to learn how modifications to DNA result in modifications of biochemical processes and physical characteristics. This is why sequence alignment techniques are important. We determine an alignment by comparing two sequences and seeing how well they match. A very simple method for this comparison is to look at corresponding nucleotides and compute a score for that potential alignment. If there are multiple potential alignments, then the one with the highest score would be considered most likely. For example, let's say that the two input sequences are "AATCTATA" and "AAGATA". There are three possible alignments:

AATCTATA
AAGATA

AATCTATA
 AAGATA

AATCTATA
  AAGATA

In general, mutations can be a substitution of one nucleotide for another (for example, a "G" being replaced by a "T"), an insertion that adds one or more nucleotides, or a deletion that deletes one or more nucleotides. To keep things simple, we will concern ourselves only with the first of these three: point mutations. For simple, gap-free alignments, we compute a score using a simple rule: if the two corresponding characters match, we add a match score of one (1); if they don't match, the match score is zero (0). The total score for an alignment is the sum of the character scores, and the alignment with the highest score is the best match. So, for example, the scores for the three alignments above are 4, 1, and 3, and the best alignment is the first one. You will use this simple alignment method in your program.
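The scoring rule can be sketched as a single method. The names AlignmentScore and score are illustrative, and the sketch assumes that the shorter string, once shifted right by offset positions, still fits entirely within the longer one (that is, offset + shorter.length() <= longer.length()).

```java
class AlignmentScore {
    // Adds 1 for every position where the shifted shorter string matches
    // the longer string, and 0 otherwise.
    // Assumption: offset + shorter.length() <= longer.length().
    static int score(String longer, String shorter, int offset) {
        int total = 0;
        for (int i = 0; i < shorter.length(); i++) {
            if (longer.charAt(offset + i) == shorter.charAt(i)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String longer = "AATCTATA";
        String shorter = "AAGATA";
        // Offsets 0, 1, and 2 correspond to the three alignments in the text,
        // whose scores are 4, 1, and 3.
        for (int offset = 0; offset + shorter.length() <= longer.length(); offset++) {
            System.out.println("offset " + offset + ": " + score(longer, shorter, offset));
        }
    }
}
```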

Your program will prompt, via the console, for the first sequence and compute its basic statistics, then prompt for, and validate, a second sequence. It will compute that second sequence's basic statistics, too. Then, your program will compute the scores for all possible alignments for those two strings (you will want to have a method that takes two strings, plus an offset for shifting the shorter string relative to the longer one, and returns an int score) and determine the best alignment score. Finally, it will print out a report of the results. Thus, for the two input sequences "AATCTATA" and "AAGATA", your program will output the following report:

Sequence 1: AATCTATA
C-count: 1
CG-ratio: 0.125
Complement: TTAGATAT

Sequence 2: AAGATA
C-count: 0
CG-ratio: 0.167
Complement: TTCTAT

Best alignment score: 4
AATCTATA
AAGATA

Note that this can be done by using arrays. However, please do not use any arrays in this program.
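A minimal sketch of the best-alignment search and the last part of the report, written without arrays, might look like the following. It covers only the alignment section of the report, not the per-sequence statistics. The names AlignmentReport, bestOffset, and alignmentReport are hypothetical. Two choices here are assumptions, not requirements from the assignment: ties go to the smallest offset, and the longer sequence is printed first with the shorter one indented beneath it.

```java
class AlignmentReport {
    // Match count when `shorter` is shifted `offset` positions to the right.
    static int score(String longer, String shorter, int offset) {
        int total = 0;
        for (int i = 0; i < shorter.length(); i++) {
            if (longer.charAt(offset + i) == shorter.charAt(i)) {
                total++;
            }
        }
        return total;
    }

    // Tries every offset at which the shorter string still fits and returns
    // the one with the highest score (first one wins on ties: an assumption).
    static int bestOffset(String longer, String shorter) {
        int best = 0;
        int bestScore = -1;
        for (int offset = 0; offset + shorter.length() <= longer.length(); offset++) {
            int s = score(longer, shorter, offset);
            if (s > bestScore) {
                bestScore = s;
                best = offset;
            }
        }
        return best;
    }

    // Builds the alignment part of the report as one string.
    static String alignmentReport(String first, String second) {
        String longer = first.length() >= second.length() ? first : second;
        String shorter = first.length() >= second.length() ? second : first;
        int offset = bestOffset(longer, shorter);

        StringBuilder sb = new StringBuilder();
        sb.append("Best alignment score: ")
          .append(score(longer, shorter, offset))
          .append('\n');
        sb.append(longer).append('\n');
        // Indent the shorter string by the winning offset.
        for (int i = 0; i < offset; i++) {
            sb.append(' ');
        }
        sb.append(shorter).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(alignmentReport("AATCTATA", "AAGATA"));
    }
}
```

Returning the report as a String rather than printing it directly keeps the formatting logic easy to check separately from the console I/O.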