Question

Posted on Sep 26, 2024

CODE THIS PROGRAM IN JAVA!!!

Introduction

DNA, or deoxyribonucleic acid, is a set of large molecules that make up our genetic blueprint. They are located in each of our cells, packaged as chromosomes. The entire sequence of DNA for all 23 human chromosomes makes up the 3 billion base pairs of what is called our genome. The genome for humans has been sequenced since 2003. It has allowed forensic scientists to identify people based on trace samples of DNA. These 3 billion base pairs are made up of the nucleotides cytosine, guanine, adenine, and thymine, abbreviated C, G, A, and T, in some pseudo-random combination. Some parts of the genome are fairly fixed in composition, while other parts allow for much diversity between individuals.

Identification of individuals

An example of this genetic diversity is shown in Short Tandem Repeats (STRs). These are short DNA sequences that repeat, as in AGCAGCAGC instead of AGCTGCAGCGTCAGC. The first sequence has "AGC" repeating three times. The second example also has "AGC" repeating in a sense, but has intervening bases in the sequence: TGC and GTC. AGCAGCAGC has AGC repeated in an uninterrupted manner, which is why we call that a "tandem repeat". The number of these repeats can vary among individuals and can be used to identify people. "AGCAGCAGC" consists of 3 tandem repeats of the sequence "AGC", while "AGCTGCAGCGTCAGC" consists of only 1 tandem repeat, even though AGC occurs 3 separate times. We are only interested in counting the maximum number of tandem repeats in a DNA sample. The tandem repeats can consist of any number of bases.

Let us suppose that we had a data file of individuals and the number of STR repeats. The first line consists of the number of people and the sequences in question for STR counts, each field separated by commas. All of the lines after that are the names of the individuals, followed by the number of STRs of each kind as specified on the first line. Example:

    3,AGTC,TTAC,GTCA
    Bob,4,5,7
    Medina,2,6,12
    Lealle,6,9,3

So, this means that in the second row, we get the name Bob, followed by the number of repeats of AGTC, TTAC, and GTCA in that order, the same order of appearance as the sequences in the first row. That is, AGTC repeats 4 times, TTAC 5 times, and GTCA 7 times in tandem in Bob's sample. Having all three of these STRs match up can be offered as pretty good evidence that the sample was Bob's. It is also possible that the combination of STRs doesn't match anyone in your database. If even one of the STR counts is off, it cannot be considered a match.

Your Task

Your task is to write a program that will take a sequence of DNA and a CSV file containing STR counts for a list of individuals and then output to whom the DNA (most likely) belongs.

Program Specification

Your program should open the CSV file and read its contents into an array. You may assume that the first row of the CSV file will be the column names. The first number will be the number of individuals (or rows of data), and the remaining columns will be the STR sequences themselves.

Your program should open the DNA sequence and read its contents into memory.

For each of the STRs (from the first line of the CSV file), your program should compute the longest run of consecutive repeats of the STR in the DNA sequence to identify.

If the STR counts match exactly with any of the individuals in the CSV file, your program should print out the name of the matching individual. You may assume that the STR counts will not match more than one individual.

If the STR counts do not match exactly with any of the individuals in the CSV file, your program should print "No match".

CSV stands for "comma-separated values"; such a file is also called a "comma-separated file". This means that fields are separated by commas and each record ends with a line break (carriage return).
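The "longest run of consecutive repeats" step in the specification can be sketched as follows. This is only one possible approach, not the assignment's reference solution: the class name StrCounter and the method name longestRun are made up for illustration. From every starting position it counts how many copies of the STR follow back to back and keeps the largest count. The two test strings are the uninterrupted AGC example and an interrupted variant with TGC and GTC between the AGC copies.

```java
/** Sketch: longest run of back-to-back copies of an STR within a DNA string. */
class StrCounter {

    // Returns the largest number of consecutive, uninterrupted copies of str in dna.
    // The name longestRun is hypothetical; the assignment does not prescribe one.
    static int longestRun(String dna, String str) {
        if (str.isEmpty()) {
            return 0; // an empty pattern has no meaningful repeat count
        }
        int best = 0;
        for (int start = 0; start + str.length() <= dna.length(); start++) {
            int count = 0;
            int pos = start;
            // Keep stepping forward one STR length while the next chunk still matches.
            while (dna.startsWith(str, pos)) {
                count++;
                pos += str.length();
            }
            best = Math.max(best, count);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRun("AGCAGCAGC", "AGC"));       // prints 3
        System.out.println(longestRun("AGCTGCAGCGTCAGC", "AGC")); // prints 1
    }
}
```

Checking every start position is quadratic in the worst case, which should be acceptable for sequences of the size shown in this assignment; a faster version could skip ahead past a run once it has been counted.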
Since a CSV file is nothing more than a plain ASCII text file, no special libraries will be used to deal with these files, other than what we did in class. CSV files contain no special formatting characters. There will be an object for the three sequences to search for. There will be an object holding an individual's name and the length of the STRs for each of the sequences. These can be stored as an array of objects. The length of the array is determined by the number in the first line of the data file, and can thus be dynamically assigned.

Sample Sessions

For the CSV file

    3,AGA,AAIG, TAIC
    A1108,3,2,3
    Cha:1:6, 6, 2,5

each of the following sequences would sit in a separate file, containing only the sequence. This will require separate runs. The nucleotide sequence is in a data file in one unbroken line, one data file per individual. To test your code, replace the old sequence with the new one in the same data file, using a text editor (which Replit provides).

The following sequence should match Alice:

    AASOCCRATAGACAAAA

The following sequence should match Bob:

    TANA PAIGAATGAASGAATG

The following sequence should match Charlie:

    TABATO

And the following sequence should have no match:

    CCIACACATCCALACATACASACAICICCICCACCAATCOSTECCATAAICAAT GAAT GAAT GAASGAAICAAT CACACACCICCAT GCTAGCCGCGCATC ICIA CIASCAACCCCTAC
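Putting the specification together, a minimal end-to-end sketch might look like the following. Everything named here is an assumption made for illustration: the class DnaMatcher, the nested Person class, the helper names, and the command-line usage (CSV path first, DNA path second). It also assumes that java.nio.file.Files counts as an acceptable way to read plain text, since the original does not say what was covered in class. For brevity the STR sequences are kept in a plain String array rather than a dedicated object. The demo data in main is invented and is not taken from the assignment's files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class DnaMatcher {

    /** One data row: a name plus one STR count per sequence in the header. */
    static class Person {
        final String name;
        final int[] counts;

        Person(String name, int[] counts) {
            this.name = name;
            this.counts = counts;
        }
    }

    // Splits one CSV line on commas and trims stray spaces around each field.
    static String[] fields(String line) {
        String[] parts = line.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Longest run of back-to-back copies of str in dna.
    static int longestRun(String dna, String str) {
        if (str.isEmpty()) {
            return 0;
        }
        int best = 0;
        for (int start = 0; start + str.length() <= dna.length(); start++) {
            int count = 0;
            for (int pos = start; dna.startsWith(str, pos); pos += str.length()) {
                count++;
            }
            best = Math.max(best, count);
        }
        return best;
    }

    // Returns the matching name, or "No match" if no row matches every count.
    static String identify(List<String> csvLines, String dna) {
        String[] header = fields(csvLines.get(0));
        int people = Integer.parseInt(header[0]); // first field: number of data rows
        String[] strs = Arrays.copyOfRange(header, 1, header.length);

        // Array sized from the count on the first line, as the assignment describes.
        Person[] database = new Person[people];
        for (int i = 0; i < people; i++) {
            String[] row = fields(csvLines.get(i + 1));
            int[] counts = new int[strs.length];
            for (int j = 0; j < strs.length; j++) {
                counts[j] = Integer.parseInt(row[j + 1]);
            }
            database[i] = new Person(row[0], counts);
        }

        // STR profile of the unknown sample, in header order.
        int[] sample = new int[strs.length];
        for (int j = 0; j < strs.length; j++) {
            sample[j] = longestRun(dna, strs[j]);
        }

        for (Person p : database) {
            if (Arrays.equals(p.counts, sample)) {
                return p.name; // every count must agree for a match
            }
        }
        return "No match";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Assumed usage: java DnaMatcher <csv file> <dna file>
            List<String> csv = Files.readAllLines(Paths.get(args[0]));
            String dna = String.join("", Files.readAllLines(Paths.get(args[1]))).trim();
            System.out.println(identify(csv, dna));
        } else {
            // Made-up demo data, not taken from the assignment's files.
            List<String> csv = Arrays.asList("2,AATG,TATC", "Ann,2,3", "Ben,4,1");
            System.out.println(identify(csv, "CCAATGAATGGGTATCTATCTATCA")); // prints Ann
            System.out.println(identify(csv, "AATGTATC"));                  // prints No match
        }
    }
}
```

Comparing the whole count array with Arrays.equals reflects the rule that a single differing STR count rules out a match. Input validation (missing rows, non-numeric counts) is left out to keep the sketch short.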