Question: You have been hired by a genome lab to write a java program that will read in information from a text file and produce specific

You have been hired by a genome lab to write a Java program that will read in information from a text file and produce specific results to the screen and an output file.

Concepts

Arrays of objects
Plain text file input/output
More practice writing supplier code to a specification

Background Information About DNA

Note: This section explains some information from the field of biology that is related to this project. It is for your information only; you do not need to fully understand it to complete the code.

Deoxyribonucleic acid (DNA) is a complex biochemical macromolecule that carries genetic information for cellular life forms and some viruses. DNA is also the mechanism through which genetic information from parents is passed on during reproduction. DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). DNA has a double-helix structure (see diagram below) containing complementary chains of these four nucleotides connected by hydrogen bonds.

Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they're called "protein-coding" genes). These proteins are responsible for carrying out most of the life processes of the organism.

Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA). Each codon uniquely encodes a single amino acid, a building block of proteins.

The process of building proteins from DNA has two major phases called transcription and translation, in which a gene is replicated into an intermediate form called mRNA, which is then processed by a structure called a ribosome to build the chain of amino acids encoded by the codons of the gene.

The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA). Not all regions of DNA are genes; large portions that do not lie between a valid start and stop codon are called intergenic DNA and have other (possibly unknown) functions. Computational biologists examine large DNA data files to find patterns and important information, such as which regions are genes. Sometimes they are interested in the percentages of mass accounted for by each of the four nucleotide types. Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data. For more information, visit the Wikipedia page about DNA: http://en.wikipedia.org/wiki/DNA

Input Data Files

Your program will be reading DNA information from text files. The DNA input file starts with an integer, which is the number of name/nucleotide sequence pairs in the file. The rest of the lines are treated as pairs. The first line of the pair is the name of the nucleotide sequence; the second line of the pair is the sequence itself. Each character in the nucleotide sequence will be A, C, G, T, or a dash character (-). The nucleotides can be either lower or upper case. The dash characters represent junk or garbage regions of the sequence.
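
One way to read this layout is sketched below. This is only an illustration under assumptions: the class name InputFormatSketch and the method readPairs are hypothetical and not part of the required public interface, and the count is read with nextLine() so that no stray newline is left in the Scanner before the first name.

```java
import java.util.Scanner;

/** Hypothetical helper; not part of the required public interface. */
final class InputFormatSketch {

    /**
     * Reads the count line, then that many name/sequence line pairs.
     * Returns a two-column array: [i][0] is the name, [i][1] is the sequence.
     * Like the assignment allows, this simply fails on a malformed file.
     */
    static String[][] readPairs(Scanner source) {
        int count = Integer.parseInt(source.nextLine().trim());
        String[][] pairs = new String[count][2];
        for (int i = 0; i < count; i++) {
            pairs[i][0] = source.nextLine(); // name line
            pairs[i][1] = source.nextLine(); // nucleotide line
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample data in the described format.
        String sample = "2\nmade up sequence\nAtcGTAA-TC\nsecond\nATGCCCTAA\n";
        for (String[] pair : readPairs(new Scanner(sample))) {
            System.out.println(pair[0] + ":" + pair[1]);
        }
    }
}
```

In the real SequenceSet constructor the pairs would become Sequence objects stored in a Sequence[] of length count, rather than a String[][].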

Here are two sample files: dna1.txt and dna2.txt.

Your program does not have to be responsible for files that do not match this format (in other words, if the end user enters a filename with bad data and the program crashes, that's OK). You can create any text file you want for testing (use a program like Notepad or any other basic text editor).

Program Overview

Your program will read DNA data from a file. For each named sequence, produce the following:

A count of the occurrences of the four nucleotides (A, C, G, and T)
The mass percentage occupied by each nucleotide type in the sequence.
All the codons in the sequence. Note that any junk is skipped. For example, if the sequence is "AC-TGA--TacT", the codons are ACT, GAT, and ACT.
Whether this sequence is a protein-coding gene. For our purposes, a protein-coding gene is a string that:
- begins with the start codon ATG
- ends with one of the following stop codons: TAA, TAG, or TGA
- is at least 3 codons long (including the start and stop codons)
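
The codon and protein rules above can be sketched as follows. The names CodonSketch, codons, and isProtein are hypothetical stand-ins for the logic that would live inside Sequence. One assumption is labeled in the code: when the cleaned sequence length is not a multiple of 3, the leftover nucleotides are dropped, which the assignment does not specify.

```java
import java.util.Arrays;

/** Hypothetical illustration of the codon and protein rules. */
final class CodonSketch {

    /** Splits a sequence into upper-case codons, skipping junk dashes. */
    static String[] codons(String dna) {
        String clean = dna.replace("-", "").toUpperCase();
        // ASSUMPTION: trailing nucleotides that do not fill a codon are dropped.
        String[] result = new String[clean.length() / 3];
        for (int i = 0; i < result.length; i++) {
            result[i] = clean.substring(3 * i, 3 * i + 3);
        }
        return result;
    }

    /** True if the codons start with ATG, end with a stop codon, and number at least 3. */
    static boolean isProtein(String dna) {
        String[] c = codons(dna);
        if (c.length < 3) {
            return false;
        }
        String last = c[c.length - 1];
        boolean stops = last.equals("TAA") || last.equals("TAG") || last.equals("TGA");
        return c[0].equals("ATG") && stops;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(codons("AC-TGA--TacT")));
        System.out.println(isProtein("ATGCCCTAA"));
    }
}
```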

The program behavior is as follows:

Display a brief user introduction
Prompt for input file name
Prompt for output file name
Process the input file and write results to the output file
Display other results to System.out (see below)

The code needs to calculate the total mass as well as the mass percentages for each nucleotide. To compute these values, use the following mass of each nucleotide (grams/mol). Notice that the junk regions do add to the total mass:

Adenine (A): 135.128
Cytosine (C): 111.103
Guanine (G): 151.128
Thymine (T): 125.107
Junk (-): 100.000

For example, the mass of the sequence AtcGTAA-TC is

(135.128 * 3 + 111.103 * 2 + 151.128 * 1 + 125.107 * 3 + 100 * 1) which equals 1254.039.

Of that mass:

405.384 is from 3 Adenine (32.3%)
222.206 is from 2 Cytosine (17.7%)
151.128 is from 1 Guanine (12.1%)
375.321 is from 3 Thymine (29.9%)
100 is from 1 dash (8%)
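
The arithmetic in this worked example can be checked with a short sketch. The class MassSketch and its method names are hypothetical; only the mass constants come from the table above. It assumes a non-empty sequence, since an empty one would give a total mass of zero and undefined percentages.

```java
/** Hypothetical illustration of the mass calculation. */
final class MassSketch {
    // Masses in grams/mol from the table above, in the order A, C, G, T.
    static final String NUCLEOTIDES = "ACGT";
    static final double[] MASSES = {135.128, 111.103, 151.128, 125.107};
    static final double JUNK_MASS = 100.000;

    /** Mass contributed by each of A, C, G, T; junk is excluded here. */
    static double[] massByNucleotide(String dna) {
        double[] mass = new double[4];
        for (char ch : dna.toUpperCase().toCharArray()) {
            int idx = NUCLEOTIDES.indexOf(ch);
            if (idx >= 0) {
                mass[idx] += MASSES[idx];
            }
        }
        return mass;
    }

    /** Total mass, including 100.000 for every junk dash. */
    static double totalMass(String dna) {
        double total = 0.0;
        for (double m : massByNucleotide(dna)) {
            total += m;
        }
        for (char ch : dna.toCharArray()) {
            if (ch == '-') {
                total += JUNK_MASS;
            }
        }
        return total;
    }

    /** Percentage of the total mass due to each of A, C, G, T (unrounded). */
    static double[] percentages(String dna) {
        double[] mass = massByNucleotide(dna);
        double total = totalMass(dna);
        double[] pct = new double[4];
        for (int i = 0; i < 4; i++) {
            pct[i] = 100.0 * mass[i] / total;
        }
        return pct;
    }

    public static void main(String[] args) {
        System.out.printf("%.3f%n", totalMass("AtcGTAA-TC"));
        for (double p : percentages("AtcGTAA-TC")) {
            System.out.printf("%.1f%n", p);
        }
    }
}
```

Because the junk mass is part of the denominator but not of any of the four percentages, the four values sum to less than 100 whenever the sequence contains a dash.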

Output File Format

Here is a sample of the file output with the format your program needs to follow.

Seq Name: made up sequence
Sequence: AtcGTAA-TC
N Counts: [3, 2, 1, 3]
Tt Mass%: 1254.0 with [32.3, 17.7, 12.1, 29.9]
Codons : [ATC, GTA, ATC]
Protein?: NO

Seq Name: ...
Sequence: ...
N Counts: ...
Tt Mass%: ...
Codons : ...
Protein?:

Things to note about the output file:

You can use the Arrays.toString() method to print the arrays of data
Line 4 lists the total mass value and the mass percentages of each nucleotide. All values are rounded to 1 decimal place.
Note that the percentages don't add up to 100%. That's because the percentages are just for A, C, G, and T, where the total mass includes the junk information.
There is a blank line between the data for each sequence from the original input file.
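
A minimal sketch of producing one record in this format follows. The class ReportSketch and the methods round1 and formatRecord are hypothetical, and the values are passed in as parameters so the sketch stands alone. The single space after each label is an assumption based on the sample text; the exact column spacing of the original sample is not known.

```java
import java.util.Arrays;

/** Hypothetical illustration of the output record layout. */
final class ReportSketch {

    /** Rounds a value to 1 decimal place. */
    static double round1(double value) {
        return Math.round(value * 10) / 10.0;
    }

    /** Builds the six lines for one sequence; label spacing is an assumption. */
    static String formatRecord(String name, String dna, int[] counts,
                               double totalMass, double[] percentages,
                               String[] codons, boolean protein) {
        double[] rounded = new double[percentages.length];
        for (int i = 0; i < percentages.length; i++) {
            rounded[i] = round1(percentages[i]);
        }
        return "Seq Name: " + name + "\n"
             + "Sequence: " + dna + "\n"
             + "N Counts: " + Arrays.toString(counts) + "\n"
             + "Tt Mass%: " + round1(totalMass) + " with " + Arrays.toString(rounded) + "\n"
             + "Codons : " + Arrays.toString(codons) + "\n"
             + "Protein?: " + (protein ? "YES" : "NO") + "\n";
    }

    public static void main(String[] args) {
        System.out.print(formatRecord("made up sequence", "AtcGTAA-TC",
                new int[] {3, 2, 1, 3}, 1254.039,
                new double[] {32.326, 17.719, 12.051, 29.929},
                new String[] {"ATC", "GTA", "ATC"}, false));
    }
}
```

When writing to the output file, a blank line (for example an extra println() on the PrintStream) would follow each record.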

Code Specification

Implement the two classes below, each in its own file. To get full credit, public interfaces must match these descriptions exactly.

class Sequence

I leave this up to you to determine

+ Sequence(String name, String dnaSequence) -- initializes this object with the given values. Throws NullPointerException if either of the references is null.

+ String getName() -- returns the recorded name of this object

+ String getNucleotides() -- returns the nucleotide sequence of this object

+ int[] getCounts() -- returns the counts of the 4 nucleotides in this Sequence. The order of the counts is [A, C, G, T]

+ double[] getMass() -- returns the mass of the 4 nucleotides in this Sequence. The order of the data is [A, C, G, T]

+ double getTotalMass() -- returns the total mass of this Sequence

+ String[] getCodons() -- returns all of the codons in the sequence.

+ String toString() -- returns a String representation of this object, of the form "Name:dnaSequence"

+ boolean isProtein() -- answers whether this Sequence is a protein or not
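
The constructor's null handling and the toString() form can be sketched as below. The class is deliberately named SequenceCtorSketch, a hypothetical name, so it is not mistaken for the required Sequence class; the field names are also assumptions. Objects.requireNonNull is one standard way to throw NullPointerException for a null argument.

```java
import java.util.Objects;

/** Hypothetical partial sketch of the Sequence constructor contract. */
final class SequenceCtorSketch {
    private final String name;
    private final String nucleotides;

    /** Throws NullPointerException if either reference is null. */
    SequenceCtorSketch(String name, String dnaSequence) {
        this.name = Objects.requireNonNull(name, "name must not be null");
        this.nucleotides = Objects.requireNonNull(dnaSequence, "dnaSequence must not be null");
    }

    String getName() {
        return name;
    }

    String getNucleotides() {
        return nucleotides;
    }

    /** Returns the "Name:dnaSequence" form described above. */
    @Override
    public String toString() {
        return name + ":" + nucleotides;
    }

    public static void main(String[] args) {
        System.out.println(new SequenceCtorSketch("made up sequence", "AtcGTAA-TC"));
    }
}
```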

class SequenceSet

I leave this up to you to determine. You are limited to only using arrays for this implementation, no other Java data structures.

+ SequenceSet(Scanner source) -- initializes a SequenceSet object to hold Sequences, using the data from the Scanner. Note: this will crash if the file is not structured the correct way. That's fine.

+ int getCount() -- returns the number of Sequences read in

+ Sequence get(int i) -- returns the Sequence at the specified index. The indexing matches the ordering of the data read in.

+ Sequence maxMass() -- returns the Sequence from this set with the largest total mass

+ Sequence[] getClosest() -- returns an array containing two Sequence objects that have the closest total mass. These can be any two Sequence objects in the SequenceSet, not necessarily consecutive objects.

+ Sequence findSequence(String name) -- returns the Sequence object in this SequenceSet with the given name. Returns null if that name is not found.

+ String toString() -- returns a String that represents the state of this object. The String should contain the toString() for each Sequence object, separated by a new line character.
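
Because getClosest() may pair any two Sequences, not just neighbors, one straightforward approach that uses only arrays is to compare every pair. The sketch below works on a plain double[] of total masses and returns the two indices; the class and method names are hypothetical, and it assumes the set holds at least two entries. Ties are resolved in favor of the first pair found, which is also an assumption.

```java
import java.util.Arrays;

/** Hypothetical illustration of a closest-pair search over total masses. */
final class ClosestPairSketch {

    /** Returns the indices of the two entries whose values differ the least. */
    static int[] closestPair(double[] masses) {
        if (masses.length < 2) {
            throw new IllegalArgumentException("need at least two masses");
        }
        int bestI = 0;
        int bestJ = 1;
        double bestGap = Math.abs(masses[0] - masses[1]);
        // Compare every unordered pair (i, j) with i < j.
        for (int i = 0; i < masses.length; i++) {
            for (int j = i + 1; j < masses.length; j++) {
                double gap = Math.abs(masses[i] - masses[j]);
                if (gap < bestGap) {
                    bestGap = gap;
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }

    public static void main(String[] args) {
        double[] masses = {1254.0, 900.5, 1250.0, 300.0}; // made-up values
        System.out.println(Arrays.toString(closestPair(masses)));
    }
}
```

In SequenceSet, the same loops would call getTotalMass() on the stored Sequence objects and return a Sequence[] of length 2. The maxMass() method follows the same pattern with a single loop that tracks the largest value seen.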

User Interface

This program has a relatively simple user interface:

Brief user introduction
Prompt for 2 file names, one for input and one for output. Your program should not crash or end if the user gives you a filename that can't be read. Instead, the program should display a helpful error message and prompt for a new name.
Read the data from the input file.
Write to the output file using the output file format noted above.
To the user, display
- the Sequence with the largest total mass
- the two Sequence objects with the closest total mass

Put the user interface into its own class. There should be 3 Java files in this project.
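
The re-prompting behavior for an unreadable input file can be sketched with a loop around the Scanner(File) constructor, which throws FileNotFoundException when the file cannot be opened. The class PromptSketch, the method openInput, and the message wording are all hypothetical; the assignment notes that try/catch is optional, so this is just one possible approach.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/** Hypothetical illustration of prompting until a readable file is named. */
final class PromptSketch {

    /** Keeps asking on the given console Scanner until a file opens. */
    static Scanner openInput(Scanner console) {
        while (true) {
            System.out.print("Input file name: ");
            String name = console.nextLine().trim();
            try {
                return new Scanner(new File(name));
            } catch (FileNotFoundException e) {
                // Helpful message, then loop around and prompt again.
                System.out.println("Cannot read \"" + name + "\". Please try another name.");
            }
        }
    }

    public static void main(String[] args) {
        Scanner input = openInput(new Scanner(System.in));
        System.out.println("First line: " + input.nextLine());
        input.close();
    }
}
```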

Suggestions

I hope by this point in the quarter you appreciate the benefit of working on pieces of your solution one at a time. I recommend building and testing your Sequence class first. Next, work on reading in the data (the SequenceSet constructor). Then tackle one or two methods at a time.
If it makes sense to decompose any of these methods, do so. However, helper methods should all be private. Only these methods listed here should be part of the public interface.
Be aware of when a method needs to include the FileNotFoundException throws clause (unless you want to use try/catch blocks, but that is not a requirement for this assignment).

Documentation and Style

Make sure to write complete Javadoc comments for each class and each public method.
Include sufficient internal, algorithm documentation.
Use appropriate style (variable names, indenting, class constants, etc.) throughout.
