
Question



A DNA string is a sequence of the letters a, c, g, and t in any order. For example, aacgtttgtaaccag is a DNA string of length 15. Each sequence of three consecutive letters is called a codon. For example, in the preceding string, the codons are aac, gtt, tgt, aac, and cag. If we ignored the first letter and started listing the codons starting at the second a, the codons would be acg, ttt, gta, and acc, and we would ignore the last ag. In this exercise, for simplicity, we will assume that we always start reading the codons at the first letter of the string.
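
The reading-frame idea above can be tried out directly in the shell. The following is a minimal sketch, assuming the standard fold, grep, and cut utilities are available; the function name list_codons is hypothetical and not part of the exercise.

```shell
#!/bin/sh
# list_codons (hypothetical helper): print each complete codon of the
# DNA string given as $1, one per line, reading from the first letter.
list_codons() {
    # fold -w 3 wraps the input into lines of at most 3 characters;
    # grep then keeps only the lines that are exactly 3 characters long,
    # which drops an incomplete group at the end of the string.
    printf '%s\n' "$1" | fold -w 3 | grep -E '^.{3}$'
}

# Reading from the first letter: aac gtt tgt aac cag (one per line).
list_codons aacgtttgtaaccag

# Skipping the first letter with cut shifts the frame:
# acg ttt gta acc (one per line); the trailing ag is dropped.
list_codons "$(printf '%s\n' aacgtttgtaaccag | cut -c2-)"
```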

A DNA string can be hundreds of thousands of codons long, even millions of codons long, which means that it is infeasible to count them by hand. It would be useful to have a simple script that could count the number of occurrences of a specific codon in such a string. For instance, for the example string above such a script would tell us that aac occurs twice and tgt occurs once.

Your job is to write a script named countcodons that expects two arguments on the command line. The first argument is a three-letter codon string such as aaa or cgt. The second argument is the pathname of a file containing a valid DNA string with no newline characters or white space characters of any kind within it. This file contains nothing but a sequence of the letters a, c, g, and t. If your script is given two valid arguments, it will output a single number, which is the number of occurrences of the codon given as argument 1 in the file given as argument 2. If it finds no occurrences, it should output 0. For example, if the string aacgtttgtaaccagaac is in a file named dnafile, then your script should work like this:

$ countcodons ttt dnafile
1

$ countcodons aac dnafile
3

$ countcodons ccc dnafile
0
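
One possible way to do the counting is sketched below, assuming fold is used to split the file into three-letter lines and grep -c to count exact matches. The function name count_codons is hypothetical, and the temporary file stands in for dnafile. This is a sketch under those assumptions, not the required solution.

```shell
#!/bin/sh
# count_codons (hypothetical helper): print how many in-frame codons in
# the file $2 are equal to the codon $1.
count_codons() {
    # fold -w 3 puts one codon per line. grep -c counts matching lines,
    # -x requires the whole line to match, -F treats the codon as a
    # fixed string. grep exits non-zero when the count is 0, so "|| true"
    # keeps the function's exit status clean.
    fold -w 3 "$2" | grep -c -x -F -- "$1" || true
}

# Demo on the sample string from the exercise, written without a newline.
tmp=$(mktemp)
printf 'aacgtttgtaaccagaac' > "$tmp"
count_codons aac "$tmp"   # prints 3
count_codons ccc "$tmp"   # prints 0
rm -f "$tmp"
```

Because this sketch counts only codons that start at positions 1, 4, 7, and so on, it reports 0 for ttt on the sample string: the letters ttt appear there only across the boundary between gtt and tgt.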

Warning: if it is given valid arguments, the script is not to output anything but a number. No fancy messages, no words - just a number! The script should check that it has two arguments and if it does not, it should print a how-to-use-me and then exit. It is not required to check that the file is in the proper form, or that the string is actually a codon. However, for +3 extra credit, it should print an error message and exit if the file cannot be opened or if it is not a file containing only the four letters a, c, g, and t. It must do both to receive the credit.
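
The argument checks described above might look like the following sketch. The names usage and check_args and the wording of the messages are hypothetical. The content check is strict: any character other than a, c, g, or t, including a trailing newline, is treated as invalid, which follows the statement that the file contains nothing but those four letters.

```shell
#!/bin/sh
# usage (hypothetical helper): print a how-to-use-me message on stderr.
usage() {
    echo "usage: countcodons codon dnafile" >&2
}

# check_args (hypothetical helper): return 0 if the arguments look valid,
# otherwise print a message on stderr and return 1.
check_args() {
    if [ "$#" -ne 2 ]; then
        usage
        return 1
    fi
    # The second argument must be a readable regular file.
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "countcodons: cannot open $2" >&2
        return 1
    fi
    # Delete every a, c, g, and t; anything left over is an invalid byte.
    leftover=$(tr -d 'acgt' < "$2" | wc -c | tr -d ' ')
    if [ "$leftover" -ne 0 ]; then
        echo "countcodons: $2 is not a valid DNA file" >&2
        return 1
    fi
    return 0
}

# Demo with a temporary file standing in for dnafile.
tmp=$(mktemp)
printf 'aacgtttgtaaccagaac' > "$tmp"
check_args aac "$tmp" && echo "arguments accepted"
check_args aac || echo "wrong number of arguments rejected"
rm -f "$tmp"
```

In a real script the caller would exit with a non-zero status when check_args fails, so that nothing but the count is ever written to standard output.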

Hint: You will not be able to solve this problem using the grep command alone. There are a number of commands that might be useful, such as sort, cut, fold, and uniq. One of these commands is the key that makes this task easy to solve. Find out which one it is and use it.

Part 2 (20 points)

  1. Create a directory.

  2. In the directory, download the 4 text files uploaded as part of the assignment. Each of those files has 600 rows and 12 columns.

  3. Write a script in the directory that goes through the column the user specifies (1-10) of all 4 files and calculates the min and the max values. This means that if the user specifies column 5, then the script goes through column 5 of all 4 files, and together they should give only one min and one max. You subtract the min from the max and then divide that value by two (the exercise calls this value the average). You then replace values in the specified column of all 4 files that are less than the average AND greater than the min with ttt, and ones that are greater than or equal to the average AND less than the max with gcc.

Here is how the script should work. The script asks the user:

Please enter the column you wish to change: 7

In this case the user enters column 7. The script goes through column 7 of all 4 files and arrives at one max and one min. It subtracts the min from the max and divides the difference by 2 to give J. Then it goes through column 7 of all 4 files and replaces any value less than J and greater than the min with ttt, and any value greater than or equal to J and less than the max with gcc.
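
The min, max, J, and replacement steps can be sketched as follows. This sketch uses awk for both the numeric comparison and the substitution rather than sed, because comparing numbers is awkward in sed. It assumes whitespace-separated numeric columns with no blank lines. The function name replace_column and the demo file names are hypothetical. Each file is rewritten from a shell variable, so no extra files are left behind, but runs of spaces or tabs between columns are normalized to single spaces.

```shell
#!/bin/sh
# replace_column (hypothetical helper): replace_column COL FILE...
# Finds one min and one max of column COL across all files, computes
# J = (max - min) / 2, and rewrites the column in every file.
replace_column() {
    col=$1
    shift
    # NR keeps counting across files, so this is one pass over all of them.
    range=$(awk -v c="$col" '
        NR == 1 { min = max = $c + 0 }
        { v = $c + 0; if (v < min) min = v; if (v > max) max = v }
        END { printf "%.17g %.17g\n", min, max }' "$@")
    min=${range% *}
    max=${range#* }
    j=$(awk -v a="$min" -v b="$max" 'BEGIN { printf "%.17g\n", (b - a) / 2 }')
    for f in "$@"; do
        # Build the new contents in memory, then overwrite the file.
        new=$(awk -v c="$col" -v min="$min" -v max="$max" -v j="$j" '
            {
                v = $c + 0
                if (v > min && v < j) $c = "ttt"
                else if (v >= j && v < max) $c = "gcc"
                print
            }' "$f") && printf '%s\n' "$new" > "$f"
    done
}

# Demo on two small made-up files: column 2 has min 2, max 20, J 9.
dir=$(mktemp -d)
printf '1 2\n2 5\n' > "$dir/a.txt"
printf '3 12\n4 20\n' > "$dir/b.txt"
replace_column 2 "$dir/a.txt" "$dir/b.txt"
cat "$dir/a.txt" "$dir/b.txt"
rm -rf "$dir"
```

In the demo, 5 lies between the min and J and becomes ttt, 12 lies between J and the max and becomes gcc, and the min and max themselves are left unchanged.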

Please make sure you check the user input. If the user puts in anything that is not 1-10 then, you should give them an error message.
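
A minimal sketch of the input check, assuming the column arrives as a string read from the user; the function name valid_column is hypothetical. A case pattern accepts exactly the strings 1 through 10 and rejects everything else, including empty input and text.

```shell
#!/bin/sh
# valid_column (hypothetical helper): succeed only if $1 is one of the
# strings 1, 2, ..., 10.
valid_column() {
    case $1 in
        [1-9]|10) return 0 ;;
        *)        return 1 ;;
    esac
}

# Non-interactive demo. In the real script the value would come from:
#   printf 'Please enter the column you wish to change: '; read -r col
for col in 7 10 0 11 abc; do
    if valid_column "$col"; then
        echo "$col: ok"
    else
        echo "$col: error, please enter a number from 1 to 10"
    fi
done
```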

Please note that no new files should be created as a result of this script running. In addition, I'd advise you to look into sed for replacing values in files. (Hints: sed (replacing words in files), awk (specifying columns), loops)

Part 3

I want you to write another script called Project that runs the script in part one first and then the script in part two.
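
Running the two scripts in order could be sketched like this. The exercise does not name the Part 2 script, so changecolumn is a hypothetical name, as is the function run_project. The sketch assumes both scripts live in one directory and that any extra arguments are meant for countcodons.

```shell
#!/bin/sh
# run_project (hypothetical helper): run_project DIR [countcodons args...]
# Runs DIR/countcodons first, then DIR/changecolumn. Both are invoked
# through sh, so they do not need to be marked executable.
run_project() {
    dir=$1
    shift
    # Stop if the first script fails, so Part 2 never runs after an error.
    sh "$dir/countcodons" "$@" || return 1
    sh "$dir/changecolumn"
}

# Example call, assuming both scripts are in the current directory:
#   run_project . aac dnafile
```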
