Question

1 Approved Answer

Posted on Sep 25, 2024

REFER TO 1 QUESTION POSTED BEFORE: https://www.chegg.com/homework-help/questions-and-answers/working-fast-files-create-4-python-scripts-named-ntfastastatspy-secondarystructuresplitter-q108882167?trackid=DxjEbjOD

2main. In your second program (nt_fasta_stats.py) you are to open the following file. (Note: this file has been gzipped. Right-click, "Save Link As", and make sure to gunzip it before using.)

Store the data in two lists, as we did before in step 1. For each sequence, I would like to know the number of A's, T's, G's, C's, and any N's. I'd also like to know the length of the sequence and the %GC content of the entire sequence.

Pseudocode for your program:

1. Get a filehandle to the sequences in the FASTA file (use the get_filehandle function; the name of the input file will come from a command line option).

2. Open one outfile (use the get_filehandle function; the name of the output file will come from a command line option).

3. Loop over the FASTA filehandle and store the data in two lists, one for the header lines and the other for the sequence data (use the get_fasta_lists function, see below).

4. Process the lists and produce the output shown below (use the output_results_to_files function, see below).

Examples of the program being run:

1. $ python3 nt_fasta_stats.py --infile influenza.fasta --outfile influenza.stats.txt

2. $ python3 nt_fasta_stats.py -h
usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE

Provide a FASTA file to generate nucleotide statistics

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Path to file to open
  -o OUTFILE, --outfile OUTFILE
                        Path to file to write

3. $ python3 nt_fasta_stats.py
usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE
nt_fasta_stats.py: error: the following arguments are required: -i/--infile, -o/--outfile

4. $ python3 nt_fasta_stats.py --infile influenza.fasta
usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE
nt_fasta_stats.py: error: the following arguments are required: -o/--outfile
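The command line behavior in the examples above is what argparse produces when both options are marked as required. A minimal sketch follows; the function name get_cli_args and its argv parameter are hypothetical and not part of the assignment.

```python
import argparse


def get_cli_args(argv=None):
    """Parse the command line (hypothetical helper name).

    Passing argv=None makes argparse read sys.argv[1:].
    """
    parser = argparse.ArgumentParser(
        description="Provide a FASTA file to generate nucleotide statistics")
    parser.add_argument("-i", "--infile", dest="infile", type=str,
                        help="Path to file to open", required=True)
    parser.add_argument("-o", "--outfile", dest="outfile", type=str,
                        help="Path to file to write", required=True)
    return parser.parse_args(argv)
```

With required=True, argparse itself prints the usage line and the "the following arguments are required" error, so no manual option checking is needed for those cases. On Python 3.10 and later, the help heading reads "options:" rather than "optional arguments:".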

You must implement the following functions. Name the functions exactly as instructed below, provide the same arguments, and call them in the same context as instructed. Failure to do so will result in points being deducted. You can use default or non-default variables.

1. Write a function (call it get_filehandle, and implement it the same as above)

2. Write a function (call it get_fasta_lists, and implement it the same as above)

3. Write a function (call it _verify_lists, and implement it the same as above)

4. Write a function (call it output_results_to_files) that receives three arguments: 1) the header list found in step 2 directly above; 2) the sequence list found in step 2 directly above; 3) the output filehandle the stats will be written to. This is the main function of this program, since it prints the top line of the output (see below) and each sequence's numerical values. It calls two helper functions (_get_ncbi_accession and _get_num_nucleotides, see below) for each sequence before printing that sequence's data. I can call this function like this:

    # send off the lists
    # process the sequences and print out the data
    output_results_to_files(list_headers, list_seqs, fh_out)

5. Write a function (call it _get_num_nucleotides) that receives two arguments: 1) the character to count in the DNA sequence; 2) the sequence data for that entry (a string). I can call this function like this:

    a_nt = _get_num_nucleotides('A', seq)
    c_nt = _get_num_nucleotides('C', seq)
    ...

The function should only accept A, G, C, T, or N. If any other character is given, it should end the program with sys.exit("Did not code this condition"). For example, if I called this function like this:

    _get_num_nucleotides('Y', seq)

the function would call sys.exit("Did not code this condition").

6. Write a function (call it _get_ncbi_accession) that receives one argument: 1) a string that is the header of the sequence. It returns the accession number. I can call this function like this:

    accession_string = _get_ncbi_accession(header_string)

The tab-delimited output file named by the command line option should look like this (GC% shown to 1 decimal place):

Number Accession A's G's C's T's N's Length GC%

1 EU521893 20 20 20 20 0 80 50.0

The numbers above are not accurate, and the Number column is just incremented with each new sequence. The first header line in the sequence input file should produce EU521893 in the output file. Also, the white space above represents a single tab between each value, so you can easily open the file in Excel!

If you implemented these programs correctly, you should end up with a very short main (4-5 lines of code). This does not include the command line options, checking the options, closing the filehandles, or comments.
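A sketch of how output_results_to_files might produce this layout is shown here. Compact stand-ins for the two helper functions are included only so the block runs on its own. Two assumptions are made: Length is the full length of the sequence string, and GC% is (G + C) divided by that length, so any N's count toward the denominator.

```python
import sys


def _get_num_nucleotides(nucleotide, sequence):
    """Compact stand-in: count one allowed nucleotide in the sequence."""
    if nucleotide not in {"A", "G", "C", "T", "N"}:
        sys.exit("Did not code this condition")
    return sequence.count(nucleotide)


def _get_ncbi_accession(header_string):
    """Compact stand-in; ASSUMES the accession is the first token."""
    return header_string.lstrip(">").split()[0]


def output_results_to_files(list_headers, list_seqs, fh_out):
    """Write the header row, then one tab-delimited row per sequence."""
    columns = ["Number", "Accession", "A's", "G's", "C's", "T's", "N's",
               "Length", "GC%"]
    fh_out.write("\t".join(columns) + "\n")

    pairs = zip(list_headers, list_seqs)
    for number, (header, seq) in enumerate(pairs, start=1):
        accession = _get_ncbi_accession(header)
        a_nt = _get_num_nucleotides("A", seq)
        g_nt = _get_num_nucleotides("G", seq)
        c_nt = _get_num_nucleotides("C", seq)
        t_nt = _get_num_nucleotides("T", seq)
        n_nt = _get_num_nucleotides("N", seq)
        length = len(seq)
        # Assumption: GC% uses the full length; guard against empty input.
        gc_percent = 100 * (g_nt + c_nt) / length if length else 0.0
        row = [number, accession, a_nt, g_nt, c_nt, t_nt, n_nt, length,
               f"{gc_percent:.1f}"]
        fh_out.write("\t".join(str(value) for value in row) + "\n")
```

Any object with a write method works as fh_out, which makes the function easy to test with io.StringIO instead of a real file.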

3main. Implement test scripts for your programs. You should name these test_nt_fasta_stats.py and test_secondary_structure_splitter.py. You must get coverage (the Cover column) up to 20-30% to receive points for this component.

Here is an example of how to test for OSError. It checks that your get_filehandle raises an OSError when the file cannot be opened.

    import pytest
    from nt_fasta_stats import get_filehandle

    def test_get_filehandle_4_OSError():
        # does it raise OSError
        # opening a missing file for reading should fail
        with pytest.raises(OSError):
            get_filehandle("does_not_exist.txt", "r")

Remember, you're just testing that the functions you wrote do what they are expected to do. This will increase your coverage. Also create a .coveragerc file in your assignment3 directory (make sure to update the path to match your Python installation):

[run]
omit =
    test_*
    /usr/local/lib/python3.*/*

To see the HTML report, run:

$ pytest --cov-branch --cov-report html --cov --cov-config=.coveragerc # see assignment 2's solution for more on testing

Please make sure to review the HTML coverage! It is very helpful in showing you where your code has and has not been tested. Review the lecture 06 code review video, and see solution #2's coverage HTML for an example.

To skip the HTML report, e.g. when on Defiance or developing locally (note the TOTAL coverage of 81%):

$ pytest --cov-branch --cov --cov-config=.coveragerc # see assignment 2 solutions for more on testing