Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 22, 2024

Working with FASTA data Modules you can use: sys (Links to an external site.) collections (Links to an external site.) os (Links to an external

Working with FASTA data Modules you can use: sys (Links to an external site.) collections (Links to an external site.) os (Links to an external site.) re (Links to an external site.) argparse (Links to an external site.) What your programs MUST do: You must adhere to the programming specification for this assignment in order to receive full credit. Also you must use command line options for these programs (using argparse (Links to an external site.)), so pay particular attention to the example usage I have provided. See the solution to assignment 2 and lecture 06 for more information on how to implement command line options using argparse (Links to an external site.) (You must use argparse (Links to an external site.)). Command line options will automatically put checks into place, e.g., when an option should be an integer (type=int), it will check that automatically check this for you:

import argparse parser = argparse.ArgumentParser(description='Find Descriptive Statistics for a column in the file') parser.add_argument('-c','--column', dest='column_to_parse', type=int, help='Column to parse in the file to open', required=True ) The data file for this program can be found here (Links to an external site.) (Note this file has been gzipped - Right-click, "Save Link As" - and make sure to gunzip before using - Do this for all .gz files in this assignment).

Suppose we had a company identify all of the amino acids for a group of unknown proteins using Mass Spectrometry (De novo peptide sequencing for mass spectrometry is typically performed without prior knowledge of the amino acid sequence. It is the process of assigning amino acids from peptide fragment masses of a protein) and we also had them computationally determine the secondary structure of those peptide chains.

We had asked them to send us two files. One fasta file with the protein sequences, and the other with the corresponding secondary structure data. But as it turns out they sent us one file, Arghhhhhh!! You are to write a program (secondary_structure_splitter.py) which will open this file and generate two files. One with the corresponding protein sequence (pdb_protein.fasta), and the other with the corresponding secondary structures (pdb_ss.fasta). Make sure to keep the white spaces intact in pdb_ss.fasta because the position corresponds to the amino acid in pdb_protein.fasta. At the end of the program tell the user how many sequences were found for each of the output files, by printing this out to STDERR.

Here's a contrived example of what two sequences look like: Gaps in the secondary structure just mean there is no secondary structure annotation.

>102M:A:sequence GVLSEGEWQLVLHVWAKVEADVAGHGQDIMIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGA MLKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL TYQG >101M:A:secstr THHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHHHHHHH HHTTHT HHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHHHHHHT T

pseudocode for your program

get file handle to sequences in the fasta file (use the get_filehandle function, see below).

open two outfiles named: pdb_protein.fasta and pdb_ss.fasta (use the get_filehandle function, see below).

loop over fasta filehandle and store the data in two lists, one for the header line and the other for the sequence data (use the get_fasta_lists function, see below). Remember the solution to the flow chart - this might help.

process the two lists, go over the header list, and if it matches the amino acid sequence print it to pdb_protein.fasta, otherwise print it to pdb_ss.fasta. Remember what FASTA format looks like (see here (Links to an external site.) - note you do not have to follow the recommendation that all lines of text be shorter than 80 characters in length)

You can hard-code pdb_protein.fasta and pdb_ss.fasta file names into your secondary_structure_splitter.py (make sure to use a relative path, and not an absolute path)

Examples of the program being run (Follow the same output): $ python3 secondary_structure_splitter.py --infile ss.txt Found XX protein sequences Found XX ss sequences

# Comment only Note: the XX above in the output should be the actual number of sequences found

$ python3 secondary_structure_splitter.py -h usage: secondary_structure_splitter.py [-h] -i INFILE Provide a FASTA file to perform splitting on sequence and secondary structure

optional arguments: -h, --help show this help message and exit -i INFILE, --infile INFILE Path to file to open $ python3 secondary_structure_splitter.py usage: secondary_structure_splitter.py [-h] -i INFILE secondary_structure_splitter.py: error: the following arguments are required: -i/--infile

$ python3 secondary_structure_splitter.py --infile ss_designed2Fail.txt # This file can be downloaded below Header and Sequence lists size are different in size Did you provide a FASTA formatted file?

Use the following file (Links to an external site.) to test if the program exists correctly. Note, there should be no output files created, see assignment 2's solution for more help on unit testing. You must implement the following functions. Name the functions exactly as instructed below, and provide the same arguments and call them in the same context as instructed. Failure to do so will result in points being deducted. You can use default or non-defualt variables. See lecture 06:

Write a function (call it get_filehandle) that receives two arguments: 1). A file name 2). How to open a file for reading or writing ("r", or "w") in Python. The purpose of this function is to open the file name passed in, and passes back a a file object, aka the file handle or handle. You can call this function like this:

fh_in = get_filehandle(file_2_open, "r") or fh_out = get_filehandle(file_2_write, "w")

When using open(), make sure to use try .. except .. except if the open was not successful the program should raise the right Exception(it should raise an OSError (Links to an external site.) for when the file cannot be opened, and raise a ValueError (Links to an external site.) when the wrong argument was passed for the opening mode, e.g. "rrr" instead of "r"). We will test for things like a file that does not exist for opening, or the wrong open mode, e.g. mode='rrr'. See open (Links to an external site.) for more information. All opening and closing of files in your program should use the get_filehandle function. Failure to do so will loose points. Make sure to close your file handle.

Write a function (call it get_fasta_lists) that receives one argument: 1). A file handle to the fasta file used in this program. The function will return two lists. One lists for the sequences in the file and one list for the headers to the sequences in the file. There should be a one-to-one correspondence to the data in the lists. Meaning element 1 of the header list should correspond to element 1 of the sequence list. If implemented correctly, you can call this function like this:

# send off the filehandle # get back data in lists list_headers, list_seqs = get_fasta_lists(fh_in)

This function should sys.exit("The size of the sequence list...") if it could not successfully get two lists of equal size (see _verify_lists below). The function _verify_lists does the actual exiting, but that will get called to by the get_fasta_lists function. This function can be tested by a unit test.

Make sure "newline " characters have been removed from your sequence data. Note, remove newline characters, not spaces, since the secondary structure string might contain spaces!

Write a function (call it _verify_lists [Note the "_" in front of the name of the function]) that receives two arguments: 1). The header list found in step 2 directly above. 2). The sequence list found in step 2 directly above. This is a helper function that will be called in the get_fasta_lists function (which is why it starts with a "_" in front of the name, since it's not really to be called in the main part of your program, i.e the Single Pre Underscore is only meant to use for the internal use). If the sizes of the lists passed into this function are not the same, it should exit, (telling the user why it exited), else it returns (return True).

If implemented correctly, you can call this function like this:

# check to make sure data looks good. Note, no return value, so no assignment _verify_lists (list_headers, list_seqs)

In your second program (nt_fasta_stats.py) you are to open the following file (Links to an external site.) (Note this file has been gzipped, Right-click, "Save Link As" - and make sure to gunzip before using).). Store the data in two lists like we did before in step 1. Now for each sequence I would like to know the number of A's, T's, G's, C's, and any other type of nucleotide, call all of these N's. I'd also like to know the the length of the sequence and also the %GC content of the entire sequence. pseudocode for your program

get file handle to sequences in the fasta file (use get_filehandle function- the name of the output file will come from a command line option)

open one outfile (use get_filehandle function- the name of the output file will come from a command line option)

loop over fasta filehandle and store the data in two lists, one for the header line and the other for the sequence data (use get_fasta_lists function, see below).

process the lists, and determine the necessary output seen below (use output_seq_statistics function, see below).

Examples of the program being run:

$ python3 nt_fasta_stats.py --infile influenza.fasta --outfile influenza.stats.txt

$ python3 nt_fasta_stats.py -h usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE

Provide a FASTA file to generate nucleotide statistics

optional arguments: -h, --help show this help message and exit -i INFILE, --infile INFILE Path to file to open -o OUTFILE, --outfile OUTFILE Path to file to write $ python3 nt_fasta_stats.py usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE nt_fasta_stats.py: error: the following arguments are required: -i/--infile, -o/--outfile

$ python3 nt_fasta_stats.py --infile influenza.fasta usage: nt_fasta_stats.py [-h] -i INFILE -o OUTFILE nt_fasta_stats.py: error: the following arguments are required: -o/--outfile

You must implement the following functions. Name the functions exactly as instructed below, and provide the same arguments and call them in the same context as instructed. Failure to do so will result in points being deducted. You can use default or non-defualt variables. See lecture 06:

Write a function (call it get_filehandle, and implement it as the same as above)

Write a function (call it get_fasta_lists, and implement it as the same as above)

Write a function (call it _verify_lists, and implement it as the same as above)

Write a function (call it output_seq_statistics) that receives three arguments. 1). The header list found in step 2 directly above. 2). The sequence list found in step 2 directly above. 3). The output filehandle the stats will be written too. This is the main function of this program, since it will print the top line of the output (see below), and each sequence's numerical values. It will call two helper functions (_get_ncbi_accession and _get_num_nucleotides - see below) that will be called for each sequence prior to printing the data for each sequence out. I can call this function like this:

# send of the list # process the sequences and print out the data output_seq_statistics(list_headers, list_seqs, fh_out)

Write a function (call it _get_num_nucleotides) that receives two arguments. 1). The character to find the occurrence of in the dna sequence. 2). The sequence data for that entry (string). I can call this function like this:

a_nt = _get_num_nucleotides('A', seq) c_nt = _get_num_nucleotides('C', seq) ...

The function should only take A, G, C, T, or N. If any other character is given other than that set, it should sys.exit("Did not code this condition") the program.

Example, if I called this function like : _get_num_nucleotides('Y', seq) the function would sys.exit("Did not code this condition")

Write a function (call it _get_ncbi_accession) that receives one argument. 1). A string that is the header to the sequence. And returns the accession number. I can call this function like this:

accession_string = _get_ncbi_accession(header_string)

The tab delimited output file named by the command line option should look like this (1 decimal point):

Number Accession A's G's C's T's N's Length GC% 1 EU521893 20 20 20 20 0 80 50.0 The numbers above are not accurate, and the Number column is just incremented with each new sequence. The 1st Header line in the sequence input file, should have EU521893 in the output file. Also the white space above represents a single tab between each value, this way you can easily open it in excel!

If you implemented these programs correctly, you should end up with a very short main (4-5 lines of code) - This does not include the command line options, checking the options, closing the filehandles, or comments)

Implement test scripts with your programs. You should name these test_nt_fasta_stats.py and test_secondary_structure_splitter.py. You must get coverage up to 20-30% coverage (Cover) to receive points for this component

Here is an example on how to test for OSError, this will test if your get_filehandle works correctly when it raises an OSError import pytest def test_get_filehandle_4_OSError(): # does it raise OSError # this should exit with pytest.raises(OSError): get_filehandle("does_not_exist.txt", "r") Remember, you're just testing the functions you wrote do what they are expected. This will increase your coverage.

Also create a .coveragerc file in your assignment3 directory (Make sure to update what you have for your path to Python) [run] omit = test_* /usr/local/lib/python3.*/* To see the html report do $ pytest --cov-branch --cov-report html --cov --cov-config=.coveragerc # see assignment 2's solution for more on testing,

Please make sure to review the HTML coverage! This is very helpful in showing you where your code has and has not been tested. Review the lecture 06 code review video, and see solution #2 coverage html for an example.

To not see the html e.g. when on Defiance or developing locally: (Note the TOTAL coverage of 81%): $ pytest --cov-branch --cov --cov-config=.coveragerc # see assignment 2 solutions for more on testing

================================================= test session starts ================================================== platform darwin -- Python 3.8.5, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 rootdir: /Users/cleslin/Documents/teachingCode/BINF6200/assignment3 plugins: cov-3.0.0 collected 18 items

tests/unit/test_nt_fasta_stats.py .......... [ 55%] tests/unit/test_secondary_structure_splitter.py ........ [100%]

---------- coverage: platform darwin, python 3.8.5-final-0 ----------- Name Stmts Miss Branch BrPart Cover ------------------------------------------------------------------- nt_fasta_stats.py 76 13 22 2 85% secondary_structure_splitter.py 66 19 20 3 74% ------------------------------------------------------------------- TOTAL 142 32 42 5 80%

================================================== 18 passed in 0.18s ==================================================

If when you run pytest you see other files be calculated in your coverage (see below) ====================================================== test session starts ====================================================== platform darwin -- Python 3.9.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 rootdir: /Users/cleslin/Documents/teachingCode/BINF6200/assignment3 plugins: anyio-3.3.4, cov-3.0.0 collected 18 items

tests/unit/test_nt_fasta_stats.py .......... [ 55%] tests/unit/test_secondary_structure_splitter.py ........ [100%]

---------- coverage: platform darwin, python 3.9.10-final-0 ---------- Name Stmts Miss Branch BrPart Cover ------------------------------------------------------------------------------------------------------------------- nt_fasta_stats.py 76 13 22 2 85% secondary_structure_splitter.py 66 19 20 3 74% /usr/local/lib/python3.9/site-packages/_distutils_hack/__init__.py 96 91 26 0 7% /usr/local/lib/python3.9/site-packages/_pytest/_argcomplete.py 37 36 12 0 2% /usr/local/lib/python3.9/site-packages/_pytest/_code/code.py 699 680 230 3 2% /usr/local/lib/python3.9/site-packages/_pytest/_code/source.py 142 140 57 0 1% /usr/local/lib/python3.9/site-packages/_pytest/_io/terminalwriter.py 113 65 52 13 47% /usr/local/lib/python3.9/site-packages/_pytest/_io/wcwidth.py 25 16 14 2 33% /usr/local/lib/python3.9/site-packages/_pytest/assertion/__init__.py 84 73 30 3 12% /usr/local/lib/python3.9/site-packages/_pytest/assertion/rewrite.py 626 542 214 22 14% /usr/local/lib/python3.9/site-packages/_pytest/cacheprovider.py 314 266 110 10 15% /usr/local/lib/python3.9/site-packages/_pytest/capture.py 560 480 136 18 14% /usr/local/lib/python3.9/site-packages/_pytest/compat.py 162 129 48 7 21% /usr/local/lib/python3.9/site-packages/_pytest/config/__init__.py 806 691 316 28 15% /usr/local/lib/python3.9/site-packages/_pytest/config/argparsing.py 253 208 108 3 17% /usr/local/lib/python3.9/site-packages/_pytest/debugging.py 228 219 64 2 4% /usr/local/lib/python3.9/site-packages/_pytest/deprecated.py 16 15 2 1 11% /usr/local/lib/python3.9/site-packages/_pytest/doctest.py 351 341 112 4 4% /usr/local/lib/python3.9/site-packages/_pytest/faulthandler.py 64 48 8 3 26% reduced on purpose.... ------------------------------------------------------------------------------------------------------------------- TOTAL 14563 12035 5158 469 17%

====================================================== 18 passed in 1.64s =======================================================