Question

Posted on Sep 22, 2024

Working with FASTA files, create four Python scripts named nt_fasta_stats.py, secondary_structure_splitter.py, test_nt_fasta_stats.py, and test_secondary_structure_splitter.py, plus a .coveragerc file and a README.md.

Modules you can use: sys, collections, os, re, argparse

What your programs MUST do: You must adhere to the programming specification for this assignment in order to receive full credit. You must also use command line options for these programs (using argparse), so pay particular attention to the example usage provided. See the solution to assignment 2 and lecture 06 for more information on how to implement command line options using argparse. Command line options automatically put checks into place; for example, when an option should be an integer (type=int), argparse checks this for you (example from solution02):

import argparse

def get_cli_args():
    """
    Just get the command line options using argparse
    @return: Instance of argparse arguments
    """
    parser = argparse.ArgumentParser(description='Find Descriptive Statistics for a column in the fh_in')
    parser.add_argument('-c',
                        '--column',
                        dest='column_to_parse',
                        type=int,
                        help='Column to parse in the fh_in to open',
                        required=True)
    # Parse sys.argv and hand back the populated namespace
    return parser.parse_args()
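To see the automatic type check in action, here is a minimal sketch. The helper name build_parser is hypothetical, and the argument lists are passed in explicitly rather than read from the command line so the snippet can be run anywhere.

```python
import argparse


def build_parser():
    """Build a parser with one required integer option (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='Find Descriptive Statistics for a column in the fh_in')
    parser.add_argument('-c', '--column',
                        dest='column_to_parse',
                        type=int,
                        help='Column to parse in the fh_in to open',
                        required=True)
    return parser


# A valid integer string is converted to an int for us.
args = build_parser().parse_args(['-c', '3'])
print(type(args.column_to_parse).__name__, args.column_to_parse)

# A non-integer value makes argparse print a usage error to STDERR
# and raise SystemExit, which is caught here only for demonstration.
try:
    build_parser().parse_args(['-c', 'three'])
except SystemExit as err:
    print('argparse exited with status', err.code)
```

In a real script the SystemExit would not be caught; argparse would simply stop the program after printing the error.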

The data file for this program can be found here (link to an external site). Note that this file has been gzipped: right-click, choose "Save Link As", and make sure to gunzip it before use. Do this for all .gz files in this assignment.

Suppose we had a company identify all of the amino acids for a group of unknown proteins using mass spectrometry, and we also had them computationally determine the secondary structure of those peptide chains. (De novo peptide sequencing for mass spectrometry is typically performed without prior knowledge of the amino acid sequence; it is the process of assigning amino acids from the peptide fragment masses of a protein.) We had asked them to send us two files: one FASTA file with the protein sequences, and another with the corresponding secondary structure data. But as it turns out they sent us one file, Arghhhhhh!!

You are to write a program (secondary_structure_splitter.py) which will open this file and generate two files: one with the protein sequences (pdb_protein.fasta), and the other with the corresponding secondary structures (pdb_ss.fasta). Make sure to keep the white space intact in pdb_ss.fasta, because each position corresponds to an amino acid in pdb_protein.fasta. At the end of the program, tell the user how many sequences were found for each of the output files by printing this to STDERR.

Gaps in the secondary structure just mean there is no secondary structure annotation. Here's a contrived example of what two sequences look like:

>102M:A:sequence
GVLSEGEWQLVLHVWAKVEADVAGHGQDIMIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGA
MLKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
TYQG
>101M:A:secstr
THHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHHHHHHH
HHTTHT HHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHHHHHHT

Pseudocode for your program:

1. Get a file handle to the sequences in the FASTA file (use the get_filehandle function, see below).

2. Open two output files named pdb_protein.fasta and pdb_ss.fasta (use the get_filehandle function, see below).

3. Loop over the FASTA file handle and store the data in two lists, one for the header lines and the other for the sequence data (use the get_fasta_lists function, see below). Remember the solution to the flow chart; this might help.

4. Process the two lists: go over the header list, and if a header matches the amino acid sequence, print that record to pdb_protein.fasta; otherwise print it to pdb_ss.fasta. Remember what FASTA format looks like (note that you do not have to follow the recommendation that all lines of text be shorter than 80 characters in length).

5. You can hard-code the pdb_protein.fasta and pdb_ss.fasta file names into your secondary_structure_splitter.py (make sure to use a relative path, not an absolute path).
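Step 4 of the pseudocode could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name write_split_records is hypothetical, headers are assumed to be stored with their leading ">" and to end in "sequence" for protein records, and io.StringIO objects stand in for the real output files so the demo does not touch the disk (io is not on the allowed-modules list and is used for the demo only).

```python
import io
import sys


def write_split_records(list_headers, list_seqs, fh_protein, fh_ss):
    """Write each record to one of two handles and return both counts."""
    protein_count = 0
    ss_count = 0
    for header, seq in zip(list_headers, list_seqs):
        # Assumption: protein headers end in "sequence" (e.g. ">101M:A:sequence");
        # every other record is treated as secondary structure.
        if header.endswith('sequence'):
            fh_protein.write(f'{header}\n{seq}\n')
            protein_count += 1
        else:
            fh_ss.write(f'{header}\n{seq}\n')
            ss_count += 1
    return protein_count, ss_count


# Demo with made-up, shortened records; StringIO stands in for real files.
headers = ['>101M:A:sequence', '>101M:A:secstr']
seqs = ['MVLSEGEW', '  HHHHH ']
protein_out, ss_out = io.StringIO(), io.StringIO()
n_protein, n_ss = write_split_records(headers, seqs, protein_out, ss_out)

# The counts go to STDERR, as the specification asks.
print(f'Found {n_protein} protein sequences', file=sys.stderr)
print(f'Found {n_ss} ss sequences', file=sys.stderr)
```

Writing the sequence string unchanged is what keeps the spaces in the secondary structure aligned with the amino acid positions.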

Examples of the program being run (Follow the same output):

1. $ python3 secondary_structure_splitter.py --infile ss.txt
    Found XX protein sequences
    Found XX ss sequences

2. $ python3 secondary_structure_splitter.py -h
    usage: secondary_structure_splitter.py [-h] -i INFILE

    Provide a FASTA file to perform splitting on sequence and secondary structure

    optional arguments:
      -h, --help            show this help message and exit
      -i INFILE, --infile INFILE
                            Path to file to open

3. $ python3 secondary_structure_splitter.py
    usage: secondary_structure_splitter.py [-h] -i INFILE
    secondary_structure_splitter.py: error: the following arguments are required: -i/--infile

4. $ python3 secondary_structure_splitter.py --infile ss_designed2Fail.txt
    Header and Sequence lists size are different in size
    Did you provide a FASTA formatted file?
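The help and error output in examples 2 and 3 is consistent with an argparse set-up along the following lines. This is a sketch, not the course solution: the description and help strings are copied from the example output, while the optional argv parameter is an assumption added here so the function can be exercised without a real command line.

```python
import argparse


def get_cli_args(argv=None):
    """
    Get the command line options using argparse.

    argv is a hypothetical convenience parameter: when it is None,
    argparse falls back to sys.argv[1:] as usual.
    """
    parser = argparse.ArgumentParser(
        description='Provide a FASTA file to perform splitting on sequence '
                    'and secondary structure')
    parser.add_argument('-i', '--infile',
                        dest='infile',
                        type=str,
                        help='Path to file to open',
                        required=True)
    return parser.parse_args(argv)


# Both the long and the short form of the option populate args.infile.
print(get_cli_args(['--infile', 'ss.txt']).infile)
print(get_cli_args(['-i', 'ss.txt']).infile)
```

Note that from Python 3.10 onward argparse titles the help section "options:" instead of "optional arguments:", so the exact help text depends on the interpreter version.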

You must implement the following functions. Name the functions exactly as instructed below, provide the same arguments, and call them in the same context as instructed. Failure to do so will result in points being deducted. You can use default or non-default variables.

1. Write a function (call it get_filehandle) that receives two arguments: 1) a file name, and 2) how to open the file, for reading or writing ("r" or "w") in Python. The purpose of this function is to open the file name passed in and pass back a file object, also known as the file handle. You can call this function like this:

    fh_in = get_filehandle(file_to_open, "r")

or

    fh_out = get_filehandle(file_to_write, "w")

When using open(), make sure to use try .. except .. except: if the open was not successful, the program should raise the right exception. It should raise an OSError when the file cannot be opened, and a ValueError when the wrong argument was passed for the opening mode, e.g. "rrr" instead of "r". We will test for things like a file that does not exist for opening, or the wrong open mode, e.g. mode='rrr'. See the documentation for open() for more information. All opening and closing of files in your program should use the get_filehandle function. Failure to do so will lose points. Make sure to close your file handles.
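One way get_filehandle might look is sketched below. The wording of the error messages and the demo file name are assumptions made for illustration; the specification only fixes the function name, its two arguments, and the exception types.

```python
import sys


def get_filehandle(file_name, mode):
    """Open file_name with the given mode and return the file handle."""
    try:
        return open(file_name, mode)
    except OSError as err:
        # Raised, for example, when a file opened for reading does not exist.
        print(f'Could not open the file: {file_name} ({err})', file=sys.stderr)
        raise
    except ValueError as err:
        # Raised by open() for an invalid mode string such as "rrr".
        print(f'Wrong open mode {mode!r} for file: {file_name} ({err})',
              file=sys.stderr)
        raise


# Demo: both failure cases, assuming this file name does not exist locally.
for name, mode in [('does_not_exist_demo.fasta', 'r'),
                   ('does_not_exist_demo.fasta', 'rrr')]:
    try:
        get_filehandle(name, mode)
    except (OSError, ValueError) as err:
        print(type(err).__name__)
```

The bare raise re-raises the original exception after the message is printed, so a unit test can still check for OSError or ValueError. A missing file surfaces as FileNotFoundError, which is a subclass of OSError.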

2. Write a function (call it get_fasta_lists) that receives one argument: a file handle to the FASTA file used in this program. The function will return two lists: one list for the headers of the sequences in the file, and one list for the sequences themselves. There should be a one-to-one correspondence between the data in the lists, meaning element 1 of the header list should correspond to element 1 of the sequence list. If implemented correctly, you can call this function like this:

    # send off the filehandle
    # get back data in lists
    list_headers, list_seqs = get_fasta_lists(fh_in)

This function should exit if it could not successfully get two lists of equal size (see _verify_lists below). The function _verify_lists does the actual exiting and printing of the message, but it is called by the get_fasta_lists function. This function can be tested by a unit test. Make sure newline characters have been removed from your sequence data. Note: remove newline characters, not spaces, since the secondary structure string might contain spaces!

3. Write a function (call it _verify_lists; note the "_" in front of the name of the function) that receives two arguments: 1) the header list found in step 2 directly above, and 2) the sequence list found in step 2 directly above. This is a helper function that will be called in the get_fasta_lists function, which is why its name starts with a "_": it is not meant to be called in the main part of your program, and a single leading underscore marks a function as intended for internal use only. If the sizes of the lists passed into this function are not the same, it should exit, telling the user why it exited (see the output when running: python3 secondary_structure_splitter.py --infile ss_designed2Fail.txt); otherwise it returns True. If implemented correctly, you can call this function like this:

    # check to make sure data looks good. Note, the return value is not needed, so no assignment
    _verify_lists(list_headers, list_seqs)