Question

1 Approved Answer

Posted on Aug 04, 2024


python

http://www.cse.msu.edu/~cse231/Online/Projects/Project06/


C.elegans.gff file link: http://www.cse.msu.edu/~cse231/Online/Projects/Project06/C.elegans.gff

C.elegans_small.gff file link: http://www.cse.msu.edu/~cse231/Online/Projects/Project06/C.elegans_small.gff

CSE 231 Spring 2018
Programming Project 06

Edit 2/21: removed one line from Function Test 2 to improve clarity.

This assignment is worth 45 points and must be completed and turned in before 11:59 on Monday, February 26, 2018.

Assignment Overview

This assignment will give you more experience on the use of:
1. Lists and tuples
2. Functions
3. File manipulation

The goal of this project is to extract gene lengths from a gene annotation file. Given a gene annotation GFF file, you will need to extract the gene coordinates on each chromosome and calculate the average and standard deviation of gene lengths.

Assignment Background

The eukaryotic genome is composed of multiple chromosomes. On each chromosome, there are multiple genes. In bioinformatics, genome annotations can be saved in a file format called GFF. In the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/), there are many publicly available annotated organisms. These annotated genomes can be downloaded in multiple file formats, including GFF. For this project, we will focus on a relatively simple model species: Caenorhabditis elegans. This worm has a genome of six chromosomes named chrI, chrII, chrIII, chrIV, chrV, and chrX.

We provide two input files:
C.elegans_small.gff   # a small file for development
C.elegans.gff         # a real BIG data file

Project Description

a) open_file() prompts the user to enter a filename. The program will try to open a tab-separated GFF file (a text file). An error message should be shown if the file cannot be opened. This function will loop until it receives proper input and successfully opens the file. It returns a file pointer.

b) read_file(fp) receives a file pointer to the data file and reads all the gene information. For this project, we are only interested in the following columns: the chromosome name (string) is in column 0, the gene_start is in column 3, and the gene_end is in column 4. Convert number strings to int. No other values are needed for this project. If a value is missing, use 0 as the value. For each gene, save it in a tuple, (chromosome, gene_start, gene_end), and append each tuple to a list of genes. Sort the list and then return the sorted list of genes (sorting makes a canonical list for comparison testing on Mimir).

b) extract_chromosome(genes_list, chromosome) receives a list of genes (such as what was returned by the read_file() function) and a chromosome name, extracts the gene information for this chromosome, and saves it in the list chrom_gene_list. Sort and return the list (sorting makes a canonical list for comparison testing on Mimir).

c) extract_genome(genes_list) receives a list of genes and extracts the gene information for each chromosome. In this function, use extract_chromosome(genes_list, chromosome) to extract
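
As a starting point, the first three functions described above (open_file, read_file, and extract_chromosome) could be sketched roughly as follows. This is only one possible reading of the specification, not the course's reference solution. The prompt and error-message wording, the to_int helper, and the choice to skip blank lines and lines starting with '#' are assumptions that do not come from the assignment text. extract_genome is left out because its description is cut off in the post.

```python
import io


def open_file():
    """Prompt for a filename until a file opens; return the file pointer."""
    while True:
        # Assumption: the exact prompt wording is not given in the post.
        filename = input("Input a file name: ")
        try:
            return open(filename, "r")
        except OSError:
            # Assumption: the exact error-message wording is not given either.
            print("Unable to open file.")


def to_int(text):
    """Hypothetical helper: convert a field to int, using 0 if it is missing."""
    text = text.strip()
    # Anything that is not a plain run of digits (empty, '.', etc.) counts as missing.
    return int(text) if text.isdigit() else 0


def read_file(fp):
    """Return a sorted list of (chromosome, gene_start, gene_end) tuples."""
    genes_list = []
    for line in fp:
        # Assumption: blank lines and '#' comment lines carry no gene data.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Pad short lines so that columns 0, 3 and 4 always exist.
        fields += [""] * (5 - len(fields))
        genes_list.append((fields[0], to_int(fields[3]), to_int(fields[4])))
    genes_list.sort()
    return genes_list


def extract_chromosome(genes_list, chromosome):
    """Return the sorted genes whose chromosome name matches exactly."""
    chrom_gene_list = [gene for gene in genes_list if gene[0] == chromosome]
    chrom_gene_list.sort()
    return chrom_gene_list


# Small in-memory example (made-up data, tab-separated like a GFF file).
sample = io.StringIO(
    "chrII\tsrc\tgene\t500\t900\n"
    "chrI\tsrc\tgene\t100\t250\n"
    "chrI\tsrc\tgene\t\t40\n"  # missing gene_start becomes 0
)
genes = read_file(sample)
print(genes)
print(extract_chromosome(genes, "chrI"))
```

Passing an io.StringIO object instead of a real file makes it easy to try read_file on a few hand-written lines before running it on C.elegans_small.gff. Comparing the chromosome name with == rather than a substring test matters here, because "chrI" is a prefix of "chrII" and "chrIII".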