Question

Posted on Sep 29, 2024

When a DNA data file contains many similar sequences, a consensus sequence is sometimes created to summarize them. A consensus sequence is obtained by choosing the most frequently occurring nucleotide at each position. Consider the following example:

Sequence 1: GATCAGCTAG

Sequence 2: AATCCGATCG

Sequence 3: AATGCGCTAG

Sequence 4: ACTCTGCGTG

Consensus: AATCCGCTAG

The nucleotides in the first column are G, A, A, A. Since A is the most frequently occurring nucleotide in position 1 (counting from left to right), that's the nucleotide that will be used in position 1 of the consensus sequence. In general, the i-th character of the consensus sequence is the most frequently occurring nucleotide in the i-th column of the sequences.
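As a quick illustration of the column-by-column rule, here is a minimal sketch that computes the consensus for the four example sequences held in memory. Using zip and collections.Counter is just one possible approach, and the variable names are illustrative; it relies on there being a single most frequent nucleotide in every column.

```python
from collections import Counter

# The four example sequences from the text above.
sequences = ["GATCAGCTAG", "AATCCGATCG", "AATGCGCTAG", "ACTCTGCGTG"]

# zip(*sequences) yields one tuple per column, e.g. ('G', 'A', 'A', 'A')
# for the first position.
consensus = "".join(
    Counter(column).most_common(1)[0][0]  # most frequent nucleotide in the column
    for column in zip(*sequences)
)

print(consensus)  # AATCCGCTAG
```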

Your task is to create a Python program that reads DNA sequences from an input file (below is a file example illustrating how the data is organized) and generates the consensus sequence. Additionally, an output file could be created that stores the nucleotide frequencies for each position, so as to help determine whether the consensus is indeed an accurate representation of the sequences.

Input file: DNA strings to be processed are to be read from a file named DNAInput.fasta. The file has the following format: description line, sequence line, description line, sequence line, and so on. Here's a sample file:

>biological_description_1

GATCAGCTAG

>biological_description_2

AATCCGATCG

>biological_description_3

AATGCGCTAG

>biological_description_4

ACTCTGCGTG and so on

Description lines always start with the > character; you may disregard these lines.

Note that a FASTA file is a plain text file whose extension is either .fa or .fasta (i.e., read the file in the same way you would read a .txt file).

Output file: You will store the consensus sequence and the frequencies of the nucleotides in a file called DNAOutput.txt. For the sample input file provided above, here's what the output file would contain:

Consensus: AATCCGCTAG

Pos 1: A:3 G:1 C:0 T:0

Pos 2: A:3 C:1 G:0 T:0

Pos 3: T:4 A:0 C:0 G:0

Pos 4: C:3 G:1 A:0 T:0

Pos 5: C:2 A:1 T:1 G:0

Pos 6: G:4 A:0 C:0 T:0

Pos 7: C:3 A:1 G:0 T:0

Pos 8: T:3 G:1 A:0 C:0

Pos 9: A:2 C:1 T:1 G:0

Pos 10: G:4 A:0 C:0 T:0

Note that the nucleotides listed for each position are in non-increasing order by frequency. In case of a tie (when two different nucleotides have the same frequency), it doesn't matter which one comes first in the output. For example, the last line in the previous example could have also been:

Pos 10: G:4 C:0 T:0 A:0
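The output layout described above could be produced along these lines. This is only a sketch: the helper names format_position and write_report are hypothetical, and it assumes the frequencies are held as one dict per position mapping each nucleotide to its count. Ties are left in A, C, G, T order, which is allowed since tie order does not matter.

```python
NUCLEOTIDES = "ACGT"


def format_position(pos, counts):
    """Return one report line, e.g. 'Pos 1: A:3 G:1 C:0 T:0'.

    counts is assumed to be a dict mapping nucleotide -> count.
    """
    # sorted() is stable, so nucleotides with equal counts keep A, C, G, T order.
    ordered = sorted(NUCLEOTIDES, key=lambda n: -counts.get(n, 0))
    fields = " ".join(f"{n}:{counts.get(n, 0)}" for n in ordered)
    return f"Pos {pos}: {fields}"


def write_report(path, consensus, freqs):
    """Write the consensus line followed by one frequency line per position."""
    with open(path, "w") as out:
        out.write(f"Consensus: {consensus}\n")
        for pos, counts in enumerate(freqs, start=1):  # positions are 1-based
            out.write(format_position(pos, counts) + "\n")
```

For instance, format_position(5, {"A": 1, "C": 2, "G": 0, "T": 1}) gives "Pos 5: C:2 A:1 T:1 G:0", matching the sample output.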

You may assume that:

Every combination of description+sequence takes up 2 lines (1 line for each).

All sequences in the file have the same length. The exact length is not initially known; you may determine it from any of the sequences.

All nucleotides are in capital letters.

There will be no characters other than A, C, T, and G in the sequences.

There will be no ties for the most highly-occurring nucleotide in any column. This means that, when determining the consensus, there will be a single nucleotide that is the highest occurring.

You may NOT assume that:

The length of the DNA sequences will be 10, as in the example.

The number of DNA sequences will be 4, as in the example.
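One way to honor these assumptions without hard-coding anything is to read the length from the data itself. The sketch below is optional and not required by the assignment; check_sequences is a hypothetical helper that takes a list of sequence strings, returns their common length, and raises ValueError if an assumption does not hold.

```python
def check_sequences(sequences):
    """Return the common sequence length, or raise ValueError if a check fails."""
    if not sequences:
        raise ValueError("no sequences found")
    length = len(sequences[0])  # taken from the data, never hard-coded
    for i, seq in enumerate(sequences, start=1):
        if len(seq) != length:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {length}")
        if set(seq) - set("ACGT"):
            raise ValueError(f"sequence {i} contains characters other than A, C, G, T")
    return length
```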

Your code (you just need to create blocks of code; there is no need to create functions, as we will review the concept of functions later):

Create a function called load_data.

It takes as argument the name of the file to be used (a string).

It returns a data structure (or more than one) that contains all of the information from the input file.

Create a function called count_nucl_freq.

It takes as argument the data structure(s) generated by load_data.

It creates a new data structure (or more than one) that contains the frequencies of the nucleotides in each column across all sequences.

Create a function called find_consensus.

It takes as argument the data structure(s) generated by count_nucl_freq.

It creates a string: the consensus sequence.
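The three functions described above might be sketched as follows. The choice of data structures is an assumption made for illustration: load_data returns a list of sequence strings, and count_nucl_freq returns a list with one dict per column. Other structures would satisfy the description equally well.

```python
def load_data(filename):
    """Return the sequence lines of a FASTA-style file as a list of strings."""
    sequences = []
    with open(filename) as infile:
        for line in infile:
            line = line.strip()
            # Skip blank lines and description lines, which start with '>'.
            if line and not line.startswith(">"):
                sequences.append(line)
    return sequences


def count_nucl_freq(sequences):
    """Return one dict per column mapping each nucleotide to its count."""
    freqs = []
    for column in zip(*sequences):  # one tuple of nucleotides per position
        freqs.append({n: column.count(n) for n in "ACGT"})
    return freqs


def find_consensus(freqs):
    """Return the consensus string: the top nucleotide of every column."""
    # Relies on the stated assumption that no column has a tie for first place.
    return "".join(max(counts, key=counts.get) for counts in freqs)
```

With the sample file, find_consensus(count_nucl_freq(load_data("DNAInput.fasta"))) would be expected to return AATCCGCTAG.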

Other Important information

Sample files are provided, but they are for testing purposes only. In other words, the sample DNAOutput.txt provided should be the result of executing your program with the sample file provided (DNAInput.fasta). Your program should be able to work with any FASTA file where all sequences are of the same length.

You should NOT prompt for the file name; you should ALWAYS try to open a file named DNAInput.fasta and your output should ALWAYS be to a file named DNAOutput.txt.