Question

Write a C program to find occurrences of a short DNA sequence ("pattern") in a longer one ("text"). The program will accept a single command-line argument: the name of a file containing the text. After reading the text, the program will read the patterns from the standard input stream (stdin). For each pattern, it will print the location of every copy of that pattern in the text. The pattern and text strings should be made up of the characters A, C, G and T. For every pattern, the program will print a match record to the standard output stream (stdout).

The text file, whose name is provided as the command-line argument, contains the sequence of the text. The text itself consists only of the characters A, C, G or T. The text file may also contain whitespace characters but those are not part of the text and should be ignored. A text will have at most 15,000 As, Cs, Gs and Ts.

Your program must check for invalid text files. A text file is invalid if:

a non-whitespace character is something other than A, C, G or T, or

the total number of A/C/G/T characters is 0 or is greater than 15,000

In either case, your program should printf("Invalid text "), and return a non-zero value from the main function.

The size of the array/string used to store the text should be set at compile time, and should equal 15,000 (or 15,001 if the code depends on there being a null terminator). Not every text will use the entire array.
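
The text-reading rules above could be sketched roughly as follows. This is only an illustration, not a reference solution: the names read_text and MAX_TEXT are made up for this example, and accepting lower-case letters in the text file is an assumption based on the case-insensitivity requirement stated later.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define MAX_TEXT 15000  /* maximum number of bases allowed by the assignment */

/* Hypothetical helper: reads bases from `in` into `text`, skipping whitespace
 * and converting letters to upper case (an assumption, see the lead-in).
 * Returns the number of bases stored, or -1 if the text is invalid. */
int read_text(FILE *in, char text[MAX_TEXT + 1])
{
    assert(in != NULL && text != NULL);
    int len = 0;
    int ch;

    while ((ch = fgetc(in)) != EOF) {
        if (isspace(ch))
            continue;                  /* whitespace is not part of the text */
        ch = toupper(ch);
        if (ch != 'A' && ch != 'C' && ch != 'G' && ch != 'T')
            return -1;                 /* illegal character */
        if (len == MAX_TEXT)
            return -1;                 /* more than 15,000 bases */
        text[len++] = (char)ch;
    }
    text[len] = '\0';
    return len > 0 ? len : -1;         /* an empty text is invalid as well */
}
```

In main, the buffer would be a local array such as char text[MAX_TEXT + 1], and a return value of -1 would lead to the "Invalid text" message and a non-zero return.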

For each pattern, the program should print a line of text to the standard output stream (stdout) containing the pattern and the offset(s) where the pattern matches the text separated by spaces as shown in the examples below. Numbering starts at 0. That is, pattern ga occurs in text gaga at offsets 0 and 2. Note also that occurrences can be overlapping; for example, pattern ata occurs in text atata at offsets 0 and 2. If there are no matches for a pattern, print "Not found" after the pattern, as shown below.
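
One possible way to produce such a line is a plain scan over every start offset, sketched below. The function name print_matches is hypothetical, and the sketch assumes both strings have already been validated and converted to upper case. Advancing the start offset by one position at a time is what lets overlapping occurrences be reported.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints `pattern`, then each offset where it occurs in
 * `text`, or "Not found" if there is none. Returns the number of matches.
 * Assumes both strings are already upper case and NUL-terminated. */
int print_matches(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    size_t n = strlen(text);
    size_t m = strlen(pattern);
    int count = 0;

    printf("%s", pattern);
    /* Try every start offset; stepping by 1 (not by m) keeps overlaps. */
    for (size_t i = 0; m > 0 && i + m <= n; i++) {
        if (memcmp(text + i, pattern, m) == 0) {
            printf(" %zu", i);
            count++;
        }
    }
    if (count == 0)
        printf(" Not found");
    printf("\n");
    return count;
}
```

This simple scan does at most about n times m character comparisons, which is small for a text of at most 15,000 characters.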

Patterns are specified on the standard input stream (stdin) separated by whitespace (any amount). The program should continue looking for patterns until it reaches end-of-input (Ctrl-D if stdin is not redirected).
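
Reading whitespace-separated patterns until end-of-input might look like the sketch below. The helper name next_token and its return convention are assumptions made for this example. It reads one character at a time so that an oversized token can be detected instead of being silently split.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: reads the next whitespace-delimited token from `in`
 * into `buf` (capacity `cap`, including the terminator).
 * Returns the token length, 0 at end of input, or -1 if the token did not
 * fit (the rest of the oversized token is still consumed). */
int next_token(FILE *in, char *buf, int cap)
{
    assert(in != NULL && buf != NULL && cap > 1);
    int ch;
    int len = 0;
    int overflow = 0;

    /* Skip any amount of leading whitespace. */
    while ((ch = fgetc(in)) != EOF && isspace(ch))
        ;
    /* Collect characters up to the next whitespace or end of input. */
    while (ch != EOF && !isspace(ch)) {
        if (len < cap - 1)
            buf[len++] = (char)ch;
        else
            overflow = 1;
        ch = fgetc(in);
    }
    buf[len] = '\0';
    return overflow ? -1 : len;
}
```

A main loop could call next_token(stdin, pattern, sizeof pattern) repeatedly and stop when it returns 0. A return of -1 can be treated as an invalid pattern, since a token that does not fit a 15,001-byte buffer is longer than any legal text.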

Your program should check for invalid patterns. A pattern is invalid if:

it contains any characters besides A, C, G, or T, or

its length is greater than the length of the text

In either case, your program should printf("Invalid pattern ") and return a non-zero value from the main function. If the invalid pattern was preceded by valid patterns, the preceding valid patterns should be handled normally.
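
The two pattern checks could be combined with the case conversion in one pass, as in this sketch. The name normalize_pattern is hypothetical, and converting the pattern to upper case in place is a design choice for the example, not something the assignment prescribes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts `pattern` to upper case in place and returns
 * 1 if it is valid for a text of `text_len` bases, 0 otherwise. */
int normalize_pattern(char *pattern, size_t text_len)
{
    assert(pattern != NULL);
    if (strlen(pattern) > text_len)
        return 0;                      /* longer than the text */
    for (char *p = pattern; *p != '\0'; p++) {
        /* Cast to unsigned char: toupper requires a non-negative value. */
        *p = (char)toupper((unsigned char)*p);
        if (*p != 'A' && *p != 'C' && *p != 'G' && *p != 'T')
            return 0;                  /* illegal character */
    }
    return 1;
}
```

Because each pattern is checked only when it is reached, valid patterns that come before an invalid one are processed normally; the first failure would then trigger the "Invalid pattern" message and a non-zero return from main.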

The program should be case insensitive, treating a, c, g and t the same as A, C, G and T.

All variables must be declared inside functions. No variables should be global or extern.

Sample Runs:

text.txt, example input file:

CATATTAC GATTACA

Sample run #1, interactive user input. User input is highlighted blue.

./programname text.txt
CAT TAC CGA GGG
CAT 0
TAC 5 11
CGA 7
GGG Not found

Sample run #2, redirected input

echo "TTA gc A ATtaC" | ./hw3 text.txt
TTA 4 10
GC Not found
A 1 3 6 9 12 14
ATTAC 3 9
