
Question


You will write a program to find occurrences of a short DNA sequence ("pattern") in a longer one ("text"). The program will accept a single command-line argument: the name of a file containing the text. After reading the text, you will read the patterns from the standard input stream (stdin). For each pattern, you will print the locations of all copies of that pattern in the text. The pattern and text strings should be made up of the characters A, C, G and T, which stand for the nucleobases adenine, cytosine, guanine and thymine. For every occurrence of the pattern in the text, your program will print a match record to the standard output stream (stdout). Work individually on this assignment.

The text file, whose name is provided as the command-line argument, contains the sequence of the text. The text itself consists only of the characters A, C, G and T. The text file may also contain whitespace characters (recall the isspace function in ctype.h), but those are not part of the text and should be ignored. A text will have at most 15,000 As, Cs, Gs and Ts.

Your program must check for invalid text files. A text file is invalid if:

- a non-whitespace character is something other than A, C, G or T, or
- the total number of A/C/G/T characters is 0 or is greater than 15,000.

In either case, your program should printf("Invalid text ") and return a non-zero value from the main function.

The size of the array/string used to store the text should be set at compile time, and should equal 15,000 (or 15,001 if your code depends on there being a null terminator). Not every text will use the entire array.

For each pattern, your program should print a line to stdout containing the pattern and the offset(s) where the pattern matches the text, separated by spaces, as shown in the examples below. Numbering starts at 0. That is, pattern ga occurs in text gaga at offsets 0 and 2. Note also that occurrences can overlap; for example, pattern ata occurs in text atata at offsets 0 and 2. If there are no matches for a pattern, print "Not found" after the pattern, as shown below.

Patterns are specified on stdin, separated by any amount of whitespace. The program should continue looking for patterns until it reaches end-of-input (ctrl-D if stdin is not redirected).

Your program should check for invalid patterns. A pattern is invalid if:

- it contains any characters besides A, C, G or T, or
- its length is greater than the length of the text.

In either case, your program should printf("Invalid pattern ") and return a non-zero value from the main function. If the invalid pattern was preceded by valid patterns, the preceding valid patterns should be handled normally.

The program should be case insensitive, treating a, c, g and t the same as A, C, G and T.

All variables must be declared inside functions. No variables should be global or extern.
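The matching rules above (0-based offsets, overlapping occurrences, case insensitivity) can be sketched in C as follows. This is one possible approach, a plain left-to-right scan, and not necessarily the intended solution. The function names is_valid_dna, matches_at and find_matches are hypothetical and do not come from the assignment.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every character of s is A, C, G or T
   in either case, and 0 otherwise. */
int is_valid_dna(const char *s)
{
    for (; *s != '\0'; s++) {
        int c = toupper((unsigned char)*s);
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T')
            return 0;
    }
    return 1;
}

/* Hypothetical helper: returns 1 if the first plen characters of text and
   pattern are equal, ignoring case. */
int matches_at(const char *text, const char *pattern, size_t plen)
{
    for (size_t i = 0; i < plen; i++) {
        if (toupper((unsigned char)text[i]) !=
            toupper((unsigned char)pattern[i]))
            return 0;
    }
    return 1;
}

/* Scans text[0..tlen) for pattern, trying every start offset so that
   overlapping occurrences are reported. Stores at most max_offsets offsets
   and returns the total number of matches found. */
size_t find_matches(const char *text, size_t tlen, const char *pattern,
                    size_t *offsets, size_t max_offsets)
{
    size_t plen = strlen(pattern);
    size_t count = 0;

    if (plen == 0 || plen > tlen)
        return 0;

    /* Advance by one position after each test, never by plen, so that
       "ata" is found at both 0 and 2 in "atata". */
    for (size_t i = 0; i + plen <= tlen; i++) {
        if (matches_at(text + i, pattern, plen)) {
            if (count < max_offsets)
                offsets[count] = i;
            count++;
        }
    }
    return count;
}
```

A caller would first reject a pattern for which is_valid_dna returns 0 or whose length exceeds the text length, and would print "Not found" after the pattern when find_matches returns 0. The offsets array is assumed to be a local variable of the caller, in keeping with the rule against global variables.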

Sample text file containing:

tcggtacgaaccgcccctgctgccaaacaagcgaatagccctgtcagacccggcattagc cgttctcgggctattagccgacagcagccttcaaggtgtgagtggagctggcaaggtctc tagatgtctttcgagggcctataaagctcgacccgtcctcgtctagaacttcccgggagt tcatcctttcactatctaaggccgacatgaacgacttaatagtagagcattttgtcatcc acagacccaggtactgaccctctttacagccggactaaggtgctgcgttcgcatccgagc ttagtagctccacttggatgagacttaaccccgtacccaaattttataggcacgcgtcct aggtaatcaaagtacttagggaaaaccttcagaacgagtaaatggaattggcatgcctaa catgggtgtgtcttttaggaacagtcttagctatgacatcggccgacagttgcaaatcca ccttgggtaattcaagcaatgtagttgaagtatcacgtctcagctggatcgacattaaca tgcatgattgttctaatacgccgcggacatcggagttggacccgtttctagtgttcatag taagttcaaattgtacctatcgaggaaacttaaatagaattcaatcctcaccgcacgcaa tctcactaggctcgcatttgaatccttgacaatgcgctgagtacctgcgtacgctaaggg atctttacagccgggccgtttcagaatcgtgcagtccactgagcacccagaggacgggtc ggaggctgagtgatcagatgaccaaaagacgattaacgcacgtgaatgaaacagtacacg ttaggttatagggtgccatgtgtcaagctgtttgtttgctcctttacgttggtgtagcta agccgtcactatatgcatatacgcgtctgcaaaaagtaaactgatactgtccccggaaca tacgtgtgagacaaagggttcgttgaaacaaaagaaactcgccacaggatcatatttccc atagaaggacgctggtcacgctgtcccgtcgttggctatggatctttccttgcaaaatag gggtactgttacgttagaacgctgttctgatgcacgcacaaaagggacctcctgtgttag ccttggatgtacgccatcaggcacagtaagcctacattactcgctttcgataccttcttt acttaaactccagctcagagtgccgccgtatgtttcttgtggactttgccattgccgtcc acagctaaccaccctaacatggatagtcgatgtcggccccttgcgggcattacgcgcgtg ggcaagagcctggctctcatcaacacagtcaccaaaacgcggtaatcacggattatctgg atctgtgctaccacaggtgtacatcggcggtatttactcgactacttcgaactacttacc gggggcgttgaatagcaagcctcgctaacgcgatcccttgccaccctgagggaccagatg gcctaacgttcagggcgtccatgatgctgttttaacatcacaaggctccgttttggcagg tcagggaaagggagcgaagtgctacgttacttctgagtgaagctctactaaacaccgcca ctcagcgatattagttttttgtccatactggccatcttcgtgtcaccttgaccgtcttac tagttggctattccaaagattttgttaagtggagtcttgttcggtggccagaaggcgagt ggtagcaaagtcagtttcaagcttgaatcgccttactcggagagagggagatttaactcg cccccgttgagcgtttcgtacgctcctgcgcaatcatgactgggggcatcggcagacctg tactccattgccgaggcggttgatagccttgattgcggcctgctaggccgaattgtcgct 
tagtaggcgattgaacatagaagaccgtcctcggacgaatcgctgatgtaaggccagctc gcactcaccatctaaagctctttagcgctcaatgccttactcagcaagcgtggcctgttc ccgaagtagagtcactcggcacccgggttgctgagtttcaccaaagaagcattccagaca gagagagaatcccatctgacagttcgattattcagcagcactaaagctgcatggaccaga cgctatcatacctactctagtcgagttggcgcttaacactcaaaatcccagtgtatcttg ttcccaggtaccggctattagatccgcccatgctgatttcaccgcggatgcccatccagg ggccactaatacagtcagttctcggggtaaaacggtaagccatacccttattcatgcagc tcgctattagcaaccgtcaatcgagtgatgaataaataaacgttgttcatcagtaatact ttttgtaactattagtctttgtcctacatgagcgcatggtgaagttgtggaactatgaaa agagtagagggtgcctttccgacttggtactgtgggaaggtgcagacttgaggcccaact gtgtaccagcttcaacgcgtcgagtagcaagctcagacatacccatggatctcttttgga tgtcataacaaattggagatggaagggctggctgggtcagattaatgggttatttcgtta atgctcttcgcggaccgacctgatgcggattaggggttacatgggtagttgtgaattatc tctgagacaaaggcatcgactgcaccttgctgcacgaaacaatacaacggtgtcctgaca ggtcagtgggttggactgaaacaatggctacacggcgggtgaaggagcttgattggcgta ctaaagctcaacggcgaagccggcaggtcattcaaaatgccatcccgtcagggaaaagat tgcgtcagcgcccactcttctcccgcggaagcccggactgagaggaataatcacgaagta ctaagctagcatggaggaacggtaacattggagacgatggattgacgttagcacgttgga acccgcgaggaaactaggaatagcaggggatctctccccgtcttccaatggtcatgccag acccctaagccaactaccaccatacgctgttacgccctggccatcgtcctgttcttacat tggggaccgatatccgactcaatatttatggcgtcgagcgtaaacccaattttcatcgtt gaataaggtgatagccaagaagaactcctgccggtcaaggttgcacatatcacctatgct ctctacccggacccgtcgtctagagctcgcaatccatgggcgcgtacatccctattgaca gtcatatagtgggtagccggtagttacggagcttaatgaggggttactaaccgtctaatt aagtgcaatgccgagcggatgtacatgtccaagtaaccatgcccaatgaaagtgcggacg gatgccagctagttattgcaggtaccacatagaagtccccaattagttgctctattgcat ttttgattctgtatagggcgctttagggctctattcagccgaagagcactgaagcgggaa gcagactatttgaagagccacgcttggccttgggttccaggtccgagttctcgcataggg tctgacctcacgtgcgtaagcaaccgaattctgtcgcgttgtttggcgaataccttgctc ttctcatctagaatgaggagacggactaagggaccccgcaaacgctggcagatctcctaa actgtcagcttttataacactccgtgcgtactctgttggggcgcatgatgatagcaccct caaaaggaataggcgatacggtctctgcaattcagtactagaaggggagccgcgcctcac 
tatgatcaaaaaccagctcccatccaggagaggccctcctcgccgggggccttccgactt tgctgctaccacgggtggggaaccgcgagacgttccaatatctcggtagtcgactctgtg attacttcggagtagctccgtctgtatccgttaaagaactctagcttaaaaaggacttgc ataggcaatgaacaacttgatgagcgggaaagggtgacggacaattccacatgagtagta tacaccgcgcgggttgctatgtacttaaggcagccgcagcaatccgcaaaattttacacc cccccagattattctacgacgtgcgtcatgaggccattatatcgagtcggattctggcgc aattcgccgaggaacttagcactacccaatcttccgctcaggtcagtctggtgacgaata ctacatcacccgtaagtaaaggagtgaaagtagtcaaacataacagattgttaatatcga ctgatcattgtttcgtccggaaatcaccagtacgccaccgagatcgagcgcggtagcggc ccgagctttttcgtcgtgctcaccccctctagggccccgggaggtggtatgatcgacatg taacgagttgatacccaaatgccgggtgtgcactaagcactgcaaaccgcgggtgagagt gaggttagcagcattaggcctgtaaggccataatgaccaggcgtagcggttcggataggt ttgacttacagactaccaatagtagcagtgtctgtcagtacctcttcggttaaatgcgcg ctactattctttagttgccaattttcagtcttattatgtaattcgactgtcgcctattgg gaagcgtatgctccctgatctggatcagtatatgcgctactgcagaaaccggtcctaaaa tcgaaatgagtgtggcggtccacatagccgcagctcgaggtcgctgacagctagtcgagt gacgaagatatggtagatcgtatacttttaccatctgctacgcctgtcgaaaaaaccaga ttcaagcctacctaaccatgcgacaaacaagatgaacttgggatctcgcatttgtatgcg ggtcgtcatttttcaatccatagttgggtggtaatcgtcttcatacgagctagtgcggaa aaagcacgggcgcttctatcataaacggtgagtagaagcagtttcggttataaccgggcg ggcgtagcggtctatgctagcatggtcacgtcgatgtttatatgggcaaaaggtgtgtac ttatggcctataagcgagttattgggttcaccctcggtaagtacaaaatcaagacggcgt ctgcgaggaaaaatctccttcgacgggcatgcgtccacttgcccactgaattctaggttg tggatccgagtgaggatacacggatagctatgttgggtccagctctcagtattaccacat tctgggg
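
A file like the sample above, with bases separated by spaces and line breaks, could be loaded along the following lines. This is a minimal sketch under stated assumptions: the names read_text and MAX_TEXT are hypothetical, and storing the bases in upper case is a design choice made here to simplify later case-insensitive comparison, not something the assignment requires.

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_TEXT 15000  /* compile-time limit stated in the assignment */

/* Hypothetical helper: reads bases from an open stream into text, which
   must have room for MAX_TEXT + 1 characters. Whitespace is skipped and
   bases are stored in upper case. Returns the number of bases stored, or
   -1 if the input is invalid (a character other than A, C, G or T, no
   bases at all, or more than MAX_TEXT bases). */
long read_text(FILE *fp, char *text)
{
    long n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (isspace(c))
            continue;               /* whitespace is not part of the text */
        c = toupper(c);
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T')
            return -1;              /* invalid character */
        if (n == MAX_TEXT)
            return -1;              /* too many bases */
        text[n++] = (char)c;
    }
    if (n == 0)
        return -1;                  /* empty text */
    text[n] = '\0';
    return n;
}
```

In main, the buffer would be a local array of MAX_TEXT + 1 characters and the stream would come from fopen on the command-line argument. When read_text returns -1, the program would print the "Invalid text" message and return a non-zero value, as the assignment describes.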
