Question

1 Approved Answer

Posted on Sep 22, 2024

A sequence contains characters selected from {A,C,G,T}. A short sequence might be something like AAATGCGCGT for example. Our task is to locate a sequence within

A sequence contains characters selected from {A,C,G,T}. A short sequence might be something like AAATGCGCGT for example. Our task is to locate a sequence within another (much larger) sequence. If I search for the sequence CGCG in the above, it is a 100% match starting at location five.

I will have data files that are 1Mbyte (1,048,576 bytes) long, which is the large string. Lets call this the sequence. I will have other files that are 10,240 bytes long, which will contain the DNA we wish to locate, which we will call the subsequence. I will tell you in advance that there is no guarantee of a 100% match on the subsequence. What is the starting byte position, from 0 to the 1,038,336th byte1 that gives the highest count of matched bytes? The slow method to determine this is a nested loop: for( each starting position 0..1,038,336 ) for( each byte 0..10,239 ) if there is a match increment a counter and the largest match wins. Did I mention that this is the slow method? It is an O(N2) algorithm, which is terrible. For our purposes it will work, but real sequence searches use faster algorithms. We want to execute this in parallel.

OK so heres what I want. You must write this program in either C or C++. Divide the work into N processes. Each process will calculate 1/Nth of the work. The value for N will come from the command line, along with the name of the sequence file and the subsequence file, in that order. For example, suppose my program is named findDNA: $ ./findDNA 12 seqfile subseqfile Looking for string using 12 processes Best match is at position 123456 with 9876/10240 correct.

1 This is the 1Mbyte minus the 10,240. You can stop early for this assignment when you get to this position.

As good programmers you should be checking the command line parameters for validity and not just blowing up if things are wrong. Instead of using pthread_create() use the fork() system call. Instead of pthread_join() use waitpid(). o Be very careful about this. For example, how many processes do you think this code starts? for( i = 0; i < 4; i++ ) pid[ i ] = fork(); o Answer: The parent creates four children. Each child does i++ and since 1 < 4, they each create three grandchildren. Each of these does i++ and since 2 < 4, these new ones create two great-grandchildren. These great-grandchildren each do i++ and since 3 < 4 they each create one great-great-grandchild. Correct me if Im wrong, but thats a lot more than four. Instead of PTHREAD mutex calls, use the POSIX calls sem_wait() and sem_post() instead. The semaphore will need to be initialized using sem_init(). Instead of merely declaring a global which is accessible to all the threads, you need to read the data into shared memory prior to calling fork(). There is sample code below. The shared memory will contain the sequence and the subsequence, and also additional pieces of information: o The current longest match. o The starting position of the current longest match. o The semaphore mentioned above. Each process will start looking at byte i and increment by N, so process three out of 10, for example, will compare at 3, 13, 23, Once the comparison is done, counting the number of correct DNA nucleotides2 out of the 10,240, you will need to get access using the semaphore to the current longest match. If your match is longer, you update the longest match and the position, then unlock the semaphore. If not, you just unlock it. Regardless, you dont want to be checking and possibly updating the data simultaneously with other processes or bad things might happen. Last time, you were required to bring in the POSIX thread library by using the -pthread command line argument to the compiler. For shared memory, you need to use -lrt instead. I want the C or C++ code and a Makefile which recompiles the code.

I have begin the project but I donot know where to go from here.

#include #include #include #include #include #include #include

#define MY_SHARED_MEMORY "/7998smani"

struct DNA { sem_t lock; char seq[1048576]; char sub[10240]; int bestmatch; }; int main(int argc, char *argv[]) { FILE *fp1, *fp2; int filecmp(FILE *, FILE *); fp1 = fopen (argv[2], "r"); //printf("File Opened. "); fp2 = fopen (argv[3], "r"); filecmp(fp1, fp2); printf ("Count is: %d ", filecmp(fp1,fp2)); fclose(fp1); fclose(fp2); //printf("File closed. ");

int shm_fd = shm_open( MY_SHARED_MEMORY, O_CREAT | O_RDWR, S_IRWXU ); if ( shm_fd < 0 ) { perror( "shm_open" ); exit( 1 ); }

size_t region_size = 1024 * 1024; // 1M byte

if ( ftruncate( shm_fd, region_size ) == -1 ) { perror( "ftruncate" ); exit( 1 ); }

char *the_data = (char *) mmap( 0, region_size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0 ); if ( the_data == MAP_FAILED ) { perror( "mmap" ); exit( 1 ); }

//printf( "Ha! It worked! My shared memory is at %p ", the_data );

return( 0 );

} int filecmp(FILE *fp1, FILE *fp2) { int f1, f2, count; while(1) { if((f1= getc(fp1)) ==EOF) break; if((f2= getc(fp2)) ==EOF) break; if (f1 == f2) { count++; } } return count; }