Question

1 Approved Answer

Posted on Sep 26, 2024

IN C PROGRAMMING Sample Cumulative Question #5 Note: The following problem is excerpted from the final exam I gave in Summer 2016, and is based

image text in transcribed

IN C PROGRAMMING

Sample Cumulative Question #5 Note: The following problem is excerpted from the final exam I gave in Summer 2016, and is based on a problem I encountered in grad school when I was writing programs to process all of Wikipedia on a system that had only 2 GB of RAM (Wildcard) Suppose you have a corpus file that contains one sentence per line. Some of the sentences might be repeated at various points throughout the file, and you're interested in creating a new copy of the corpus where each unique sentence occurs exactly once (no repeats). Here's the catch: Suppose the corpus is HUGE (several gigs), so you can't fit all the sentences in memory at once, even if you were to use a trie However, you have virtually unlimited hard disk space available to you for reading and writing. 6. The ultimate question here is: How could you write a program in C that would ultimately create a new copy of the corpus where each sentence occurs exactly once? Here are a few additional notes that may or may not be useful in solving this problem: 1. You may assume that each sentence is of a reasonable length (say, no more than 256 characters) 2. If there are multiple occurrences of some sentence in the corpus, those occurrences might not be anywhere near one another. (They could be all over the place!) 3. There are significantly more sentences in the corpus that begin with the words "a," "an," "he," "this," and "these" than any other words in the file. Something to think about. 4. The corpus is at least 20 GB, and you only have 1 GB of RAM (memory) available to you. However, you have virtually unlimited hard disk space for reading and writing 5. Note that when you open a file in C, it doesn't automatically read the whole thing into memory. Even if the corpus is 20 GB, if you only have (for example) a single buffer that you read each sentence into (always overwriting the previous sentence as you go), then you're only taking up space in memory for exactly one sentence at any given time 6. Opening files on disk (for reading or writing) is a slow process. If you have toooo many calls to fopenO, your program will be redonkulously slow In solving this problem, your first priority should be runtime efficiency (so, you want to avoid looping through the entire corpus over and over and over again, and you want to avoid tons and tons and tons and tons and tons of calls to fopen0), but you must also avoid having more than 1 GB of data hanging out in memory at any given time Give an overview of the process by which you would solve this problem. If possible, clearly break it up into steps with brief explanations. If any step requires a particular algorithm or data structure described in class, say so explicitly. Also, give the big-oh runtime for each major step (with a brief justification), as well as a final, comprehensive big-oh runtime for the overall process. Be sure to give definitions for any variables you use in your big-oh runtimes. (One such variable might be the number of sentences in the corpus file. You will likely want to consider how many times each sentence gets processed.) Please just give an overview of your idea; do not write any code, and do not get bogged down in too many implementation details. Focus on high-level overviews and the algorithms and/or data structures involved. Also, looking at the big picture of your overall approach, does it resemble an algorithm or data structure from class? If so, which one? Sample Cumulative Question #5 Note: The following problem is excerpted from the final exam I gave in Summer 2016, and is based on a problem I encountered in grad school when I was writing programs to process all of Wikipedia on a system that had only 2 GB of RAM (Wildcard) Suppose you have a corpus file that contains one sentence per line. Some of the sentences might be repeated at various points throughout the file, and you're interested in creating a new copy of the corpus where each unique sentence occurs exactly once (no repeats). Here's the catch: Suppose the corpus is HUGE (several gigs), so you can't fit all the sentences in memory at once, even if you were to use a trie However, you have virtually unlimited hard disk space available to you for reading and writing. 6. The ultimate question here is: How could you write a program in C that would ultimately create a new copy of the corpus where each sentence occurs exactly once? Here are a few additional notes that may or may not be useful in solving this problem: 1. You may assume that each sentence is of a reasonable length (say, no more than 256 characters) 2. If there are multiple occurrences of some sentence in the corpus, those occurrences might not be anywhere near one another. (They could be all over the place!) 3. There are significantly more sentences in the corpus that begin with the words "a," "an," "he," "this," and "these" than any other words in the file. Something to think about. 4. The corpus is at least 20 GB, and you only have 1 GB of RAM (memory) available to you. However, you have virtually unlimited hard disk space for reading and writing 5. Note that when you open a file in C, it doesn't automatically read the whole thing into memory. Even if the corpus is 20 GB, if you only have (for example) a single buffer that you read each sentence into (always overwriting the previous sentence as you go), then you're only taking up space in memory for exactly one sentence at any given time 6. Opening files on disk (for reading or writing) is a slow process. If you have toooo many calls to fopenO, your program will be redonkulously slow In solving this problem, your first priority should be runtime efficiency (so, you want to avoid looping through the entire corpus over and over and over again, and you want to avoid tons and tons and tons and tons and tons of calls to fopen0), but you must also avoid having more than 1 GB of data hanging out in memory at any given time Give an overview of the process by which you would solve this problem. If possible, clearly break it up into steps with brief explanations. If any step requires a particular algorithm or data structure described in class, say so explicitly. Also, give the big-oh runtime for each major step (with a brief justification), as well as a final, comprehensive big-oh runtime for the overall process. Be sure to give definitions for any variables you use in your big-oh runtimes. (One such variable might be the number of sentences in the corpus file. You will likely want to consider how many times each sentence gets processed.) Please just give an overview of your idea; do not write any code, and do not get bogged down in too many implementation details. Focus on high-level overviews and the algorithms and/or data structures involved. Also, looking at the big picture of your overall approach, does it resemble an algorithm or data structure from class? If so, which one