Question

1 Approved Answer

Posted on Sep 24, 2024

N-grams on a small corpus

Pick two paragraphs from any text of your liking: a newspaper article, a textbook, a letter, an email. Each paragraph should contain at least 4 sentences and a total of at least 30 words (meant as tokens, not as types). We will call them Paragraph A and Paragraph B. You can ignore punctuation within sentences; use the full stop or the semicolon as end of sentence; assume a beginning-of-sentence symbol.

1. First consider Paragraph A as your training corpus, and Paragraph B as your testing corpus. Given the bigram probabilities you would obtain from Paragraph A, is there a sentence in Paragraph B whose bigram probability is different from zero? If there is such a sentence, compute its bigram probability according to the language model you derive from Paragraph A (you don't need to compute the full language model, only the probabilities that you need for this sentence). If there is no such sentence, explain which bigram probabilities are zero for each of the sentences in Paragraph B; then find the longest sequence of words in Paragraph B that has a bigram probability different from zero according to the language model from Paragraph A, and compute it (again, you don't need to compute the full language model).

2. Now, considering both paragraphs together as your training corpus, write a grammatical English sentence of at least 4 words that did not occur in either paragraph, and that has probability different from zero according to a unigram model, but equal to zero according to a bigram model.
(a) Compute the probability of the sentence given a unigram model (compute and include all the relevant unigram probabilities).
(b) Show which bigram probability(ies) will be zero.
(c) Compute the smoothed Laplace probabilities for all the bigram(s) that occur in the sentence, and compute the final bigram Laplace probability for the sentence.

Submit the original two paragraphs (your corpus), and answers to all the questions above. You also need to include all the probability computations you performed: not just the final number. You will have to show all the probabilities of individual unigrams and bigrams, i.e., the subset of the language model that you need to use for those specific sentences.
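
The three quantities the question asks for (unsmoothed bigram probability, unigram probability, and Laplace-smoothed bigram probability) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a worked answer. The toy corpus is made up and far shorter than the required paragraphs. The symbol "<s>", the function names, and the tokenization rule are hypothetical choices. The vocabulary size V used for smoothing counts word types but not "<s>", which is one common convention. The textbook estimates are assumed: P(w | prev) = C(prev w) / C(prev), and with Laplace smoothing (C(prev w) + 1) / (C(prev) + V).

```python
from collections import Counter

BOS = "<s>"  # assumed beginning-of-sentence symbol (hypothetical choice)


def tokenize(paragraph):
    """Split on '.' and ';', lowercase, drop other punctuation, prepend BOS.

    Naive rule: abbreviations such as 'e.g.' would be split incorrectly.
    """
    sentences = []
    for raw in paragraph.replace(";", ".").split("."):
        words = ["".join(ch for ch in w if ch.isalnum() or ch == "'")
                 for w in raw.lower().split()]
        words = [w for w in words if w]
        if words:
            sentences.append([BOS] + words)
    return sentences


def train(sentences):
    """Return (unigram counts, bigram counts) over the tokenized sentences."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return unigrams, bigrams


def bigram_mle(sentence, unigrams, bigrams):
    """Unsmoothed bigram probability: product of C(prev w) / C(prev)."""
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:  # context never seen in training
            return 0.0
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p


def unigram_prob(sentence, unigrams):
    """Unigram probability: product of C(w) / N, with BOS left out of N."""
    total = sum(c for w, c in unigrams.items() if w != BOS)
    p = 1.0
    for w in sentence[1:]:  # skip BOS
        p *= unigrams[w] / total
    return p


def bigram_laplace(sentence, unigrams, bigrams):
    """Add-one smoothed bigram probability: (C(prev w) + 1) / (C(prev) + V).

    Assumes every test word is in the training vocabulary, as part 2 requires.
    """
    vocab_size = len([w for w in unigrams if w != BOS])  # V excludes BOS
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return p


# Made-up toy corpus, standing in for Paragraph A (or A and B together).
uni, bi = train(tokenize("The cat sat. The dog sat; the cat ran."))
seen = tokenize("The cat sat.")[0]
unseen = tokenize("The dog ran.")[0]

print(bigram_mle(seen, uni, bi))        # non-zero: every bigram was observed
print(bigram_mle(unseen, uni, bi))      # zero: the bigram (dog, ran) is unseen
print(unigram_prob(unseen, uni))        # non-zero: every word was observed
print(bigram_laplace(unseen, uni, bi))  # non-zero after add-one smoothing
```

In the toy corpus, "the dog ran" behaves like the sentence requested in part 2: each word occurs in training, so the unigram probability is non-zero, but the bigram (dog, ran) never occurs, so the unsmoothed bigram probability is zero until Laplace smoothing is applied. For the written answer, the individual counts and fractions still need to be listed by hand, as the question requires.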