Question

BACKGROUND OF WORD2VEC (Natural Language Processing)

Let's have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that "a word is known by the company it keeps". Concretely, suppose we have a 'center' word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as 'outside words'. For example, in Figure 1 we see that the center word $c$ is 'banking'. Since the context window size is 2, the outside words are 'turning', 'into', 'crises', and 'as'.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an 'outside' word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.

[Figure 1: The word2vec skip-gram prediction model with window size 2. The center word at position $t$ is 'banking'; the outside context words within the window are 'turning', 'into', 'crises', and 'as', with predicted probabilities such as $P(u_{\text{turning}} \mid v_{\text{banking}})$.]

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \qquad (1)$$

Here, $u_o$ is the 'outside' vector representing outside word $o$, and $v_c$ is the 'center' vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the 'outside' vectors $u_w$, and the columns of $V$ are all of the 'center' vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c). \qquad (2)$$

Another way to view this loss is as the cross-entropy between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k$th entry in these vectors indicates the conditional probability of the $k$th word being an 'outside word' for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).
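To make equations (1) and (2) concrete, here is a minimal numpy sketch of the naive-softmax probability and loss. The function name, the toy vocabulary size, and the choice to store the outside vectors $u_w$ as rows of `U` (the text above describes them as columns) are illustrative assumptions, not part of any provided starter code.

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """Naive-softmax loss of equation (2) for a single (center, outside) pair.

    v_c : (d,)         center word vector
    o   : int          vocabulary index of the true outside word
    U   : (|Vocab|, d) matrix whose rows are the outside vectors u_w
    """
    scores = U @ v_c                     # u_w^T v_c for every word w in the vocabulary
    scores = scores - scores.max()       # shift for numerical stability; softmax is unchanged
    y_hat = np.exp(scores) / np.exp(scores).sum()   # equation (1): P(O = w | C = c) for all w
    return -np.log(y_hat[o]), y_hat      # equation (2): J = -log P(O = o | C = c)

# Toy example: a 5-word vocabulary with 3-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
loss, y_hat = naive_softmax_loss(v_c, o=2, U=U)
print(loss, y_hat.sum())                 # y_hat is a proper distribution: it sums to 1
```

Note how the denominator of equation (1) sums over the entire vocabulary; this is what makes the naive-softmax loss expensive when the vocabulary is large.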
PREVIOUS QUESTIONS' SOLUTIONS ARE GIVEN:

(e) (3 points) The sigmoid function is given by Equation 4:

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \qquad (4)$$

Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a scalar. Hint: you may want to write your answer in terms of $\sigma(x)$.

Answer: Applying the quotient rule, $f(x) = \frac{g(x)}{h(x)} \Rightarrow f'(x) = \frac{g'(x)\,h(x) - g(x)\,h'(x)}{h(x)^2}$, we have:

$$\frac{\partial \sigma(x)}{\partial x} = \frac{(e^x + 1)\,e^x - e^x \cdot e^x}{(e^x + 1)^2} = \sigma(x)\,\frac{1}{e^x + 1} = \sigma(x)\left(\frac{e^x + 1}{e^x + 1} - \frac{e^x}{e^x + 1}\right) = \sigma(x)\,\big(1 - \sigma(x)\big)$$

(f) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. For this question, assume that the $K$ negative samples are distinct. In other words, $i \neq j$ implies $w_i \neq w_j$ for $i, j \in \{1, \dots, K\}$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^\top v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^\top v_c)\big) \qquad (5)$$

for a sample $w_1, \dots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.

Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you've done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (e) to help compute the necessary gradients here.

Answer:

$$\frac{\partial J}{\partial u_o} = \big(\sigma(u_o^\top v_c) - 1\big)\, v_c$$

$$\frac{\partial J}{\partial u_k} = -\big(\sigma(-u_k^\top v_c) - 1\big)\, v_c, \quad \forall k \in [1, K]$$

$$\frac{\partial J}{\partial v_c} = \big(\sigma(u_o^\top v_c) - 1\big)\, u_o - \sum_{k=1}^{K} \big(\sigma(-u_k^\top v_c) - 1\big)\, u_k$$
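As a sanity check on the given answers to parts (e) and (f), the sketch below (again with illustrative names and a toy setup, not the assignment's starter code) implements the sigmoid, the negative-sampling loss of equation (5), and the analytic gradients stated above, then compares $\partial J / \partial v_c$ against a central finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    # Part (e): sigma(x) = 1 / (1 + e^{-x}); its derivative is sigma(x) * (1 - sigma(x)).
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    # Equation (5): -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c),
    # where row k of U_neg is the outside vector u_k of negative sample w_k.
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))

def analytic_grads(v_c, u_o, U_neg):
    # The part (f) answers: dJ/dv_c, dJ/du_o, and dJ/du_k (rows of the last matrix).
    d_u_o = (sigmoid(u_o @ v_c) - 1.0) * v_c
    d_U_neg = -np.outer(sigmoid(-U_neg @ v_c) - 1.0, v_c)
    d_v_c = (sigmoid(u_o @ v_c) - 1.0) * u_o - (sigmoid(-U_neg @ v_c) - 1.0) @ U_neg
    return d_v_c, d_u_o, d_U_neg

# Quick numerical check of part (e): sigma'(x) == sigma(x) * (1 - sigma(x)).
x = 0.7
print(np.isclose((sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6,
                 sigmoid(x) * (1.0 - sigmoid(x))))

# Finite-difference check of dJ/dv_c on a toy example with K = 3 distinct samples.
rng = np.random.default_rng(1)
d, K = 4, 3
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
U_neg = rng.normal(size=(K, d))
d_v_c, _, _ = analytic_grads(v_c, u_o, U_neg)

numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = 1e-6
    numeric[i] = (neg_sampling_loss(v_c + e, u_o, U_neg)
                  - neg_sampling_loss(v_c - e, u_o, U_neg)) / 2e-6
print(np.allclose(d_v_c, numeric, atol=1e-5))   # True if the analytic gradient matches
```

Note that evaluating this loss touches only $u_o$ and the $K$ sampled vectors rather than every word in the vocabulary, which is why negative sampling is much cheaper to compute than the naive-softmax loss.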
MY QUESTION:

(g) (2 points) Now we will repeat the previous exercise, but without the assumption that the $K$ sampled words are distinct. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. In this question, you may not assume that the words are distinct. In other words, $w_i = w_j$ may be true when $i \neq j$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^\top v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^\top v_c)\big) \qquad (6)$$

for a sample $w_1, \dots, w_K$, where $\sigma(\cdot)$ is the sigmoid function. Compute the partial derivative of $J_{\text{neg-sample}}$ with respect to a negative sample $u_k$. Please write your answer in terms of the vectors $v_c$ and $u_k$, where $k \in [1, K]$. Hint: break up the sum in the loss function into two sums: a sum over all sampled words equal to $u_k$ and a sum over all sampled words not equal to $u_k$.
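Equation (6) is the same expression as equation (5); the only change in part (g) is that a word may now appear more than once among $w_1, \dots, w_K$. Without giving the closed form away, here is a small finite-difference checker you can use to verify whatever grouped-sum gradient you derive. All names are illustrative, and the helpers from the previous sketch are redefined so this snippet runs on its own; perturbing each row of `U_neg` directly remains valid even when rows repeat.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    # Same expression as equations (5) and (6); rows of U_neg may repeat.
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))

def numeric_grad_U_neg(v_c, u_o, U_neg, eps=1e-6):
    """Central-difference estimate of dJ/dU_neg, one entry at a time.

    Each row of the result is the gradient contribution of that sampled
    position; for a word that occurs several times among the samples, the
    gradient with respect to its shared vector is the sum of the rows for
    that word, which is what the hint's grouped sum captures.
    """
    grad = np.zeros_like(U_neg)
    for idx in np.ndindex(*U_neg.shape):
        bump = np.zeros_like(U_neg)
        bump[idx] = eps
        grad[idx] = (neg_sampling_loss(v_c, u_o, U_neg + bump)
                     - neg_sampling_loss(v_c, u_o, U_neg - bump)) / (2 * eps)
    return grad

# Toy check with a deliberately repeated negative sample (rows 0 and 2 share a word).
rng = np.random.default_rng(2)
v_c, u_o = rng.normal(size=4), rng.normal(size=4)
U_neg = rng.normal(size=(3, 4))
U_neg[2] = U_neg[0]
print(numeric_grad_U_neg(v_c, u_o, U_neg))
```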
