Question
Please explain as best you can how to get the answer (with theory) and I will give you a thumbs up. Don't simply use ChatGPT.
a) Fig. Q4 shows an encoding layer of a Transformer for machine translation. The row vector $x_i$ is the word embedding of the $i$-th word in a sentence. Given $\{x_1, x_2\}$ as input, the single-head self-attention layer outputs the row vectors $\{z_1, z_2\}$. Denote $X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ as the input matrix. The output of the self-attention layer is given by

$$Z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $d_k$ is a normalizing constant and $Q = XW^Q$, $K = XW^K$, and $V = XW^V$ are called the query, key, and value matrices, respectively. The matrices $W^Q$, $W^K$, and $W^V$ are trained by the backpropagation algorithm.

[Fig. Q4: encoding layer of a Transformer (figure not reproduced)]

(i) Show that $W^Q$ and $W^K$ will become irrelevant to the self-attention layer's outputs if $W^Q (W^K)^\top$ is an identity matrix. (6 marks)

(ii) If multi-head attention is to be used, $W^Q$, $W^K$, and $W^V$ should not be identity matrices. Briefly explain the reason.

Step by Step Solution
There are 3 steps involved:
Step 1: Substitute the definitions of $Q$ and $K$ into the attention weights. Since $Q = XW^Q$ and $K = XW^K$,

$$QK^\top = XW^Q \left(XW^K\right)^\top = X\,W^Q (W^K)^\top\,X^\top.$$

The two projection matrices therefore enter the output only through the product $W^Q (W^K)^\top$.
Step 2: If $W^Q (W^K)^\top = I$, the product collapses to $QK^\top = XX^\top$, so

$$Z = \mathrm{Softmax}\!\left(\frac{XX^\top}{\sqrt{d_k}}\right) XW^V.$$

Neither $W^Q$ nor $W^K$ appears in this expression, so they are irrelevant to the layer's outputs: any pair of matrices whose product $W^Q (W^K)^\top$ equals the identity produces exactly the same $Z$. This answers part (i).
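As a sanity check, here is a minimal NumPy sketch (not part of the original question; the embedding dimension, seed, and function names are illustrative assumptions). It builds $W^K$ from an arbitrary invertible $W^Q$ so that $W^Q (W^K)^\top = I$ and confirms the output matches the $W^Q = W^K = I$ case:

```python
import numpy as np

def softmax(a, axis=-1):
    # numerically stable row-wise softmax
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, dk):
    # single-head self-attention as defined in the question
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

rng = np.random.default_rng(0)
dk = 4
X = rng.normal(size=(2, dk))      # two word embeddings x1, x2 as rows
Wv = rng.normal(size=(dk, dk))

Wq = rng.normal(size=(dk, dk))    # arbitrary (almost surely invertible) Wq
Wk = np.linalg.inv(Wq).T          # chosen so that Wq @ Wk.T == I

Z1 = self_attention(X, Wq, Wk, Wv, dk)
Z2 = self_attention(X, np.eye(dk), np.eye(dk), Wv, dk)  # Wq = Wk = I

print(np.allclose(Wq @ Wk.T, np.eye(dk)))  # True
print(np.allclose(Z1, Z2))                 # True: Wq and Wk dropped out
```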
Step 3 (part ii): In multi-head attention, each head $h$ has its own projections $W^Q_h$, $W^K_h$, $W^V_h$, so that different heads can project the embeddings into different subspaces and attend to different kinds of relationships between words. If the projections were identity matrices, every head would receive the same $Q = K = V = X$ and compute the identical output $\mathrm{Softmax}(XX^\top/\sqrt{d_k})\,X$; concatenating such heads would add no information, and multi-head attention would degenerate to a single head. The projections must therefore be learned, non-identity matrices.
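A short NumPy sketch of this degenerate case (again an illustrative assumption, not part of the original question): with identity projections every head collapses to the same fixed output, while distinct random projections give each head a different view of the input.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def head(X, Wq, Wk, Wv, dk):
    # one attention head with its own projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

rng = np.random.default_rng(1)
dk = 4
X = rng.normal(size=(2, dk))
I = np.eye(dk)

# with identity projections every head computes the same fixed output
fixed = softmax(X @ X.T / np.sqrt(dk)) @ X
print(np.allclose(head(X, I, I, I, dk), fixed))     # True

# distinct learned projections give each head a different output
Wq, Wk, Wv = (rng.normal(size=(dk, dk)) for _ in range(3))
print(np.allclose(head(X, Wq, Wk, Wv, dk), fixed))  # False
```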