
Question


Please explain as best you can how to get the answer (with theory), and I will give you a thumbs up. Don't simply use ChatGPT.


a) Fig. Q4 shows an encoding layer of a Transformer for machine translation. The row vector $x_i$ is the word embedding of the $i$-th word in a sentence. Given $\{x_1, x_2\}$ as input, the single-head self-attention layer outputs the row vectors $\{z_1, z_2\}$. Denote $X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ as the input matrix. The output of the self-attention layer is given by

$$Z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $d_k$ is a normalizing constant and $Q = X W_Q$, $K = X W_K$, and $V = X W_V$ are called the query, key, and value matrices, respectively. The matrices $W_Q$, $W_K$, and $W_V$ are trained by the backpropagation algorithm.

[Fig. Q4: the encoding layer of the Transformer (figure not reproduced)]

(i) Show that $W_Q$ and $W_K$ will become irrelevant to the self-attention layer's outputs if $W_Q W_K^{\top}$ is an identity matrix. (6 marks)

(ii) If multi-head attention is to be used, $W_Q$, $W_K$, and $W_V$ should not be identity matrices. Briefly explain the reason.
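For part (i), the key algebraic step is: substituting $Q = X W_Q$ and $K = X W_K$ into the attention scores gives $Q K^{\top} = X W_Q W_K^{\top} X^{\top}$, so when $W_Q W_K^{\top} = I$ this collapses to $X X^{\top}$, and the output $Z = \mathrm{Softmax}(X X^{\top} / \sqrt{d_k})\, X W_V$ no longer depends on $W_Q$ or $W_K$ at all. Below is a minimal NumPy sketch that checks this numerically; the dimensions, the random seed, and the trick of setting $W_K = (W_Q^{-1})^{\top}$ to force $W_Q W_K^{\top} = I$ are illustrative choices of mine, not part of the question.

```python
import numpy as np

def softmax(a, axis=-1):
    # Row-wise softmax with max-subtraction for numerical stability
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model = d_k = 4                  # illustrative sizes; W_Q must be square here

# Two word embeddings stacked as rows: X = [x1; x2]
X = rng.normal(size=(2, d_model))

# Choose W_Q freely, then set W_K = (W_Q^{-1})^T so that W_Q @ W_K.T = I
W_Q = rng.normal(size=(d_model, d_k))
W_K = np.linalg.inv(W_Q).T         # assumes W_Q is invertible
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Full self-attention output
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Same output computed without W_Q or W_K:
# Q K^T = X W_Q W_K^T X^T = X X^T when W_Q W_K^T = I
Z_no_qk = softmax(X @ X.T / np.sqrt(d_k)) @ (X @ W_V)

print(np.allclose(Z, Z_no_qk))     # True: W_Q and W_K are irrelevant here
```

For part (ii), the same picture suggests the intuition: multi-head attention is useful precisely because each head applies its own learned projections $W_Q^{(h)}$, $W_K^{(h)}$, $W_V^{(h)}$ and so attends to a different representation subspace. If they were all identity matrices, every head would compute the identical quantity $\mathrm{Softmax}(X X^{\top} / \sqrt{d_k})\, X$, and the multiple heads would add nothing over a single head.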

