Question 2: Parameter-Efficient Transfer Learning [Jiaoda](30 pts)
Consider a vanilla encoder-decoder transformer [2].
a)(1 pt) Given the vocabulary size V and embedding dimension D, compute the number of
parameters in an embedding layer (ignore positional encodings).
b)(2 pts) How many embedding layers are there in an encoder-decoder transformer architecture?
What is the total number of parameters in the embedding layers? Is it larger than your answer in
a)? Why or why not?
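As a sanity check for parts a) and b), here is a minimal sketch. It assumes, as in the original transformer [2], that the encoder input embedding, the decoder input embedding, and the pre-softmax projection all share a single V x D weight matrix (weight tying), so the tied total equals the single-layer count.

```python
def count_embedding_params(V: int, D: int, tied: bool = True) -> int:
    """Parameters in the embedding layer(s) of an encoder-decoder transformer.

    A single embedding layer is a V x D lookup table: V * D parameters.
    With weight tying, the encoder embedding, decoder embedding, and
    pre-softmax projection share one matrix, so the total stays V * D;
    without tying, three separate matrices give 3 * V * D.
    """
    single = V * D
    return single if tied else 3 * single
```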
In an encoder layer, there are two sub-layers: a multi-head self-attention mechanism and a position-wise
fully connected feed-forward network. A residual connection is deployed around each sub-layer, followed
by layer normalization.
c)(2 pts) Compute the number of parameters in a multi-head self-attention sub-layer. Write down
all the intermediate steps and assumptions you make.
d)(2 pts) Given that the dimensionality of the intermediate layer is 4D, compute the number of
parameters in a feed-forward network.
e)(1 pt) In a decoder layer, there is an additional sub-layer: multi-head encoder-decoder attention.
Compute the number of parameters in one such sub-layer.
f)(2 pts) There is an output layer made up of a linear transformation and a softmax function that
produces next-token probabilities. Does it introduce extra parameters? Why or why not?
g)(2 pts) Given that both the encoder and the decoder have L layers, compute the total number of
parameters in the transformer.
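Parts c) through g) can be cross-checked with the sketch below. It assumes bias terms on all attention projections and feed-forward layers, two learnable vectors (gain and bias) per layer normalization, and a shared embedding/output matrix; under a different bias convention the quadratic D-squared terms are unchanged and only the linear-in-D terms shift.

```python
def attention_params(D: int) -> int:
    # Four D x D projections (W_Q, W_K, W_V, W_O), each with a bias vector.
    return 4 * (D * D + D)

def ffn_params(D: int) -> int:
    # A D -> 4D linear map and a 4D -> D linear map, each with a bias.
    return (D * 4 * D + 4 * D) + (4 * D * D + D)

def layer_norm_params(D: int) -> int:
    # One gain vector and one bias vector.
    return 2 * D

def encoder_layer_params(D: int) -> int:
    # Self-attention and FFN sub-layers, each followed by a layer norm.
    return attention_params(D) + ffn_params(D) + 2 * layer_norm_params(D)

def decoder_layer_params(D: int) -> int:
    # Self-attention, encoder-decoder attention, and FFN; three layer norms.
    return 2 * attention_params(D) + ffn_params(D) + 3 * layer_norm_params(D)

def transformer_params(V: int, D: int, L: int) -> int:
    # L encoder layers + L decoder layers + one shared V x D embedding matrix,
    # which is reused as the output projection (the softmax itself adds no
    # parameters), giving L * (28 D^2 + 32 D) + V * D under these assumptions.
    return L * (encoder_layer_params(D) + decoder_layer_params(D)) + V * D
```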
Consider the adapter network described in [1].
h)(2 pts) Given the bottleneck dimension of the adapter M, compute the number of parameters in
a single adapter module.
i)(2 pts) If we insert an adapter after each sub-layer, how many adapters are inserted in an encoder-
decoder transformer described above? Compute the total number of newly added parameters.
j)(2 pts) If we perform adapter tuning on a downstream binary classification task, what components
are trained? Compute the total number of trainable parameters.
k)(2 pts) Under what condition is adapter tuning more parameter-efficient than fine-tuning?
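For parts h) through k), a hedged sketch of the adapter of [1]: a down-projection from D to the bottleneck dimension M, a parameter-free nonlinearity, and an up-projection back to D, both maps with biases. One adapter after each sub-layer gives 2L adapters in the encoder and 3L in the decoder. For binary classification, adapter tuning would additionally train a D -> 2 classification head; whether the layer norms are also retrained is a modeling choice and is left out here.

```python
def adapter_params(D: int, M: int) -> int:
    # Down-projection D -> M and up-projection M -> D, each with a bias:
    # (D*M + M) + (M*D + D) = 2*M*D + M + D parameters per adapter.
    return (D * M + M) + (M * D + D)

def total_adapter_params(D: int, M: int, L: int) -> int:
    # One adapter after every sub-layer: 2L (encoder) + 3L (decoder) = 5L.
    return 5 * L * adapter_params(D, M)

def trainable_params_binary_classification(D: int, M: int, L: int) -> int:
    # Adapters plus a D -> 2 linear classification head (weights and biases).
    return total_adapter_params(D, M, L) + (D * 2 + 2)
```

For part k), comparing the roughly 10*L*M*D adapter parameters against the roughly 28*L*D^2 parameters touched by full fine-tuning suggests adapter tuning is more parameter-efficient whenever M is much smaller than D, since the added parameters grow linearly rather than quadratically in D.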