Question

Transformer models use a hybrid approach between word-level and character-level tokenization called subword
tokenization. BPE (Byte-Pair Encoding) is a subword-level tokenization approach introduced in Neural Machine
Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that
splits the training data into words. Pre-tokenization can be as simple as space tokenization. Let us assume that
after pre-tokenization, the following set of words, including their frequencies, has been determined:
(old,10),(older,5),(oldest,8),(hug,8),(pug,4),(hugs,5)
We obtain a base vocabulary:
o,l,d,e,r,s,t,h,u,g,p
Splitting all words into symbols in the base vocabulary, we obtain:
(o,l,d,10),(o,l,d,e,r,5),(o,l,d,e,s,t,8),(h,u,g,8),(p,u,g,4),(h,u,g,s,5)
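
As a minimal sketch of this step, the base vocabulary and the symbol splits can be derived directly from the word frequencies. The names word_freqs, base_vocab and splits are illustrative choices, not part of the question; characters are collected in order of first appearance, which happens to reproduce the order listed above.

```python
# Word frequencies from the example above.
word_freqs = {"old": 10, "older": 5, "oldest": 8, "hug": 8, "pug": 4, "hugs": 5}

# Base vocabulary: every distinct character, in order of first appearance.
base_vocab = []
for word in word_freqs:
    for ch in word:
        if ch not in base_vocab:
            base_vocab.append(ch)

# Each word becomes a tuple of single-character symbols, keeping its frequency.
splits = {tuple(word): freq for word, freq in word_freqs.items()}

print(base_vocab)  # ['o', 'l', 'd', 'e', 'r', 's', 't', 'h', 'u', 'g', 'p']
print(splits[("o", "l", "d", "e", "r")])  # 5
```
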
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently.
In the above example, "o" followed by "l" is present 10+5+8=23 times. Thus, the first merge rule the tokenizer
learns is to group all "o" symbols followed by an "l" symbol together. Next, "ol" is added to the vocabulary.
The set of words then becomes:
(ol,d,10),(ol,d,e,r,5),(ol,d,e,s,t,8),(h,u,g,8),(p,u,g,4),(h,u,g,s,5)
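
The counting and the first merge can be sketched as follows. The helper names count_pairs and merge_pair are hypothetical. Note that ("l", "d") also occurs 23 times; this sketch assumes ties are broken in favour of the pair encountered first, which agrees with the choice of "ol" in the text.

```python
from collections import Counter

splits = {
    ("o", "l", "d"): 10,
    ("o", "l", "d", "e", "r"): 5,
    ("o", "l", "d", "e", "s", "t"): 8,
    ("h", "u", "g"): 8,
    ("p", "u", "g"): 4,
    ("h", "u", "g", "s"): 5,
}

def count_pairs(splits):
    """Sum word frequencies over every pair of adjacent symbols."""
    pairs = Counter()
    for symbols, freq in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(splits, pair):
    """Return new splits in which each adjacent occurrence of `pair` is fused."""
    merged = {}
    for symbols, freq in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

pairs = count_pairs(splits)
# max() returns the first maximal key, so the tie with ("l", "d") goes to ("o", "l").
best = max(pairs, key=pairs.get)
print(best, pairs[best])  # ('o', 'l') 23
print(merge_pair(splits, best))
```
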
This process runs iteratively. The vocabulary size, i.e. the base vocabulary size plus the number of merges,
is a hyperparameter to choose. The learned merge rules are then applied to new words (as long as those
new words do not include symbols that were not in the base vocabulary). A word containing such a symbol
would be represented using "[unk]". Implement this BPE tokenizer, set the vocabulary size to 16, and train the
tokenizer until the iterative process finishes. Use the trained tokenizer to tokenize the words below (15 marks):
{hold, oldest, older, pug, mug, huggingface}.
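
One possible implementation is sketched below; it is not the only valid reading of the task. It rests on two assumptions that the question does not settle: ties between equally frequent pairs go to the pair encountered first, and each individual symbol outside the base vocabulary is replaced by "[unk]" (rather than the whole word). The function names train_bpe, tokenize and merge_symbols are illustrative.

```python
from collections import Counter

UNK = "[unk]"

def merge_symbols(symbols, pair):
    """Fuse each adjacent occurrence of `pair` in a list of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size` symbols."""
    vocab = []
    for word in word_freqs:
        for ch in word:
            if ch not in vocab:
                vocab.append(ch)
    base_vocab = set(vocab)
    splits = {word: list(word) for word in word_freqs}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair encountered first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        splits = {w: merge_symbols(s, best) for w, s in splits.items()}
    return vocab, merges, base_vocab

def tokenize(word, merges, base_vocab):
    """Apply the learned merges in order; unknown characters become UNK."""
    # Assumption: only the unknown symbol is replaced, not the whole word.
    symbols = [ch if ch in base_vocab else UNK for ch in word]
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols

word_freqs = {"old": 10, "older": 5, "oldest": 8, "hug": 8, "pug": 4, "hugs": 5}
vocab, merges, base_vocab = train_bpe(word_freqs, vocab_size=16)
print(merges)
# [('o', 'l'), ('ol', 'd'), ('u', 'g'), ('old', 'e'), ('h', 'ug')]

for word in ["hold", "oldest", "older", "pug", "mug", "huggingface"]:
    print(word, tokenize(word, merges, base_vocab))
# hold ['h', 'old']
# oldest ['olde', 's', 't']
# older ['olde', 'r']
# pug ['p', 'ug']
# mug ['[unk]', 'ug']
# huggingface ['hug', 'g', '[unk]', '[unk]', 'g', '[unk]', '[unk]', '[unk]', 'e']
```

With a vocabulary size of 16 and 11 base symbols, training stops after five merges, adding "ol", "old", "ug", "olde" and "hug". The fourth merge is a tie at 13 between ("old", "e") and ("h", "ug"); since both are merged before training ends, the final vocabulary is the same whichever is taken first.
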