Question
Transformers models use a hybrid approach between word-level and character-level tokenization called subword
tokenization. BPE (Byte-Pair Encoding) is a subword-level tokenization approach introduced in Neural Machine
Translation of Rare Words with Subword Units (Sennrich et al.). BPE relies on a pre-tokenizer that
splits the training data into words. Pre-tokenization can be as simple as space tokenization. Let us assume that
after pre-tokenization, the following set of words including their frequency has been determined:
old, older, oldest, hug, pug, hugs
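The pre-tokenization step described above can be sketched as follows. This is a minimal illustration that assumes plain whitespace splitting; the `corpus` string and the counts it produces are hypothetical and are not the frequencies from the question.

```python
from collections import Counter


def pretokenize(text: str) -> Counter:
    """Split text on whitespace and count how often each word occurs."""
    return Counter(text.split())


# Hypothetical toy corpus, used only to show the shape of the result.
corpus = "old old older oldest hug hug pug hugs"
word_freqs = pretokenize(corpus)

for word, freq in word_freqs.items():
    print(word, freq)
```

The resulting mapping from word to frequency is the only input the BPE training loop needs.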
We obtain a base vocabulary: d, e, g, h, l, o, p, r, s, t, u
Splitting all words into symbols in the base vocabulary, we obtain:
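As a rough sketch of this step, the base vocabulary can be taken as the set of characters that occur in the words, and each word can be split into single-character symbols. The frequencies in `word_freqs` below are placeholders chosen for illustration, not the values from the question.

```python
# Hypothetical frequencies; only the words themselves come from the question.
word_freqs = {"old": 7, "older": 3, "oldest": 2, "hug": 4, "pug": 2, "hugs": 3}

# Base vocabulary: every distinct character seen in the training words.
base_vocab = sorted({ch for word in word_freqs for ch in word})

# Each word starts out as a list of single-character symbols.
splits = {word: list(word) for word in word_freqs}

print(base_vocab)
for word, symbols in splits.items():
    print(symbols, word_freqs[word])
```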
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently.
In the above example, "o" followed by "l" is the most frequent pair. Thus, the first merge rule the tokenizer
learns is to group all "o" symbols followed by an "l" symbol together. Next, "ol" is added to the vocabulary.
The set of words then becomes:
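One merge step might look like the sketch below: count adjacent symbol pairs weighted by word frequency, pick the most frequent pair, and rewrite every word with that pair fused. The helper names `count_pairs` and `merge_pair` and the frequencies are assumptions made for this example. Ties are broken by whichever pair was counted first, which is a choice of this sketch rather than something the question specifies.

```python
from collections import Counter


def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pairs = Counter()
    for word, symbols in splits.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += word_freqs[word]
    return pairs


def merge_pair(pair, splits):
    """Return new splits in which every occurrence of `pair` is fused."""
    a, b = pair
    merged = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[word] = out
    return merged


# Hypothetical frequencies; only the words themselves come from the question.
word_freqs = {"old": 7, "older": 3, "oldest": 2, "hug": 4, "pug": 2, "hugs": 3}
splits = {word: list(word) for word in word_freqs}

pairs = count_pairs(splits, word_freqs)
best = max(pairs, key=pairs.get)  # first pair with the highest count
splits = merge_pair(best, splits)
print(best, splits)
```

With these placeholder counts the selected pair is ("o", "l"), matching the first merge described in the question.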
This process will run iteratively. The vocabulary size, i.e. the base vocabulary size + the number of merges,
is a hyperparameter to choose. The learned merge rules would then be applied to new words, as long as those
new words do not include symbols that were not in the base vocabulary. A word not in the base vocabulary
would be represented as unk. Implement this BPE tokenizer, set the vocabulary size, and train this
BPE tokenizer to finish the iterative process. Use the trained tokenizer to tokenize the words below:
hold, oldest, older, pug, mug, huggingface
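A possible end-to-end sketch of the task is given below. Several things here are assumptions rather than details from the question: the word frequencies, the target `vocab_size` of 15, the spelling `<unk>` for the unknown token, and the reading that a word containing any symbol outside the base vocabulary is replaced as a whole by that token. The function names are hypothetical as well.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown token


def apply_merge(pair, symbols):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    a, b = pair
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(word_freqs, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size`."""
    base_vocab = sorted({ch for word in word_freqs for ch in word})
    vocab = list(base_vocab)
    splits = {word: list(word) for word in word_freqs}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += word_freqs[word]
        if not pairs:  # nothing left to merge
            break
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        vocab.append(best[0] + best[1])
        splits = {w: apply_merge(best, s) for w, s in splits.items()}
    return base_vocab, vocab, merges


def tokenize(word, base_vocab, merges):
    """Apply the learned merges in order; unknown symbols yield UNK."""
    if any(ch not in base_vocab for ch in word):
        return [UNK]
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(pair, symbols)
    return symbols


# Hypothetical frequencies and vocabulary size, for illustration only.
word_freqs = {"old": 7, "older": 3, "oldest": 2, "hug": 4, "pug": 2, "hugs": 3}
base_vocab, vocab, merges = train_bpe(word_freqs, vocab_size=15)

for word in ["hold", "oldest", "older", "pug", "mug", "huggingface"]:
    print(word, tokenize(word, base_vocab, merges))
```

The actual merges and tokenizations depend on the frequencies and vocabulary size given in the original question, so the printed results of this sketch should not be read as the answer to the exercise. Replacing only the unknown symbol instead of the whole word is an equally plausible reading and would need a small change in `tokenize`.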