Question
We want to build a tokenizer for simple expressions such as xpr = "res = 3 + x_sum*11". Such expressions comprise only three kinds of tokens, as follows: (1) integer literals: one or more digits, e.g., 3, 11; (2) identifiers: strings starting with a letter or an underscore and followed by more letters, digits, or underscores, e.g., res, x_sum; (3) operators: =, +, *. Leading or trailing whitespace characters should be skipped.
(a) Write a regular expression pattern for the above grammar and use it with re.findall to split the expression into a list of lexemes. Using xpr above, the list returned should be:
['res', '=', '3', '+', 'x_sum', '*', '11']
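One possible sketch for part (a): a pattern with one alternative per token kind (the exact character classes are my choice). With no capturing groups, re.findall returns the full matched lexemes, and characters between matches, such as whitespace, are simply skipped:

```python
import re

xpr = "res = 3 + x_sum*11"

# One alternative per token kind: integer literal, identifier, operator.
pattern = r"\d+|[A-Za-z_][A-Za-z0-9_]*|[=+*]"

print(re.findall(pattern, xpr))
# ['res', '=', '3', '+', 'x_sum', '*', '11']
```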
(b) The problem with the above is that re.findall returns a list of all lexemes but not their respective tokens (as defined by the grammar). Modify your regex pattern so that each matched lexeme is in its own (regex) group. The list returned should now be as follows:
[('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
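For part (b), one sketch is to wrap each alternative of the previous pattern in its own capturing group; when a pattern has multiple groups, re.findall returns a tuple of all groups per match, with non-matching groups as empty strings:

```python
import re

xpr = "res = 3 + x_sum*11"

# Same alternatives as before, each in its own capturing group.
pattern = r"(\d+)|([A-Za-z_][A-Za-z0-9_]*)|([=+*])"

print(re.findall(pattern, xpr))
# [('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
#  ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
```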
(c) To find which token each lexeme is associated with, we only need to find the first non-empty item in each tuple. Write a tokenize generator (using re.findall and map) that returns all pairs (tuples) of lexemes and tokens. The output of list(tokenize(xpr)) should thus be:
[('res', 'id'), ('=', 'op'), ('3', 'int'), ('+', 'op'),
('x_sum', 'id'), ('*', 'op'), ('11', 'int')]
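A sketch for part (c). The token names ('int', 'id', 'op') come from the expected output; the names TOKENS, PATTERN, and the helper label are my own. Since map returns a lazy iterator, this satisfies the list(tokenize(xpr)) usage:

```python
import re

# Token names in the same order as the capturing groups in the pattern.
TOKENS = ('int', 'id', 'op')
PATTERN = re.compile(r"(\d+)|([A-Za-z_][A-Za-z0-9_]*)|([=+*])")

def tokenize(text):
    # Pair the first non-empty group of each tuple with its token name.
    def label(groups):
        return next((lexeme, token)
                    for lexeme, token in zip(groups, TOKENS) if lexeme)
    return map(label, PATTERN.findall(text))

xpr = "res = 3 + x_sum*11"
print(list(tokenize(xpr)))
# [('res', 'id'), ('=', 'op'), ('3', 'int'), ('+', 'op'),
#  ('x_sum', 'id'), ('*', 'op'), ('11', 'int')]
```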
(d) The above solution works fine if the number of tokens is small, but it will break down when the number increases. Better is to use a feature of the regular expression engine: whenever it completes a matching group, it assigns the group number to an attribute of the match object called lastindex. Rewrite tokenize to make use of this feature (using re.match repeatedly and inspecting each match object) and still produce the same output as before.
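A sketch for part (d), under the same assumed token names. Pattern.match accepts a start position, so we can advance through the string one match at a time; lastindex is the (1-based) number of the group that matched, which indexes directly into the token-name tuple:

```python
import re

TOKENS = ('int', 'id', 'op')
# Leading whitespace is consumed by \s*; the token alternatives are groups 1-3.
PATTERN = re.compile(r"\s*(?:(\d+)|([A-Za-z_][A-Za-z0-9_]*)|([=+*]))")

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = PATTERN.match(text, pos)
        if m is None:          # trailing whitespace or an unrecognized character
            break
        # m.lastindex is the number of the group that matched this lexeme.
        yield m.group(m.lastindex), TOKENS[m.lastindex - 1]
        pos = m.end()

xpr = "res = 3 + x_sum*11"
print(list(tokenize(xpr)))
# [('res', 'id'), ('=', 'op'), ('3', 'int'), ('+', 'op'),
#  ('x_sum', 'id'), ('*', 'op'), ('11', 'int')]
```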
(e) We can improve the approach further, using another feature, the scanner() method of compiled regular expressions: it creates a scanner object attached to a string, keeps track of the current position, and moves forward after each successful match. Rewrite the tokenize generator to make use of this feature (calling scanner() once and then match() repeatedly, yielding lexeme and token pairs) and again produce the same output as before.
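A sketch for part (e). Note that scanner() is an undocumented but long-standing CPython feature of compiled patterns; the returned object remembers the current position and each successful match() advances past the matched text, so no manual position bookkeeping is needed:

```python
import re

TOKENS = ('int', 'id', 'op')
PATTERN = re.compile(r"\s*(?:(\d+)|([A-Za-z_][A-Za-z0-9_]*)|([=+*]))")

def tokenize(text):
    # The scanner object tracks the current position in `text` itself.
    scan = PATTERN.scanner(text)
    while True:
        m = scan.match()
        if m is None:          # end of input (or an unrecognized character)
            break
        yield m.group(m.lastindex), TOKENS[m.lastindex - 1]

xpr = "res = 3 + x_sum*11"
print(list(tokenize(xpr)))
# [('res', 'id'), ('=', 'op'), ('3', 'int'), ('+', 'op'),
#  ('x_sum', 'id'), ('*', 'op'), ('11', 'int')]
```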