Question
Purpose: In this exercise, you will write a Lua module that does the lexical analysis.
Instructions
Write a Lua module lexit, contained in the file lexit.lua. Your module should do lexical analysis; it should be written as a hand-coded state machine.
Be sure to follow the Coding Standards.
- The interface of module lexit is very similar to that of module lexer, which was written in class, with some differences, to be covered shortly. In particular, lexit exports:
- Function lex (i.e., lexit.lex), which takes a string parameter and allows for-in iteration through lexemes in the passed string.
- Numerical constants representing lexeme categories. These constants are shown in the table below.
- Table catnames, which maps lexeme category numbers to printable strings.
- The interface of module lexit differs from that of lexer as follows:
- Lexing is done based on the Lexeme Specification in this document (below), not the one distributed in class.
- The exported numerical constants and category names are different.
- Module lexit should export nothing other than table catnames, seven constants representing lexeme categories, and function lex. You may write anything you want in the source code for the module, as long as it is local and not exported.
The following properties of module lexer should hold for module lexit as well.
- At each iteration, the iterator function returns a pair: a string, which is the string form of the lexeme, and a number representing the category of the lexeme.
- The number mentioned in the previous point is suitable as a key for table catnames.
- The iteration ends when there are no further lexemes.
The correspondence between lexeme category numbers and category names/strings should be as follows.
Category Number | Named Constant | Printable Form |
---|---|---|
1 | lexit.KEY | Keyword |
2 | lexit.ID | Identifier |
3 | lexit.NUMLIT | NumericLiteral |
4 | lexit.STRLIT | StringLiteral |
5 | lexit.OP | Operator |
6 | lexit.PUNCT | Punctuation |
7 | lexit.MAL | Malformed |
Thus, the following code should work.
[Lua]
lexit = require "lexit"

program = "x = 3; # Set a variable write(x+4, cr); "

for lexstr, cat in lexit.lex(program) do
    print(lexstr, lexit.catnames[cat])
end
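To make the exported interface concrete, here is a minimal skeleton, not a solution. Only the exported names (KEY through MAL, catnames, lex) come from the text above; the inner function name getLexeme is hypothetical, and the body of lex is a stub that yields no lexemes.

```lua
-- Hypothetical skeleton of lexit.lua. The exported names follow the
-- table above; everything else is a placeholder.
local lexit = {}

-- Exported constants: lexeme categories
lexit.KEY    = 1
lexit.ID     = 2
lexit.NUMLIT = 3
lexit.STRLIT = 4
lexit.OP     = 5
lexit.PUNCT  = 6
lexit.MAL    = 7

-- Exported table: maps category numbers to printable strings
lexit.catnames = {
    "Keyword",
    "Identifier",
    "NumericLiteral",
    "StringLiteral",
    "Operator",
    "Punctuation",
    "Malformed",
}

-- Exported function: returns an iterator usable in a for-in loop.
-- Stub only: a real lexer would return (lexstr, cat) pairs until the
-- input is exhausted, then nil.
function lexit.lex(program)
    local function getLexeme()
        return nil
    end
    return getLexeme, nil, nil
end

-- The actual file lexit.lua would end with:  return lexit
```

Because the iterator function returns nil immediately, a for-in loop over this stub simply runs zero times; the state machine would replace the body of getLexeme.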
Lexeme Specification
You will write a lexer that is to be part of an interpreter for a programming language called Caracal.
Whitespace characters are blank, tab, vertical-tab, new-line, carriage-return, form-feed. No lexeme, except for a StringLiteral, may contain a whitespace character. So a whitespace character, or any contiguous group of whitespace characters, is generally a separator between lexemes. However, pairs of lexemes are not required to be separated by whitespace.
A comment begins with pound sign (#) occurring outside a StringLiteral lexeme or another comment, and ends at a newline character or the end of the input, whichever comes first. There are no other kinds of comments. Any character at all may occur in a comment.
Comments are treated by the lexer as whitespace: they are not part of lexemes and are not passed on to the caller.
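The whitespace and comment rules above can be sketched as a helper that advances past everything the caller never sees. This is one possible shape, assuming a 1-based character position into the program string; the names isWhitespace and skipToNextLexeme are hypothetical.

```lua
-- Returns true if c is one of the six whitespace characters:
-- blank, tab, vertical-tab, new-line, carriage-return, form-feed.
local function isWhitespace(c)
    return c == " " or c == "\t" or c == "\v"
        or c == "\n" or c == "\r" or c == "\f"
end

-- Hypothetical helper: returns the position of the first character at
-- or after pos that is neither whitespace nor part of a comment.
-- Returns #prog + 1 if the rest of the input holds no lexeme.
local function skipToNextLexeme(prog, pos)
    while pos <= #prog do
        local c = prog:sub(pos, pos)
        if isWhitespace(c) then
            pos = pos + 1
        elseif c == "#" then
            -- Comment: skip to the newline or the end of the input.
            -- The newline itself is skipped as whitespace on the next pass.
            while pos <= #prog and prog:sub(pos, pos) ~= "\n" do
                pos = pos + 1
            end
        else
            break
        end
    end
    return pos
end
```

This helper would only be called between lexemes, so a pound sign inside a StringLiteral never reaches it.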
Legal characters outside comments and StringLiteral lexemes are whitespace and printable ASCII characters (values 32 [blank] to 126 [tilde]). Any other characters outside comments and StringLiteral lexemes are illegal.
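As a small illustration of the legality rule, the check can be written directly on byte values. This sketch assumes ASCII encoding, in which the six whitespace characters are bytes 9 through 13 plus 32; the function name isLegal is hypothetical.

```lua
-- Hypothetical helper: true if the single-character string c is legal
-- outside comments and StringLiteral lexemes.
-- Assumes ASCII: tab=9, new-line=10, vertical-tab=11, form-feed=12,
-- carriage-return=13, blank=32, tilde=126.
local function isLegal(c)
    local b = string.byte(c)
    return (b >= 9 and b <= 13) or (b >= 32 and b <= 126)
end
```

A character that fails this test outside a comment or string would be reported as a Malformed lexeme (a bad character).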
The maximal-munch rule is followed.
There are seven lexeme categories: Keyword, Identifier, NumericLiteral, StringLiteral, Operator, Punctuation, Malformed.
Keyword
One of the following sixteen:
and char cr def dq elseif else false for if not or readnum return true write
Identifier
A letter or underscore (_), followed by zero or more characters that are all letters, digits, or underscores, and not a keyword.
Here are some Identifier lexemes.
myvar _ ___x_37 HelloThere
Note. The reserved words are the same as the Keyword lexemes.
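One way to separate Keyword from Identifier is to scan the longest run of letters, digits, and underscores first, then look the result up in a set. The sketch below assumes ASCII letters and a 1-based position; the names keywordSet and scanWord are hypothetical, and the keyword list is copied from the specification above.

```lua
local KEY, ID = 1, 2   -- category numbers from the table above

-- Set of keywords, built from a list because several of them
-- (and, else, for, ...) are also reserved words in Lua itself.
local keywordSet = {}
for _, w in ipairs{ "and", "char", "cr", "def", "dq", "elseif", "else",
                    "false", "for", "if", "not", "or", "readnum",
                    "return", "true", "write" } do
    keywordSet[w] = true
end

local function isLetter(c)
    return (c >= "a" and c <= "z") or (c >= "A" and c <= "Z")
end

local function isDigit(c)
    return c >= "0" and c <= "9"
end

-- Assumes prog:sub(pos, pos) is a letter or underscore.
-- Returns the lexeme, its category, and the position just past it.
local function scanWord(prog, pos)
    local start = pos
    while pos <= #prog do
        local c = prog:sub(pos, pos)
        if isLetter(c) or isDigit(c) or c == "_" then
            pos = pos + 1
        else
            break
        end
    end
    local word = prog:sub(start, pos - 1)
    local cat = keywordSet[word] and KEY or ID
    return word, cat, pos
end
```

Scanning the whole word before the lookup is what makes maximal munch work here: elsewhere is read as one Identifier rather than the Keyword else followed by where.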
NumericLiteral
A sequence of one or more digits, optionally followed by an exponent.
An exponent is the letter e or E followed by an optional +, and then one or more digits.
Notes. A NumericLiteral must begin with a digit and cannot contain a dot (.). A minus sign is not legal in an exponent. A plus sign is legal, and optional, in an exponent. An exponent must contain at least one digit.
Here are some valid NumericLiteral lexemes.
1234 00900 123e+7 00E00 3e888
The following are not valid NumericLiteral lexemes.
-42 3e e 123E+ 1.23 123e-7
The first string above is an Operator (-) followed by a NumericLiteral (42). The second is a NumericLiteral (3) followed by an Identifier (e). The third is an Identifier (e). The fourth is a NumericLiteral (123), an Identifier (E), and an Operator (+). The fifth is a NumericLiteral (1), a Punctuation (.), and a NumericLiteral (23). The last is a NumericLiteral (123), an Identifier (e), an Operator (-), and a NumericLiteral (7).
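The exponent rule needs lookahead: the e (and an optional +) may only be consumed if a digit follows. A possible sketch of that logic, with the hypothetical name scanNumber and assuming a 1-based position at a digit:

```lua
local function isDigit(c)
    return c >= "0" and c <= "9"   -- false for the empty string
end

-- Assumes prog:sub(pos, pos) is a digit.
-- Returns the NumericLiteral lexeme and the position just past it.
local function scanNumber(prog, pos)
    local start = pos
    while isDigit(prog:sub(pos, pos)) do
        pos = pos + 1
    end

    -- Look ahead for an exponent: e or E, an optional +, then a digit.
    local c = prog:sub(pos, pos)
    if c == "e" or c == "E" then
        local look = pos + 1
        if prog:sub(look, look) == "+" then
            look = look + 1
        end
        if isDigit(prog:sub(look, look)) then
            -- The exponent is valid: commit and consume its digits.
            pos = look
            while isDigit(prog:sub(pos, pos)) do
                pos = pos + 1
            end
        end
        -- Otherwise pos is unchanged, so the e is left for the next lexeme.
    end

    return prog:sub(start, pos - 1), pos
end
```

With this approach, the invalid examples above fall out naturally: for 123E+ the lookahead fails, so only 123 is consumed and E and + are lexed separately afterward.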
StringLiteral
A double quote ("), followed by zero or more characters that are not double quotes or newlines, followed by a double quote. There are no escape sequences. Any character, legal or illegal, other than a newline, may appear in a StringLiteral. The beginning and ending quote marks are both part of the lexeme.
Here are some StringLiteral lexemes.
"xy" "x'y" "abc ###\"
Operator
One of the following fourteen:
== != < <= > >= + - * / % [ ] =
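For operators, maximal munch means trying the two-character forms before the one-character forms. The following is a sketch under that assumption; the names twoCharOps, oneCharOps, and scanOperator are hypothetical.

```lua
-- The four two-character operators
local twoCharOps = { ["=="] = true, ["!="] = true,
                     ["<="] = true, [">="] = true }

-- The ten one-character operators
local oneCharOps = {}
for c in ("<>+-*/%[]="):gmatch(".") do
    oneCharOps[c] = true
end

-- Returns the Operator lexeme starting at pos and the position just
-- past it, or nil if no operator starts at pos.
local function scanOperator(prog, pos)
    local two = prog:sub(pos, pos + 1)
    if twoCharOps[two] then
        return two, pos + 2
    end
    local one = prog:sub(pos, pos)
    if oneCharOps[one] then
        return one, pos + 1
    end
    return nil
end
```

Note that ! on its own is not in either table, so a lone exclamation mark would fall through to Punctuation.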
Punctuation
Any single legal character that is not whitespace, not part of a comment, and not part of any valid lexeme in one of the other categories, including Malformed.
Here are some Punctuation lexemes.
; ( ) { } ,
Malformed
There are two kinds of Malformed lexemes: bad character and bad string.
A bad character is any single illegal character that is not part of a comment or of a StringLiteral lexeme that began earlier.
A bad string is essentially a partial StringLiteral where the end of the line or the end of the input is reached before the ending quote mark. It begins with a double quote mark that is not part of a comment or StringLiteral that began earlier, and continues to the next newline or the end of the input, without a double quote appearing. Any character, legal or illegal, may appear in a bad string. If the lexeme ends at a newline, then this newline is not part of the lexeme.
Here are three Malformed lexemes that are bad strings.
"a-b-c
"wx yz
"'
In order to be counted as Malformed, each of the above three must end at a newline (which would not be considered part of the lexeme) or at the end of the input.
Note. The two kinds of Malformed lexemes are presented to the caller in the same way: they are both simply Malformed.
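The StringLiteral and bad-string rules can be handled by one scan that stops at whichever comes first: a closing quote, a newline, or the end of the input. This is only a sketch; the name scanString is hypothetical, and it assumes a 1-based position at an opening double quote.

```lua
local STRLIT, MAL = 4, 7   -- category numbers from the table above

-- Assumes prog:sub(pos, pos) is a double quote.
-- Returns the lexeme, its category, and the position just past it.
local function scanString(prog, pos)
    local start = pos
    pos = pos + 1                      -- move past the opening quote
    while pos <= #prog do
        local c = prog:sub(pos, pos)
        if c == '"' then
            -- Closing quote found: it is part of the lexeme.
            return prog:sub(start, pos), STRLIT, pos + 1
        elseif c == "\n" then
            -- Bad string: the newline is not part of the lexeme.
            return prog:sub(start, pos - 1), MAL, pos
        end
        pos = pos + 1                  -- any other character is allowed
    end
    -- End of input reached without a closing quote: bad string.
    return prog:sub(start), MAL, pos
end
```

No legality check appears inside the loop, since any character other than a newline may occur in a StringLiteral or a bad string.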