Question
Hey! Could you build a lexical analyzer in C++ for a small programming language called Simple Perl-Like? The syntax is given below in EBNF notation, together with the requirements. The lex.h header file is also provided at the bottom.

1. Prog ::= StmtList
2. StmtList ::= Stmt; { Stmt; }
3. Stmt ::= AssignStmt | WriteLnStmt | IfStmt
4. WriteLnStmt ::= WRITELN (ExprList)
5. IfStmt ::= IF (Expr) { StmtList } [ ELSE { StmtList } ]
6. AssignStmt ::= Var = Expr
7. Var ::= NIDENT | SIDENT
8. ExprList ::= Expr { , Expr }
9. Expr ::= RelExpr [ ( -eq | == ) RelExpr ]
10. RelExpr ::= AddExpr [ ( -lt | -gt | < | > ) AddExpr ]
11. AddExpr ::= MultExpr { ( + | - | . ) MultExpr }
12. MultExpr ::= ExponExpr { ( * | / | ** ) ExponExpr }
13. ExponExpr ::= UnaryExpr { ^ UnaryExpr }
14. UnaryExpr ::= [ ( - | + ) ] PrimaryExpr
15. PrimaryExpr ::= IDENT | SIDENT | NIDENT | ICONST | RCONST | SCONST | (Expr)
Based on the language definition, the lexical rules of the language and the tokens assigned to the terminals are as follows:

1. The language has general identifiers, referred to by the IDENT terminal, which are defined as a word that starts with a letter or an underscore (_), followed by zero or more letters, digits, or underscores. Note that all identifiers are case sensitive. It is defined as:
   IDENT := [Letter _] {( Letter | Digit | _ )}
   Letter := [a-z A-Z]
   Digit := [0-9]
2. The language variables are either numeric scalar variables or string scalar variables. A numeric variable starts with $ followed by an IDENT, while a string variable starts with @ followed by an IDENT. Their definitions are as follows:
   NIDENT := $ IDENT
   SIDENT := @ IDENT
3. An integer constant is referred to by the ICONST terminal, which is defined as one or more digits. It is defined as:
   ICONST := [0-9]+
4. A real constant is a fixed-point real number referred to by the RCONST terminal, which is defined as one or more digits followed by a decimal point (dot) and zero or more digits. It is defined as:
   RCONST := ([0-9]+)\.([0-9]*)
   For example, 12.0, 0.2, and 2. are accepted as real constants, but .2 and 2.45.2 are not. Note that .2 is recognized as a dot (CAT operator) followed by the integer constant 2.
5. A string literal is referred to by the SCONST terminal, which is defined as a sequence of characters delimited by single quotes that must all appear on the same line. For example, 'Hello to CS 280.' is a string literal, while "Hello to CS 280." and 'Hello to CS 280." are not.
6. The reserved words of the language are: writeln, if, else. These reserved words have the following tokens, respectively: WRITELN, IF, ELSE.
7. The operators of the language are: +, -, *, /, ^, =, (, ), {, }, ==, >, <, . (dot), ** (repeat), -eq, -lt, and -gt. Apart from the parentheses and braces, whose tokens are listed in rule 8, the operators have the following operations and tokens:

   Operator   Token      Operation
   +          PLUS       add
   -          MINUS      subtract
   *          MULT       multiply
   /          DIV        divide
   ^          EXPONENT   exponent
   =          ASSOP      assignment
   ==         NEQ        numeric equality
   >          NGTHAN     numeric greater than
   <          NLTHAN     numeric less than
   .          CAT        string concatenation
   **         SREPEAT    string repetition
   -eq        SEQ        string equality
   -lt        SLTHAN     string less than
   -gt        SGTHAN     string greater than

   Note that the string comparison operators -eq, -lt, and -gt are not case sensitive.
8. The semicolon, comma, left parenthesis, right parenthesis, left brace, and right brace characters are terminals with the following tokens, respectively: SEMICOL, COMMA, LPAREN, RPAREN, LBRACES, and RBRACES.
9. A comment consists of all the characters from the # character to the end of the line. A recognized comment is skipped and does not have a token.
10. White space is skipped. However, white space between tokens improves readability and can be used as one way to delimit tokens.
11. An error is denoted by the ERR token.
12. End of file is denoted by the DONE token.

Lexical Analyzer Requirements:

A header file, lex.h, is provided for you. It contains the definition of the LexItem class, an enumerated type of token symbols called Token, and the declarations of three functions to be implemented. These are:
   extern ostream& operator<<(ostream& out, const LexItem& tok);
   extern LexItem id_or_kw(const string& lexeme, int linenum);
   extern LexItem getNextToken(istream& in, int& linenum);
You MUST use the header file that is provided. You may NOT change it.

I. You will write the lexical analyzer function, called getNextToken, in the file lex.cpp. The getNextToken function must have the following signature:
   LexItem getNextToken (istream& in, int& linenumber);
The first argument to getNextToken is a reference to an istream object that the function should read from. The second argument is a reference to an integer that contains the current line number. getNextToken should update this integer every time it reads a newline from the input stream. getNextToken returns a LexItem object. A LexItem is a class that contains a token, a string for the lexeme, and the line number as data members. Note that the getNextToken function performs the following:
1. Any error detected by the lexical analyzer should result in a LexItem object being returned with the ERR token and the lexeme value equal to the string recognized when the error was detected.
2. Note also that both ERR and DONE are unrecoverable. Once the getNextToken function returns a LexItem object for either of these tokens, you should not call getNextToken again.
3. Tokens may be separated by spaces, but in most cases the spaces are not required. For example, the input characters 3+7 and the input characters 3 + 7 will both result in the sequence of tokens ICONST PLUS ICONST. Similarly, the input characters 'Hello' 'World' and the input characters 'Hello''World' will both result in the token sequence SCONST SCONST.

II. You will implement the id_or_kw() function. The id_or_kw function accepts a reference to a string holding a general identifier lexeme (i.e., keyword, IDENT, SIDENT, or NIDENT) and a line number, and returns a LexItem object. It searches for the lexeme in a directory that maps the string value of a keyword to its corresponding Token value, and it returns a LexItem object containing the keyword Token if it is found. Otherwise, it returns a LexItem object containing a token for one of the possible types of identifiers (i.e., IDENT, SIDENT, or NIDENT).

III. You will implement the overloaded function operator<<. The operator<< function accepts a reference to an ostream object and a reference to a LexItem object, and returns a reference to the ostream object. The operator<< function should print out the string value of the Token in the tok object. If the Token is an IDENT, NIDENT, SIDENT, ICONST, RCONST, or SCONST, it will print out its token followed by its lexeme between parentheses. See the example in the slides.
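Requirements II and III can be sketched as follows. This is a minimal sketch and not the official solution: because the full lex.h is not reproduced here, the block defines its own stand-in Token enum and LexItem class. The enumerator order and the three-argument LexItem constructor are assumptions, and so is the exact output format TOKEN(lexeme), since the slides that show the expected format are not available.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
using namespace std;

// Stand-in for the provided lex.h (the real header must be used unchanged).
// The enumerator order and the LexItem constructor are assumptions.
enum Token { WRITELN, IF, ELSE, IDENT, NIDENT, SIDENT, ICONST, RCONST, SCONST,
             PLUS, MINUS, MULT, DIV, EXPONENT, ASSOP, NEQ, NGTHAN, NLTHAN,
             CAT, SREPEAT, SEQ, SLTHAN, SGTHAN, COMMA, SEMICOL,
             LPAREN, RPAREN, LBRACES, RBRACES, ERR, DONE };

class LexItem {
    Token token; string lexeme; int lnum;
public:
    LexItem(Token t, const string& lex, int line) : token(t), lexeme(lex), lnum(line) {}
    Token GetToken() const { return token; }
    string GetLexeme() const { return lexeme; }
    int GetLinenum() const { return lnum; }
};

// Look the lexeme up in the keyword map; otherwise classify it by its prefix.
LexItem id_or_kw(const string& lexeme, int linenum) {
    static const map<string, Token> keywords = {
        {"writeln", WRITELN}, {"if", IF}, {"else", ELSE}};
    auto it = keywords.find(lexeme);
    if (it != keywords.end()) return LexItem(it->second, lexeme, linenum);
    if (!lexeme.empty() && lexeme[0] == '$') return LexItem(NIDENT, lexeme, linenum);
    if (!lexeme.empty() && lexeme[0] == '@') return LexItem(SIDENT, lexeme, linenum);
    return LexItem(IDENT, lexeme, linenum);
}

// Print the token name; identifiers and constants also print "(lexeme)".
// The exact spacing of the output is an assumption.
ostream& operator<<(ostream& out, const LexItem& tok) {
    static const map<Token, string> names = {
        {WRITELN, "WRITELN"}, {IF, "IF"}, {ELSE, "ELSE"}, {IDENT, "IDENT"},
        {NIDENT, "NIDENT"}, {SIDENT, "SIDENT"}, {ICONST, "ICONST"},
        {RCONST, "RCONST"}, {SCONST, "SCONST"}, {PLUS, "PLUS"}, {MINUS, "MINUS"},
        {MULT, "MULT"}, {DIV, "DIV"}, {EXPONENT, "EXPONENT"}, {ASSOP, "ASSOP"},
        {NEQ, "NEQ"}, {NGTHAN, "NGTHAN"}, {NLTHAN, "NLTHAN"}, {CAT, "CAT"},
        {SREPEAT, "SREPEAT"}, {SEQ, "SEQ"}, {SLTHAN, "SLTHAN"}, {SGTHAN, "SGTHAN"},
        {COMMA, "COMMA"}, {SEMICOL, "SEMICOL"}, {LPAREN, "LPAREN"},
        {RPAREN, "RPAREN"}, {LBRACES, "LBRACES"}, {RBRACES, "RBRACES"},
        {ERR, "ERR"}, {DONE, "DONE"}};
    Token t = tok.GetToken();
    out << names.at(t);
    if (t == IDENT || t == NIDENT || t == SIDENT ||
        t == ICONST || t == RCONST || t == SCONST)
        out << '(' << tok.GetLexeme() << ')';
    return out;
}
```

Keeping the keyword lookup in a map means that adding a reserved word is a one-line change, and because the lookup is an exact string match, Writeln is treated as a plain IDENT, in line with the rule that identifiers are case sensitive.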
The header file lex.h:
#ifndef LEX_H_
#define LEX_H_
#include
    bool operator==(const Token token) const { return this->token == token; }
    bool operator!=(const Token token) const { return this->token != token; }
    Token GetToken() const { return token; }
    string GetLexeme() const { return lexeme; }
    int GetLinenum() const { return lnum; }
};
extern ostream& operator<<(ostream& out, const LexItem& tok);
extern LexItem id_or_kw(const string& lexeme, int linenum);
extern LexItem getNextToken(istream& in, int& linenum);
#endif /* LEX_H_ */
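One possible shape for getNextToken, given the declarations above, is a single loop that skips white space and comments and then branches on the first character of the next lexeme. The block below is a hedged sketch rather than a reference solution. It repeats a stand-in for lex.h (enumerator order and the LexItem constructor are assumptions) and a compact id_or_kw so that it compiles on its own. Several choices are also assumptions: the SCONST lexeme excludes the quotes, 2.45.2 yields ERR with the lexeme 2.45., and the -eq, -lt, and -gt lexemes are normalized to lowercase.

```cpp
#include <cassert>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

// Stand-in for the provided lex.h (the real header must be used unchanged).
// The enumerator order and the LexItem constructor are assumptions.
enum Token { WRITELN, IF, ELSE, IDENT, NIDENT, SIDENT, ICONST, RCONST, SCONST,
             PLUS, MINUS, MULT, DIV, EXPONENT, ASSOP, NEQ, NGTHAN, NLTHAN,
             CAT, SREPEAT, SEQ, SLTHAN, SGTHAN, COMMA, SEMICOL,
             LPAREN, RPAREN, LBRACES, RBRACES, ERR, DONE };

class LexItem {
    Token token; string lexeme; int lnum;
public:
    LexItem(Token t, const string& lex, int line) : token(t), lexeme(lex), lnum(line) {}
    Token GetToken() const { return token; }
    string GetLexeme() const { return lexeme; }
    int GetLinenum() const { return lnum; }
};

// Compact stand-in for id_or_kw so that this block compiles on its own.
LexItem id_or_kw(const string& lexeme, int linenum) {
    if (lexeme == "writeln") return LexItem(WRITELN, lexeme, linenum);
    if (lexeme == "if") return LexItem(IF, lexeme, linenum);
    if (lexeme == "else") return LexItem(ELSE, lexeme, linenum);
    Token t = lexeme[0] == '$' ? NIDENT : lexeme[0] == '@' ? SIDENT : IDENT;
    return LexItem(t, lexeme, linenum);
}

LexItem getNextToken(istream& in, int& linenum) {
    auto item = [&](Token t, const string& s) { return LexItem(t, s, linenum); };
    auto identChar = [](int c) { return isalnum(c) || c == '_'; };
    char ch;
    while (in.get(ch)) {
        if (ch == '\n') { ++linenum; continue; }
        unsigned char uc = static_cast<unsigned char>(ch);
        if (isspace(uc)) continue;
        if (ch == '#') {                          // comment: skip to end of line
            while (in.get(ch) && ch != '\n') {}
            if (in) ++linenum;                    // the loop stopped on a newline
            continue;
        }
        string lexeme(1, ch);
        if (isdigit(uc)) {                        // ICONST or RCONST
            while (isdigit(in.peek())) lexeme += static_cast<char>(in.get());
            if (in.peek() != '.') return item(ICONST, lexeme);
            lexeme += static_cast<char>(in.get());
            while (isdigit(in.peek())) lexeme += static_cast<char>(in.get());
            if (in.peek() != '.') return item(RCONST, lexeme);
            lexeme += static_cast<char>(in.get()); // second dot, e.g. 2.45.2
            return item(ERR, lexeme);
        }
        if (ch == '$' || ch == '@' || isalpha(uc) || ch == '_') {
            int next = in.peek();                 // $ and @ must be followed by an IDENT
            if ((ch == '$' || ch == '@') && !(isalpha(next) || next == '_'))
                return item(ERR, lexeme);
            while (identChar(in.peek())) lexeme += static_cast<char>(in.get());
            return id_or_kw(lexeme, linenum);
        }
        if (ch == '\'') {                         // SCONST must close on the same line
            string text;
            while (in.get(ch) && ch != '\n') {
                if (ch == '\'') return item(SCONST, text);
                text += ch;
            }
            return item(ERR, lexeme + text);
        }
        static const string singles = "+/^<>.,;(){}";
        static const Token singleTok[] = {PLUS, DIV, EXPONENT, NLTHAN, NGTHAN, CAT,
                                          COMMA, SEMICOL, LPAREN, RPAREN, LBRACES, RBRACES};
        size_t pos = singles.find(ch);
        if (pos != string::npos) return item(singleTok[pos], lexeme);
        if (ch == '*' || ch == '=') {             // "**" and "==" double the character
            if (in.peek() != ch) return item(ch == '*' ? MULT : ASSOP, lexeme);
            lexeme += static_cast<char>(in.get());
            return item(ch == '*' ? SREPEAT : NEQ, lexeme);
        }
        if (ch == '-' && in.peek() != EOF) {      // -eq, -lt, -gt in any letter case
            int c1 = in.get();
            int c2 = in.peek();
            string op{static_cast<char>(tolower(c1)), static_cast<char>(tolower(c2))};
            if (op == "eq" || op == "lt" || op == "gt") {
                in.get();
                return item(op == "eq" ? SEQ : op == "lt" ? SLTHAN : SGTHAN, "-" + op);
            }
            in.unget();                           // not a string operator: restore c1
        }
        return item(ch == '-' ? MINUS : ERR, lexeme);
    }
    return LexItem(DONE, "", linenum);
}
```

A caller would typically loop on getNextToken until it sees ERR or DONE, since both are unrecoverable. One limitation of this sketch is that a minus sign directly followed by an identifier beginning with eq, lt, or gt (for example -gtx) is read as the string operator followed by an identifier; whether that is the intended behavior depends on the course's test cases.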