Question
Goal: The goal of this part of the project is to implement, in Java or Python, a Tokenizer for the Core language. Although it is called a tokenizer, as we saw in class, what you will implement is a Scanner, which will read one line at a time from the input file, tokenize that line, and make those tokens available to the parser via the methods listed below. Once those tokens have all been consumed by the parser, the Scanner will read the next line, tokenize that line, etc. The approach used for tokenization in your Scanner should be based on the DFA for Core as we have discussed in class.

The methods that the Scanner should provide are dictated by the needs of the Parser, in particular the One-Token-Look-Ahead (OTLA) approach used by the Parser. OTLA means that there are times when some method in the parser needs to know what the next token is although it is not yet ready to "consume" it. We will see the details when we discuss the parser. In these situations, if the next token is an identifier, we do not need to know the name of the identifier, only the fact that the next token is an identifier. Similarly, if the next token is an integer, we do not need to know the actual value of the integer, only the fact that the next token is an integer. But when the Parser is ready to consume the identifier or integer token, it will need the name of the particular identifier or the value of the particular integer.

The methods listed below, to be implemented in your Core Scanner, are designed to account for these factors. In addition, they are designed so that the grader can easily grade your Scanner implementation. Of course, in addition to these methods, which constitute the interface of the Scanner, you may always implement any private methods you want inside the Scanner. The set of tokens of Core is as specified on slide 14 of the file part2.pptx.
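To illustrate the OTLA pattern described above, here is a sketch, in Python, of how a parser method peeks with getToken() and only fetches an identifier's name or an integer's value when it actually consumes the token. The method names follow the handout (in Python snake_case); the function parse_operand and the tuple return values are purely illustrative, not part of the assignment.

```python
# OTLA from the parser's point of view: peek with get_token() (which does
# not move the cursor), and only call int_val()/id_name() when consuming.
# 'scanner' is any object providing the interface specified in the handout.

def parse_operand(scanner):
    tok = scanner.get_token()        # peek: token number only, cursor unmoved
    if tok == 31:                    # next token is an integer
        value = scanner.int_val()    # now we need the actual value
        scanner.skip_token()         # consume it
        return ("int", value)
    if tok == 32:                    # next token is an identifier
        name = scanner.id_name()     # now we need the actual name
        scanner.skip_token()
        return ("id", name)
    raise SyntaxError("expected an integer or identifier")
```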
As listed on slide 14, the legal tokens of the Core language are:

Reserved words (11 reserved words): program, begin, end, int, if, then, else, while, loop, read, write

Special symbols (19 special symbols): ; , = ! [ ] && || ( ) + - * != == < > <= >=

Integers (unsigned)

Identifiers: start with an uppercase letter, followed by zero or more uppercase letters and digits. Note that something like "ABCend" is illegal as an id because of the lowercase letters; and it is not two tokens because of the lack of white space. But ABC123 and A1B2C3 are both legal.

White space requirement: White space, i.e., one or more blanks or tab characters or carriage returns, is required between any pair of tokens unless one or both of them is/are special symbols, in which case white space is optional. Note that white space is not a token.

As we discussed in class, we will number these tokens 1 through 11 for the reserved words, 12 through 30 for the special symbols, 31 for integer, and 32 for identifier. Note that 31 just tells us that the token in question is an integer, not the actual value of the integer. Similarly, 32 just tells us that the particular token is an identifier, not the name of the identifier. We will also use two additional numbers, 33 and 34, to indicate not actual tokens but situations that the Scanner has to deal with. We will use 33 to indicate that the Scanner is at the end-of-file so there are no more tokens. And 34 will indicate that the Scanner came across a string of characters in the input stream that is not a legal token, in other words, an error token.

The Scanner's interface should consist of the following (public) methods:

Constructor(): The Scanner's constructor should receive, as parameter, the name of the input file that contains the string of characters to be tokenized. The constructor should start by instantiating a BufferedReader corresponding to that file in the usual manner.
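Before continuing with the constructor, the token numbering just described can be written down directly as tables. That ';' is token 12 is confirmed by the sample output later in this handout ("program int X ;" prints 1 4 32 12); the ordering of the remaining special symbols here is an assumption and should be checked against slide 14.

```python
# Token numbers for Core: reserved words 1-11, special symbols 12-30,
# plus the four category numbers 31-34. The order of the special symbols
# beyond ';' (token 12) is assumed, not taken verbatim from the slides.

RESERVED = ["program", "begin", "end", "int", "if", "then",
            "else", "while", "loop", "read", "write"]

SPECIALS = [";", ",", "=", "!", "[", "]", "&&", "||", "(", ")",
            "+", "-", "*", "!=", "==", "<", ">", "<=", ">="]

TOKEN_NUM = {s: i + 1 for i, s in enumerate(RESERVED)}          # 1..11
TOKEN_NUM.update({s: i + 12 for i, s in enumerate(SPECIALS)})   # 12..30

INTEGER, IDENTIFIER, EOF_TOKEN, ERROR_TOKEN = 31, 32, 33, 34
```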
The constructor should then call a private method tokenizeLine(). This method reads a line from the input file, converts the sequence of characters into a sequence of Core tokens, and saves them in a private data structure, possibly an array. It should then set a cursor index that points to the first token in that structure. It would be useful to include, in the same structure, the string of characters making up each token, since you will need that for the case of integer tokens and identifier tokens. If a line that tokenizeLine() reads consists entirely of white space characters, it should skip that line and read the next line, skipping as many lines as necessary to get to a non-blank line. If the line contains a string of characters that is not a valid token, tokenizeLine() should still tokenize up to that point, with the next number in its private data structure being 34 to indicate an illegal token. tokenizeLine() also has to take account of greedy tokenizing.

int getToken(): This returns a number between 1 and 32 if the current token is a proper Core token; 33 if the Scanner is at the end of the file and there are no more tokens; 34 if the Scanner comes across a string that is not a legal token. Note that getToken() does not move the "cursor" forward, so repeated calls to getToken() will return the same value. To move past the current token, the Parser has to call skipToken().

void skipToken(): This method moves the "token cursor" to the next token (but does not return anything). If there are no more tokens in the current line, skipToken() calls tokenizeLine(), which reads the next line from the input file, converts the sequence of characters into a sequence of Core tokens, saves them in the Scanner's data structure, sets the cursor index to point to the first token, and returns.
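As an illustration of the greedy tokenizing that tokenizeLine() must perform, here is one possible hand-rolled version in Python. This is a sketch, not the table-driven DFA approach discussed later; the token numbers follow the handout (reserved words 1-11, specials 12-30, integer 31, identifier 32, error 34), but the exact ordering of the special symbols is an assumption.

```python
# A possible shape for tokenizeLine(): scan left to right, always taking
# the longest legal token (greedy tokenizing).

RESERVED = ["program", "begin", "end", "int", "if", "then",
            "else", "while", "loop", "read", "write"]
SPECIALS = [";", ",", "=", "!", "[", "]", "&&", "||", "(", ")",
            "+", "-", "*", "!=", "==", "<", ">", "<=", ">="]
TOKEN_NUM = {s: i + 1 for i, s in enumerate(RESERVED)}
TOKEN_NUM.update({s: i + 12 for i, s in enumerate(SPECIALS)})

def tokenize_line(line):
    """Return (token number, token string) pairs; a 34 entry ends the list."""
    tokens, i, n = [], 0, len(line)
    # Try longer specials first so "!=" wins over "!" (maximal munch).
    specials = sorted(SPECIALS, key=len, reverse=True)
    while i < n:
        c = line[i]
        if c in " \t\r\n":                        # white space separates tokens
            i += 1
        elif c.isdigit():                         # unsigned integer
            j = i
            while j < n and line[j].isdigit():
                j += 1
            if j < n and line[j].isalpha():       # "25X": no white space -> error
                tokens.append((34, line[i:]))
                break
            tokens.append((31, line[i:j]))
            i = j
        elif c.isupper():                         # identifier
            j = i
            while j < n and (line[j].isupper() or line[j].isdigit()):
                j += 1
            if j < n and line[j].islower():       # "ABCend" -> illegal token
                tokens.append((34, line[i:]))
                break
            tokens.append((32, line[i:j]))
            i = j
        elif c.islower():                         # candidate reserved word
            j = i
            while j < n and line[j].islower():
                j += 1
            word = line[i:j]
            if word in TOKEN_NUM and not (j < n and line[j].isalnum()):
                tokens.append((TOKEN_NUM[word], word))
                i = j
            else:                                 # misspelled keyword, etc.
                tokens.append((34, line[i:]))
                break
        else:
            for s in specials:                    # longest special symbol first
                if line.startswith(s, i):
                    tokens.append((TOKEN_NUM[s], s))
                    i += len(s)
                    break
            else:
                tokens.append((34, line[i:]))     # unknown character
                break
    return tokens
```

Note how keeping the token string alongside the number satisfies the suggestion above about the data structure: integer values and identifier names are recovered from the stored string later.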
If, when skipToken() is called, the current token is 33, i.e., we are already at the end-of-file, the cursor is not moved, since there are no more lines or tokens to read. Similarly, if the current token is 34, i.e., we have an illegal token, the cursor is not moved, since we do not go past an illegal token.

int intVal(): If getToken() returns 31 to indicate that the current token is an integer, we may call intVal() to get the value of that integer. If we call intVal() when the current token is not an integer, it will print an error message and the program will terminate.

string idName(): If getToken() returns 32 to indicate that the current token is an identifier, we may call idName() to get the name of that identifier. If we call idName() when the current token is not an identifier, it will print an error message and the program will terminate. Both intVal() and idName() will make use of the token strings saved in the Scanner's data structure.

Using the DFA: Where does the DFA for Core tokens enter the picture? Standard tokenizers use a table-driven approach in which the DFA (including all the transitions, indications of which states are accepting final states and what tokens they correspond to, etc.) is represented in a table. Greedy tokenizing is also represented in the table. The scanner itself is written as a driver program that uses the table to decide what to do when it is in any particular state, for any given character that might be next on the input stream. But we didn't discuss this in class, so you do not need to follow this approach. The book does contain a brief description of the approach in Section 2.2.3, and it also includes pseudo-code corresponding to the approach. I would recommend reading that before deciding whether to follow this approach in your implementation or do it some other way.
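Collecting the method specifications above, one possible Python shell for the Scanner looks like this. The field names, the (number, string) token representation, and the error message wording are my own choices, and the tokenizing itself is left as a stub; this is a sketch of the interface, not a definitive implementation.

```python
import sys

# Shell of the Scanner implied by the method specifications above.
# _tokenize_line() is the (greedy) line tokenizer and is left unimplemented.

class CoreScanner:
    def __init__(self, filename):
        self.reader = open(filename)
        self.tokens = []          # (token number, token string) pairs, one line
        self.cursor = 0
        self._tokenize_line()     # tokenize the first non-blank line

    def _tokenize_line(self):
        # Read the next non-blank line, fill self.tokens, reset self.cursor.
        # Append (33, "") at end-of-file, (34, ...) on an illegal token.
        raise NotImplementedError

    def get_token(self):
        # Peek: return the current token number without moving the cursor.
        return self.tokens[self.cursor][0]

    def skip_token(self):
        # Move to the next token; never move past 33 (EOF) or 34 (error).
        if self.get_token() in (33, 34):
            return
        self.cursor += 1
        if self.cursor >= len(self.tokens):
            self._tokenize_line()

    def int_val(self):
        # Only legal when the current token is an integer (31).
        if self.get_token() != 31:
            print("Error: current token is not an integer")
            sys.exit(1)
        return int(self.tokens[self.cursor][1])

    def id_name(self):
        # Only legal when the current token is an identifier (32).
        if self.get_token() != 32:
            print("Error: current token is not an identifier")
            sys.exit(1)
        return self.tokens[self.cursor][1]
```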
If you decide to follow this approach, you can simply convert the book's pseudo-code into Python or Java, depending on which one you are using; you would also have to come up with the table corresponding to your DFA for Core.

Your main() function should receive the name of the input file as a command line argument. It should then call the constructor of Scanner, passing that name as an argument. main() should then repeatedly call getToken(), print the returned token number, and call skipToken() until, after some number of iterations, getToken() returns 33 to indicate end-of-file or 34 to indicate an invalid token. In either case, it should print an appropriate message and terminate. main() should print each token number on a separate line. Note that main() never calls intVal() or idName(); these methods will be used in the parser.

Suppose the input file contains the following Core program:

program int X;
begin
  X=25;
  write X;
end

Your main() function should produce the sequence of numbers 1 4 32 12 2 32 ... 33, corresponding to the tokens "program", "int", "X", ";", "begin", "X", ..., EOF, but with each number being on a separate line.

Note that, although the example above is a legal Core program, the Scanner should not worry about whether the input stream is a legal Core program or not. That will be the job of the parser. All that your Scanner should care about is that each token in the input stream is a legal token. Of course, if the Scanner comes across an illegal token in the input stream, as described above, it should print an appropriate error message and stop. It should not try to continue beyond that point.

Additional Notes:

1. The tokens must be read line-by-line and printed. You should not read the entire stream of tokens in one fell swoop before starting to print them.
If there is an error in reading a token, such as a misspelled keyword, all the tokens prior to that token must be printed out before the error message for the bad token is printed. Then the Tokenizer must terminate.

2. You may use either Java or Python. Do NOT use any other language. Do NOT use the Tokenizer libraries of those languages or others that you may find online. Do NOT use any regular expression package that may be available either as part of the standard libraries or that you might come across online.

3. Submit a .zip file on Carmen that includes the following:
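Going back to the main() function described earlier, its print loop can be sketched as follows. The helper name run and the error message wording are my own; the loop body itself follows the handout: one token number per line, stopping at 33 (end-of-file) or 34 (illegal token), and never calling intVal() or idName().

```python
# Sketch of the main() driver loop: print one token number per line until
# getToken() returns 33 (EOF) or 34 (illegal token). 'scanner' is any
# object providing the Scanner interface described in the handout.

def run(scanner):
    while True:
        tok = scanner.get_token()
        print(tok)                       # each token number on its own line
        if tok == 33:                    # end-of-file: normal termination
            break
        if tok == 34:                    # illegal token: message, then stop
            print("Error: illegal token in input")
            break
        scanner.skip_token()             # consume the token, move on

# In main(), the input file name arrives as a command line argument, e.g.:
#   run(Scanner(sys.argv[1]))
```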