Question

1 Approved Answer

Posted on Sep 24, 2024

Lexical analysis is the process of converting a sequence of characters (such as a string) into a sequence of tokens: smaller substrings that have an identified "meaning". A program that performs lexical analysis is called a tokenizer. A tokenizer is usually paired with a parser; together they analyze the syntax of the string according to the rules of the programming language being used. Parsing means analyzing the string in the context of that language to find the substrings that are meaningful. Your assignments will not include writing a parser.

A string tokenizer allows an application to break a string into tokens. A token, as explained above, is a word within a string that may or may not have meaning when it is analyzed by a parser.

A stream tokenizer takes an input stream and parses it into tokens. It recognizes identifiers, numbers, quoted strings, and various comment styles. Each character is classified as white space, alphabetic, numeric, quote, or comment character, and a character can have zero or more of these characteristics. Since stream tokenizers are often used as the first step in parsing computer programs, they usually have several options for processing or ignoring certain characters, depending on the programming language's particular rules.
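To make the idea of a string tokenizer concrete, here is a minimal sketch using Java's standard `java.util.StringTokenizer` with its default whitespace delimiters. The class name `StringTokenizerDemo` and the helper `split` are hypothetical names chosen for illustration; they are not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Hypothetical helper: breaks text into whitespace-separated tokens.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on space,
        // tab, newline, carriage return, and form feed.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [int, age, =, 42;]
        System.out.println(split("int age = 42;"));
    }
}
```

Note that the tokenizer attaches no meaning to the pieces it returns: `42;` comes back as one token because only whitespace separates tokens here. Deciding what each token means is the parser's job.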

These options include the following:

- Whether to treat line breaks as token delimiters or as whitespace (e.g., line breaks in Visual Basic indicate the end of a statement; in C++, they are ignored)
- Whether C-style comments are tokenized or skipped
- Whether C++-style comments are tokenized or skipped
- Whether keywords and names of identifiers should be converted to lowercase (e.g., C++ names and keywords are case-sensitive; SQL's are not)
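Java's standard library happens to include a class, `java.io.StreamTokenizer`, with switches that resemble the options listed above. The sketch below turns all four on to show their effect. The class name `StreamTokenizerOptionsDemo`, the helper `scan`, and the `<EOL>` marker are hypothetical choices made for illustration. One caveat: in this class, enabling `slashStarComments` or `slashSlashComments` makes the tokenizer recognize and skip those comments; it never returns a comment as a token.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerOptionsDemo {
    // Hypothetical helper: scans text and describes each token as a string.
    static List<String> scan(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);    // report line breaks as TT_EOL tokens
        st.slashStarComments(true);   // recognize and skip /* ... */ comments
        st.slashSlashComments(true);  // recognize and skip // comments
        st.lowerCaseMode(true);       // convert word tokens to lowercase

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add(st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add(String.valueOf(st.nval));
                        break;
                    case StreamTokenizer.TT_EOL:
                        tokens.add("<EOL>"); // marker chosen for this demo
                        break;
                    case '"':
                    case '\'':
                        tokens.add(st.sval); // body of a quoted string
                        break;
                    default:
                        // Any other character is returned as itself.
                        tokens.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [int, age, ;, <EOL>, string, name, ;]
        System.out.println(scan("int Age; /* years */\nString Name; // name"));
    }
}
```

Unlike a whitespace-only splitter, this class returns `;` as its own token, and both comments disappear from the output. The line break survives as a token only because `eolIsSignificant(true)` was set.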

Using Java, C#, or another object-oriented language of your choice, write a stream tokenizer method with the following signature:

    String[] tokenize(Stream in, bool tokenizeAtLineBreaks, bool ignoreCComments, bool ignoreCppComments)

Some important issues to remember are listed below:

- If one or both of the comment flags is set to TRUE, the entire comment should be treated as a single token. In other words, do not treat whitespace within the comment as a token delimiter. For example, the following lines of code each contain 3 tokens (the comment counts as one):

    int age; /* This is the person's age in years */
    String name; // This is the person's name

- If one or both of the comment flags is set to FALSE, the comment should be ignored and not returned or processed by the tokenizer. For example, if the ignoreCComments flag is set to FALSE, then the following line of code has only 2 tokens (the comment is skipped/ignored):

    int age; /* This is the person's age in years */

  And the following line of code has only 2 tokens if the ignoreCppComments flag is set to FALSE (the comment is skipped/ignored):

    String name; // This is the person's name

- If you encounter a double quotation mark character ("), you are within a string literal token. Do not separate the words within the string literal into separate tokens. For example, the following code segment contains 4 tokens (the quoted string counts as one):

    String univName = "Northcentral University";
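One possible shape for such a method is sketched below in Java. It is a rough starting point under several stated assumptions, not a reference solution:

- `Stream` and `bool` are mapped to Java's `InputStream` and `boolean`.
- The comment flags follow the wording of the question, so TRUE returns the comment as a single token and FALSE skips it, even though the flag names suggest the opposite.
- When `tokenizeAtLineBreaks` is TRUE, each line break is emitted as a `"\n"` token. The question does not specify this, so it is a guess.
- Escaped quotes inside string literals are not handled.
- The input is read as UTF-8 with `InputStream.readAllBytes()`, which requires Java 9 or later.

The names `TokenizerSketch`, `flush`, and `streamOf` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizerSketch {
    static String[] tokenize(InputStream in, boolean tokenizeAtLineBreaks,
                             boolean ignoreCComments, boolean ignoreCppComments) {
        String src;
        try {
            src = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder(); // token being accumulated
        int i = 0, n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (src.startsWith("/*", i)) {
                flush(cur, tokens);
                int end = src.indexOf("*/", i + 2);
                end = (end < 0) ? n : end + 2; // unterminated: run to end of input
                // Per the question's wording: TRUE keeps the comment as one token.
                if (ignoreCComments) tokens.add(src.substring(i, end));
                i = end;
            } else if (src.startsWith("//", i)) {
                flush(cur, tokens);
                int end = i;
                while (end < n && src.charAt(end) != '\n' && src.charAt(end) != '\r') end++;
                if (ignoreCppComments) tokens.add(src.substring(i, end));
                i = end;
            } else if (c == '"') {
                // Copy the whole string literal so inner spaces do not split it.
                int end = src.indexOf('"', i + 1);
                end = (end < 0) ? n : end + 1;
                cur.append(src, i, end);
                i = end;
            } else if (Character.isWhitespace(c)) {
                flush(cur, tokens);
                // Assumption: a line break becomes its own "\n" token when requested.
                if (c == '\n' && tokenizeAtLineBreaks) tokens.add("\n");
                i++;
            } else {
                cur.append(c);
                i++;
            }
        }
        flush(cur, tokens);
        return tokens.toArray(new String[0]);
    }

    // Moves the accumulated token, if any, into the result list.
    private static void flush(StringBuilder cur, List<String> tokens) {
        if (cur.length() > 0) {
            tokens.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Hypothetical convenience for trying the method on a string.
    static InputStream streamOf(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String line = "String univName = \"Northcentral University\";";
        // prints [String, univName, =, "Northcentral University";]
        System.out.println(Arrays.toString(tokenize(streamOf(line), false, true, true)));
    }
}
```

Because only whitespace, comments, and line breaks end a token in this sketch, `age;` stays together as one token, which matches the token counts given in the examples above. A fuller solution would also need to decide how to treat escaped quotes and character literals.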