This assignment is writing a python program on Linux for a parser for HTTP GET to read in lines of characters from standard input (i.e., the keyboard) and determine which lines, if any, are legal HTTP GET requests. Guidelines are attached below. For further clarification, read the "The Assignment" section.
The following is the test code you will use to test against your own written code:
MakeInput1.py
import sys sys.stdout.write('GET /index.html HTTP/1.0') sys.stdout.write('GET /index2.html HTTP/1.0') sys.stdout.write('GET /cat.png HTTP/1.0')
MakeOutput1.py
import sys sys.stdout.write('GET /index.html HTTP/1.0 ') sys.stdout.write('Method = GET ') sys.stdout.write('Request-URL = /index.html ') sys.stdout.write('HTTP-Version = HTTP/1.0 ') sys.stdout.write(' ') sys.stdout.write(' ') sys.stdout.write(' ') sys.stdout.write('COMP431
') sys.stdout.write(' ') sys.stdout.write(' ') sys.stdout.write('GET /index2.html HTTP/1.0 ') sys.stdout.write('Method = GET ') sys.stdout.write('Request-URL = /index2.html ') sys.stdout.write('HTTP-Version = HTTP/1.0 ') sys.stdout.write('404 Not Found: /index2.html ') sys.stdout.write('GET /cat.png HTTP/1.0 ') sys.stdout.write('Method = GET ') sys.stdout.write('Request-URL = /cat.png ') sys.stdout.write('HTTP-Version = HTTP/1.0 ') sys.stdout.write('501 Not Implemented: /cat.png ')
Introduction-Mini-steps Towards the Construction of a Web Server This assignment is a simple string-parsing and file-I/O problem whose solution we will build on in later assignments to build a simple web server that will work with standard browsers such as Internet Explorer. At a high-level, a web server is simply a program that receives commands from clients (most likely web browsers), processes the commands, and sends the results of the processing back to the client as a response to the command In this abstract view of a web server, a server is a program that executes a logically infinite loop wherein it receives and processes commands. In this first assignment you will develop a portion of the code that will be used by the web server to process commands it receives. Specifically, you are to write a program to determine if a command (a text string) received by a server is a valid HTTP GET command, and if so, to read the file requested from disk The HTTP GET Command An HTTP GET command is simply a line of input that looks like the following: GET /home/public/hw1/index.html HTTP/1.0 The GET command (more commonly referred to as a "request") is made up of three substrings: a "method"- the word GET (in upper case characters), a request URL"-a set of words separated by slashes (up') (taken together, the words and slashes are interpreted as a file name), and an "HTTP version identifier" - a string of the form "HTTP/x.y" where x and y are digits All of these components must appear in the order listed above and in a single character string terminated by a carriage-return-line-feed combination. Components may be separated in this string by any amount of "whitespace" (spaces and tabs) The HTTP GET request is part of the larger HTTP protocol. Protocols such as HTTP are typically specified more formally than the English description above by using a more precise specification notation. (These notations are, in essence, a textual form of the syntax diagrams - sometimes called "railroad diagrams"that are used to specify the formal syntax of a programming language.) For example, the formal definition of the HTTP GET "request line" is given by the following grammer: Request-Line Method WhSp Request-URL WhSp HTTP-Version *Space CRLF As an aside, this form of notation is a variation of a commonly used notation called Backus-Naur Form (BNF). You will often see the syntax of protocols expressed using BNF and variations on BNF. Method = GET" Request-URLAbsolute-Path HTTP-version = "HTTP " /" +DIGIT *FileNameChar ALPHA-UPALPHALOALPHA ." +DIGIT Absolute-Path /" - FileNamechar ALPHA I DIGIT I ." I-" I /" DIGIT
UPALPHA- LOALPHA = WhSp+Space space = (SP | HT) CRLF SP HT = In this notation Items appearing on the left-hand side of an expression are called tokens Tokens written in all uppercase characters are called terminals and represent a single character in the string; tokens with lowercase characters are called non-terminals and represent substrings, Anything in quotes is interpreted as a literal string or character that must appear exactly as written, Text in angle brackets (") is interpreted as an English textual description of required text, Parenthesis are a grouping operator The vertical bar i is interpreted as an or" operator and indicates a choice between components that are mutually exclusive, The plus +" denotes that one or more of the item following the plus sign must appear in sequence, and The asterisk * denotes that any nurnber of the item following the asterisk (including 0) may appear in sequence . . For example, the GET request above conforms to the formal description and hence is a valid HTTP GET request (assuming it is terminated with a carriage-return-line-feedthe line termination "character" for UNIX). The following strings do not conform to the formal description and would be rejected as invalid or illegal requests get /jasleen/public_html/Courses/Spring07 HTTP/1.0 GET /jasleen/public html/Courses/Spring07 HTTP/1.0 GET /jasleen/public_html/Courses/Spring07 HTTP /1.0 The first request contains an invalid Method token (the "get" is not in upper case); the second and third requests contain an invalid HTTP-Version token. In the case of the second request, note that for most parses of the request string, the substring /jas1een/public" would be returned as a valid Absolute-Path and hence the next token searched for would be the HTTP-Version token. That is, having found the Absolute-Path'jasleen/public," a parser would next try and interpret the substring "html/Courses/Spring07" as the HTTP-Version token. Thus although the error in the request was white space appearing in the file name, the error manifests itself in the parse as an invalid HTTP-Version token. In the case of the third request, the space following the HTTP string is not allowed in the HTTP-Version token The Assignment -A Parser for HTTP GET For this assignment you are to write a Python program on Linux to read in lines of characters from standard input (i.e., the keyboard) and determine which lines, if any, are legal HTTP GET requests. For each legal line, the program will read from the disk the file requested and print it on standard output ERROR -- Spurious token before CRLF . For the purpose of deciding which is the first token that has error, assume tokens are delimited by , , or . You should check for syntax errors as well as ordering errors in the order in which commands are input and emit error messages as appropriate. All acknowledgement and error messages should be formatted exactly as shown above. When an error is encountered, skip all input till the next closest or or (remember that both and produce a newline character) Valid Input Processing If the request is parsed without syntax errors, the program will do the following. If the filename ends in either of the strings ".txt", ".htm", or ".html" then your program should attempt to open the specified file, read successive lines from the file, and output these lines to standard output (i.e., the screen). (For this assignment you may assume that any file having a file name ending with any of the above extensions contains only ASCII text lines with the normal line termination character sequences.) The test for the file extension should be case insensitive and hence any uppercase or upper/lower case variant of the above file extensions is acceptable. For example, for the GET request 4 GET /foo/bar.html HTTP/1.0 your program would open the file foo/bar.html and write the contents of the file to standard output The file name represented by the Request-URL is to be interpreted relative to the current directory in which your prograrn is executing. That is, for the Request-URL foo/bar.html, your program should attempt to open the file foo/bar.html in the current working directory If the file name does not end in one of the extension strings listed above (or has no extension), the following error message should be output to standard output 501 Not Implemented: where where "" is the error message string of the Python IOError class where "" is the error message string of the Python IOError class. All output should be written to standard output (ie., to the Linux window in which you entered the command(s) to execute your program). Your program should format its output exactly as shown above. Your program should terminate when it reaches the end of the input file (when control-D is typed from the keyboard under UNIX). Your program must not output any user prompts, debugging information, status messages, extra white spaces, etc. Your outputs will be graded by comparing a pre-generated output and any of these "extras" that your program outputs will incur a penalty in the grade. The purpose of this assignment is to get up to speed with the Python programming language and the use of Linux program development tools. Note that, in the abstract, this assignment has nothing to do with networking and is just a simple text parsing problem. This assignment is, however, a useful exercise in writing protocol parsers that must adhere to standards strictly Introduction-Mini-steps Towards the Construction of a Web Server This assignment is a simple string-parsing and file-I/O problem whose solution we will build on in later assignments to build a simple web server that will work with standard browsers such as Internet Explorer. At a high-level, a web server is simply a program that receives commands from clients (most likely web browsers), processes the commands, and sends the results of the processing back to the client as a response to the command In this abstract view of a web server, a server is a program that executes a logically infinite loop wherein it receives and processes commands. In this first assignment you will develop a portion of the code that will be used by the web server to process commands it receives. Specifically, you are to write a program to determine if a command (a text string) received by a server is a valid HTTP GET command, and if so, to read the file requested from disk The HTTP GET Command An HTTP GET command is simply a line of input that looks like the following: GET /home/public/hw1/index.html HTTP/1.0 The GET command (more commonly referred to as a "request") is made up of three substrings: a "method"- the word GET (in upper case characters), a request URL"-a set of words separated by slashes (up') (taken together, the words and slashes are interpreted as a file name), and an "HTTP version identifier" - a string of the form "HTTP/x.y" where x and y are digits All of these components must appear in the order listed above and in a single character string terminated by a carriage-return-line-feed combination. Components may be separated in this string by any amount of "whitespace" (spaces and tabs) The HTTP GET request is part of the larger HTTP protocol. Protocols such as HTTP are typically specified more formally than the English description above by using a more precise specification notation. (These notations are, in essence, a textual form of the syntax diagrams - sometimes called "railroad diagrams"that are used to specify the formal syntax of a programming language.) For example, the formal definition of the HTTP GET "request line" is given by the following grammer: Request-Line Method WhSp Request-URL WhSp HTTP-Version *Space CRLF As an aside, this form of notation is a variation of a commonly used notation called Backus-Naur Form (BNF). You will often see the syntax of protocols expressed using BNF and variations on BNF. Method = GET" Request-URLAbsolute-Path HTTP-version = "HTTP " /" +DIGIT *FileNameChar ALPHA-UPALPHALOALPHA ." +DIGIT Absolute-Path /" - FileNamechar ALPHA I DIGIT I ." I-" I /" DIGIT UPALPHA- LOALPHA = WhSp+Space space = (SP | HT) CRLF SP HT = In this notation Items appearing on the left-hand side of an expression are called tokens Tokens written in all uppercase characters are called terminals and represent a single character in the string; tokens with lowercase characters are called non-terminals and represent substrings, Anything in quotes is interpreted as a literal string or character that must appear exactly as written, Text in angle brackets (") is interpreted as an English textual description of required text, Parenthesis are a grouping operator The vertical bar i is interpreted as an or" operator and indicates a choice between components that are mutually exclusive, The plus +" denotes that one or more of the item following the plus sign must appear in sequence, and The asterisk * denotes that any nurnber of the item following the asterisk (including 0) may appear in sequence . . For example, the GET request above conforms to the formal description and hence is a valid HTTP GET request (assuming it is terminated with a carriage-return-line-feedthe line termination "character" for UNIX). The following strings do not conform to the formal description and would be rejected as invalid or illegal requests get /jasleen/public_html/Courses/Spring07 HTTP/1.0 GET /jasleen/public html/Courses/Spring07 HTTP/1.0 GET /jasleen/public_html/Courses/Spring07 HTTP /1.0 The first request contains an invalid Method token (the "get" is not in upper case); the second and third requests contain an invalid HTTP-Version token. In the case of the second request, note that for most parses of the request string, the substring /jas1een/public" would be returned as a valid Absolute-Path and hence the next token searched for would be the HTTP-Version token. That is, having found the Absolute-Path'jasleen/public," a parser would next try and interpret the substring "html/Courses/Spring07" as the HTTP-Version token. Thus although the error in the request was white space appearing in the file name, the error manifests itself in the parse as an invalid HTTP-Version token. In the case of the third request, the space following the HTTP string is not allowed in the HTTP-Version token The Assignment -A Parser for HTTP GET For this assignment you are to write a Python program on Linux to read in lines of characters from standard input (i.e., the keyboard) and determine which lines, if any, are legal HTTP GET requests. For each legal line, the program will read from the disk the file requested and print it on standard output ERROR -- Spurious token before CRLF . For the purpose of deciding which is the first token that has error, assume tokens are delimited by , , or . You should check for syntax errors as well as ordering errors in the order in which commands are input and emit error messages as appropriate. All acknowledgement and error messages should be formatted exactly as shown above. When an error is encountered, skip all input till the next closest or or (remember that both and produce a newline character) Valid Input Processing If the request is parsed without syntax errors, the program will do the following. If the filename ends in either of the strings ".txt", ".htm", or ".html" then your program should attempt to open the specified file, read successive lines from the file, and output these lines to standard output (i.e., the screen). (For this assignment you may assume that any file having a file name ending with any of the above extensions contains only ASCII text lines with the normal line termination character sequences.) The test for the file extension should be case insensitive and hence any uppercase or upper/lower case variant of the above file extensions is acceptable. For example, for the GET request 4 GET /foo/bar.html HTTP/1.0 your program would open the file foo/bar.html and write the contents of the file to standard output The file name represented by the Request-URL is to be interpreted relative to the current directory in which your prograrn is executing. That is, for the Request-URL foo/bar.html, your program should attempt to open the file foo/bar.html in the current working directory If the file name does not end in one of the extension strings listed above (or has no extension), the following error message should be output to standard output 501 Not Implemented: where where "" is the error message string of the Python IOError class where "" is the error message string of the Python IOError class. All output should be written to standard output (ie., to the Linux window in which you entered the command(s) to execute your program). Your program should format its output exactly as shown above. Your program should terminate when it reaches the end of the input file (when control-D is typed from the keyboard under UNIX). Your program must not output any user prompts, debugging information, status messages, extra white spaces, etc. Your outputs will be graded by comparing a pre-generated output and any of these "extras" that your program outputs will incur a penalty in the grade. The purpose of this assignment is to get up to speed with the Python programming language and the use of Linux program development tools. Note that, in the abstract, this assignment has nothing to do with networking and is just a simple text parsing problem. This assignment is, however, a useful exercise in writing protocol parsers that must adhere to standards strictly