Question

Posted on Sep 22, 2024

Develop a program that parses Apache Common Log Format log files and produces a report. The data for each of three reports is to be stored in one of three counting Hashes, which are then used to generate the output for the three reports.

The default layout for an Apache access log is defined by httpd.conf directives as follows:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

where:

%h is the remote host (i.e., the client IP)

%l is the identity of the user determined by identd (not usually used since not reliable)

%u is the user name determined by HTTP authentication (not used in the sample)

%t is the time the request was received, including the offset from GMT

%r is the full HTTP request line from the client. ("GET / HTTP/1.0")

%>s is the status code sent from the server to the client (200, 404 etc.)

%b is the size of the response to the client (in bytes).

%{Referer}i is the Referer request header, i.e., the referring page (e.g., the home page that has a hyperlink)

%{User-Agent}i is the User-Agent request header, which identifies the browser acting on behalf of the user

An example log file ~mccordt/access_log is provided on students.cs.nku.edu which you should use for testing purposes. First, make a copy of that file under your home directory. If you have access to an Apache web server with more realistic data, feel free to test with the access_log(s) from that server as well. You can edit it if you like to facilitate testing. For instance, one might introduce a malformed line to see what happens. That's up to you.

A typical line in the sample file will look like this:

192.168.1.142 - - [30/Sep/2016:16:19:18 -0400] "GET / HTTP/1.1" 403 4897 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0"

The dashes are used as placeholders for unpopulated fields.

To parse such a line into fields, we first need to read the file to get the lines. We will learn more about File I/O later in the course, but for now, we only need to know how to use the File class to read the file and let it store the lines of the file as an Array of Strings. Try the following commands in IRB with the access_log file in the same directory from which you started IRB and you will see how this works:

line_array = File.readlines("access_log")

p line_array.size

line_array.each { |line| p "Line: #{line}" }; p # see the NOTE below

NOTE: The added p command in the last line serves to give IRB something to evaluate other than the Array itself and is not needed in your actual program. It simply makes the output more readable when using IRB.

The pathname of the log file will be a variable in your program set to "access_log" unless you wish to test with your own log file, in which case you would set the name to that. Later, we will learn how to pass the file name as a command-line argument.

At this point, you should have an Array containing the lines of the file as String elements, which you might print for debugging before trying to go further.
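
As a minimal sketch of this reading step outside IRB, the block below wraps File.readlines in a helper and runs it against a temporary stand-in file so that it works without the real access_log. The helper name read_log_lines and the two-line file contents are assumptions made for illustration only.

```ruby
require "tempfile"

# Hypothetical helper: returns the file's lines as an Array of Strings.
def read_log_lines(path)
  File.readlines(path)
end

# A temporary file stands in for access_log (made-up contents).
Tempfile.create("access_log") do |f|
  f.write("first line\nsecond line\n")
  f.flush                                  # make sure the data is on disk before reading
  lines = read_log_lines(f.path)
  puts "Read #{lines.size} lines"
  lines.each { |line| puts "Line: #{line.chomp}" }   # chomp drops the trailing newline
end
```

In the actual program the path would simply be the variable holding "access_log".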

The next step is to iterate over the line Array and parse each line into fields. [Again, you might just write the iterator block and see if you can display each line before adding more code.]

At that point, you are ready to process each line. You must use one or more regular expressions to obtain the fields from the line. For this assignment, you must use the Regexp#match method and the MatchData object to get the fields. Use the following IRB commands to get an idea of how to use groups in MatchData objects to select certain fields. One Regexp per field is acceptable, in which case each MatchData object would only have one group, or you may use three groups in one Regexp. The example has two groups.

>> line = '192.168.1.142 - - [30/Sep/2016:16:19:18 -0400] "GET /index.html HTTP/1.1" 403 4897 "-"'

=> "192.168.1.142 - - [30/Sep/2016:16:19:18 -0400] \"GET /index.html HTTP/1.1\" 403 4897 \"-\""

>> md = line.match(/^([:\d\.]+) .*\[.*\].*\"[A-Z]+ *(.+) HTTP/)

=> #<MatchData "192.168.1.142 - - [30/Sep/2016:16:19:18 -0400] \"GET /index.html HTTP" 1:"192.168.1.142" 2:"/index.html">

>> p md[1]

"192.168.1.142"

=> "192.168.1.142"

>> p md[2]

"/index.html"

=> "/index.html"

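HINT: The first group in the Regexp above is "match any of the literal :, any digits, or literal periods at least one time" that is followed by a white space and then any characters up to the literal [. This allows it to capture IPv4 addresses as well as IPv6 addresses written with only digits and colons, such as ::1, the IPv6 loopback address. An IPv6 address containing the hex digits a-f would need a broader character class.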
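
As a sketch of the three-groups-in-one-Regexp option mentioned earlier, the pattern below adds a third group for the status code. The names LOG_PATTERN and parse_fields are hypothetical, and this is only one of many patterns that would work; it has been checked against the sample line format, not against every possible log line.

```ruby
# Hypothetical pattern: group 1 = client IP, group 2 = URL, group 3 = status code.
LOG_PATTERN = /^([:\d\.]+) .*\[.*\] "[A-Z]+ (\S+) HTTP\/[\d.]+" (\d{3}) /

# Returns [ip, url, status] as Strings, or nil when the line does not match
# (for example, a malformed line added for testing).
def parse_fields(line)
  md = LOG_PATTERN.match(line)   # Regexp#match returns a MatchData object or nil
  return nil if md.nil?
  [md[1], md[2], md[3]]
end

line = '192.168.1.142 - - [30/Sep/2016:16:19:18 -0400] "GET /index.html HTTP/1.1" 403 4897 "-"'
p parse_fields(line)               # => ["192.168.1.142", "/index.html", "403"]
p parse_fields("not a log line")   # => nil
```

Returning nil for a non-matching line lets the calling loop skip it instead of raising an error on md[1].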

NOTE: If you cannot figure out how to extract the "clean" fields with a Regexp, you may remove unwanted characters from individual fields using the String#gsub method on the field after generating the MatchData object but before converting the Strings to Symbols to use as Hash keys. Use the Ruby docs @ http://ruby-doc.org/core-2.3.4/ to assist you with String#gsub.
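
A minimal sketch of that cleanup, assuming a looser Regexp has captured the URL together with its surrounding quotation marks. The helper name strip_quotes and the raw value are made up for illustration.

```ruby
# Hypothetical helper: String#gsub returns a new String in which every match
# of the pattern (here, each double-quote character) is replaced by "".
def strip_quotes(field)
  field.gsub(/"/, "")
end

raw_url = '"/index.html"'      # field as captured, quotes included
clean_url = strip_quotes(raw_url)

p clean_url          # => "/index.html"
p clean_url.to_sym   # => :"/index.html"
```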

Before storing the fields in their respective Hashes, convert them to Symbols. Because some fields contain non-alphanumeric characters, Ruby may retain the quotation marks. For example, "hello" as a Symbol is just :hello but "192.168.1.1" as a Symbol is :"192.168.1.1".

When you display the Symbols with puts or print, the " marks are suppressed in any case, so we need not worry about them being there. The required result is that the IP addresses, URLs, and status codes need to match the look of the sample output below. If you don't see quotation marks in the sample output, you shouldn't see them in your output.
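
The counting Hashes described at the start can be sketched as follows. Hash.new(0) is standard Ruby; the helper name count_fields and the three sample values are assumptions for illustration.

```ruby
# Hypothetical helper: tallies an Array of field Strings in a counting Hash.
# Hash.new(0) returns 0 for a key that is not present yet, so += 1 works
# the first time a key is seen.
def count_fields(fields)
  counts = Hash.new(0)
  fields.each { |field| counts[field.to_sym] += 1 }   # Symbol keys, as required
  counts
end

ip_counts = count_fields(["192.168.1.1", "::1", "192.168.1.1"])

p ip_counts                                     # inspect form shows the quoted Symbols
ip_counts.each { |ip, n| puts "#{ip} #{n}" }    # puts shows no quotation marks
```

One such Hash per report (IP addresses, URLs, status codes) keeps the three tallies independent.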

This version of the program is hard-coded to produce all three sub-reports every time it is run. Later, we will learn to use additional command-line options to choose one report or another.

Your program must generate the following report with three specific sub-reports. The output should follow this format. You may use the output below to test that your program is counting correctly, but the actual data is truncated for the URL report.

Start with a descriptive header followed by at least one blank line, using the variable for the file name:

----------------------------------------------------

Statistics for the Apache log file access_log

-----------------------------------------------------

The first sub-report is a histogram of IP addresses (in a field of 20 spaces) using the asterisk (*) character, with a short header to explain what the data represent. For example:

Frequency of Client IP Addresses:

::1                 *******
192.168.1.142       ************************************
192.168.1.1         *************
192.168.1.34        ****
192.168.1.127       ***************
192.168.1.56        *********
192.168.1.138       *********
192.168.1.156       ***************
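
One way to produce such rows is sketched below. It assumes the 20-character field is left-justified, which the assignment text does not state explicitly; the helper name histogram_row and the two-entry Hash are illustrative only.

```ruby
# Hypothetical helper: the IP left-justified in a 20-character field,
# followed by one asterisk per request counted for that IP.
def histogram_row(ip, count)
  format("%-20s%s", ip, "*" * count)
end

ip_counts = { :"::1" => 7, :"192.168.1.34" => 4 }   # small illustrative subset

puts "Frequency of Client IP Addresses:"
ip_counts.each { |ip, count| puts histogram_row(ip, count) }
```

format converts the Symbol key with to_s, so no quotation marks appear in the output.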

The second sub-report is a table showing the number of times each unique URL appears in the file with a short header. Note that the forward slashes (/) are part of the URL patterns and that patterns such as /javajam4 are distinct from /javajam4/. (We are tracking what end-users type in the browser, not the number of pages in the site.) Use printf to display the URL in a consistent field width such that the totals are aligned as shown below. You may have to experiment with different widths to find a spacing that works for all URLs in the file.

Frequency of URLs Accessed:

* 7

/ 10

/noindex/css/bootstrap.min.css 1

/noindex/css/open-sans.css 1

/images/apache_pb.gif 1

/images/poweredby.png 1

/noindex/css/fonts/Bold/OpenSans-Bold.woff 3

/noindex/css/fonts/Light/OpenSans-Light.woff 3

/noindex/css/fonts/Light/OpenSans-Light.ttf 3

/noindex/css/fonts/Bold/OpenSans-Bold.ttf 3

/favicon.ico 4

/index.html 2

/first 1

... and so on ...
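
A sketch of the aligned URL table follows. The width of 50 is an assumption that happens to fit the URLs in the truncated excerpt above and may need adjusting for the full file; URL_WIDTH and url_row are hypothetical names.

```ruby
URL_WIDTH = 50   # assumed field width; adjust after checking the longest URL

# Hypothetical helper: the URL left-justified in a fixed-width field, then its count.
def url_row(url, count)
  format("%-#{URL_WIDTH}s %d", url, count)
end

url_counts = { :"/" => 10, :"/favicon.ico" => 4, :"/index.html" => 2 }   # illustrative subset

puts "Frequency of URLs Accessed:"
url_counts.each { |url, count| printf("%s\n", url_row(url, count)) }
```

Because /javajam4 and /javajam4/ are different Strings, they become different Symbols and are counted separately without any extra code.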

The last sub-report is a list of HTTP status codes (sorted by the code, not the percentage), with percentages of URLs that resulted in each status code, which should be displayed as shown. The percentage is based on the count of each unique code divided by the total number of lines in the file, which you should derive from the size of the Array of lines.

HINT: One or both of the numbers must be converted to a Float using to_f to force the result to be a decimal number. Once you have a decimal result, use round(2) to get two digits to the right of the decimal point.

HTTP Status Codes Summary:

200: 45.37%

403: 11.11%

404: 35.19%

301: 5.56%

304: 2.78%
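
The percentage calculation from the HINT can be sketched as below. The counts and the total of 108 lines are illustrative values chosen so that the rounding reproduces the percentages shown; they are not read from the actual file. The helper name percentage is hypothetical, and the rows are sorted by code as the text requires.

```ruby
# Hypothetical helper: share of all lines with a given status code, as a
# percentage rounded to two decimal places. to_f forces Float division.
def percentage(count, total_lines)
  (count.to_f / total_lines * 100).round(2)
end

# Illustrative counts only (assumed, not taken from the real access_log).
status_counts = { :"200" => 49, :"403" => 12, :"404" => 38, :"301" => 6, :"304" => 3 }
total_lines = 108   # in the real program: line_array.size

puts "HTTP Status Codes Summary:"
status_counts.sort_by { |code, _count| code.to_s }.each do |code, count|
  printf("%s: %.2f%%\n", code, percentage(count, total_lines))
end
```

Without to_f, 49 / 108 is Integer division and evaluates to 0.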