Question
Problem Statement: Assume that you have a web server or application that appends a line to a log file every time it serves a request. Two example lines, which show the format of the input log file, are as follows:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

The fields of each log line (typically called metadata, or columns here) are listed below. The unknown1 and unknown2 columns can be ignored for this assignment. Note that content_size is an int type, i.e., the number of bytes.

host, unknown1, unknown2, timestamp, method, url, version, response_code, content_size

Prerequisites

To do the data analysis, the first step is to implement a function that reads in the provided log file and stores the data in a pandas DataFrame.

Notes:
- Make sure the names and order of the columns follow the metadata above.
- (Hint) Storing the column names/headers in the DataFrame correctly could make the following tasks easier.
- (Hint) Avoid storing incorrect column names, missing column names, or storing column names in a different order.
- Using the two example log lines given above, it should generate a pandas DataFrame with one row per log line and the nine columns listed above.

Problem A (8 pts)

Write functions that answer the following questions:

1. Total number of distinct HTTP response codes
2. Median content_size
   - Hint: use numpy to find the median
   - Note: ignore values in the content_size column that are not a number
   - If the median is a float, cast it to an integer
   - Type-casting is required, so avoid using round, floor, ceiling, or any other math functions.
3. Top N (e.g., 10) most frequent hosts
   - Note: The result should be ordered from top 1 to N
4. Top N (e.g., 10) most frequent urls
   - Note: The result should be ordered from top 1 to N
5. Top N (e.g., 5) urls that received error response codes (i.e., non-200 response codes)
   - Note: The result should be ordered from top 1 to N
6. Total number of requests with 404 responses
7. Number of unique daily (in UTC time) hosts
   - Hint: Convert the timestamp string into a datetime type in the UTC timezone
   - Note: The result should be ordered from the earliest date to the latest date
8. Average number of daily (in UTC time) requests per host
   - Hint: Convert the timestamp string into a datetime type in the UTC timezone
   - Note: The result should be ordered from the earliest date to the latest date
   - If a number is a float, cast it to an integer

Problem B (4 pts)

Implement a function that writes the answers to Problem A into a JSON file. The format should match the following example:

{
  "get_num_of_distinct_resp_code": 1,
  "get_median_content_size": 2,
  "get_most_freq_hosts": ["/answer", "to", "q3"],
  "get_most_freq_urls": ["/answer", "to", "q4"],
  "get_top_urls_recv_err": ["/answer", "to", "q5"],
  "get_num_of_req_recv_404": 6,
  "get_num_of_unique_hosts_daily": [7, 0, 0],
  "get_avg_num_of_req_per_host_daily": [8, 0, 0]
}
Step by Step Solution
There are 3 Steps involved in it
Step: 1
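One possible way to approach the Prerequisites step is sketched below. It is only an illustration, not a reference solution: the names `LOG_PATTERN`, `parse_log_lines`, and `read_log_file` are hypothetical, and the regular expression assumes every line is shaped like the two example lines in the problem statement. Lines that do not match are simply skipped, which may or may not be what the assignment expects.

```python
import re

import pandas as pd

# Column names and order taken from the metadata in the problem statement.
COLUMNS = [
    "host", "unknown1", "unknown2", "timestamp", "method",
    "url", "version", "response_code", "content_size",
]

# Assumed pattern: host, two ignored fields, [timestamp],
# "METHOD URL VERSION", response code, content size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)$'
)


def parse_log_lines(lines):
    """Build a DataFrame from an iterable of raw log lines."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:  # malformed lines are skipped in this sketch
            rows.append(match.groups())
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Non-numeric sizes (for example "-") become NaN instead of raising.
    df["content_size"] = pd.to_numeric(df["content_size"], errors="coerce")
    return df


def read_log_file(path):
    """Read a log file from disk and return the parsed DataFrame."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return parse_log_lines(handle)
```

Here `response_code` and `timestamp` are left as strings; converting them is deferred to whichever analysis function needs a numeric or datetime value.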
Step: 2
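The Problem A questions can be sketched roughly as follows. The function names are borrowed from the JSON keys in Problem B; everything else is an assumption: the input is taken to be a pandas DataFrame with the nine columns from the problem statement, `response_code` may be a string or a number, and the timestamp format string is inferred from the two example log lines. The helper `_utc_dates` is hypothetical. How ties between equally frequent hosts or urls should be ordered is not specified in the problem, and this sketch does not try to control it.

```python
import numpy as np
import pandas as pd


def get_num_of_distinct_resp_code(df):
    return int(df["response_code"].nunique())


def get_median_content_size(df):
    # Drop values that are not numbers, then cast (not round) to int.
    sizes = pd.to_numeric(df["content_size"], errors="coerce").dropna()
    return int(np.median(sizes))


def get_most_freq_hosts(df, n=10):
    # value_counts sorts by count, descending.
    return df["host"].value_counts().head(n).index.tolist()


def get_most_freq_urls(df, n=10):
    return df["url"].value_counts().head(n).index.tolist()


def get_top_urls_recv_err(df, n=5):
    # "Error" means any non-200 response code, as the problem defines it.
    errors = df[df["response_code"].astype(str) != "200"]
    return errors["url"].value_counts().head(n).index.tolist()


def get_num_of_req_recv_404(df):
    return int((df["response_code"].astype(str) == "404").sum())


def _utc_dates(df):
    # Assumed timestamp layout: 01/Jul/1995:00:00:01 -0400
    stamps = pd.to_datetime(
        df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z", utc=True
    )
    return stamps.dt.date


def get_num_of_unique_hosts_daily(df):
    daily = df.groupby(_utc_dates(df))["host"].nunique().sort_index()
    return [int(value) for value in daily]


def get_avg_num_of_req_per_host_daily(df):
    grouped = df.groupby(_utc_dates(df))["host"]
    # Requests per day divided by distinct hosts per day.
    average = (grouped.size() / grouped.nunique()).sort_index()
    return [int(value) for value in average]
```

Because the timestamps are converted to UTC before the date is taken, a request logged late in the evening at -0400 is counted on the following UTC day.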
Step: 3
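For Problem B, a minimal sketch of the JSON writer might look like the following. The function name `write_answers_to_json`, the helper `_to_builtin`, and the choice to pass the answers in as a dictionary are all assumptions made for illustration; only the key names and their order come from the example in the problem statement.

```python
import json

# Keys in the order shown in the Problem B example.
EXPECTED_KEYS = [
    "get_num_of_distinct_resp_code",
    "get_median_content_size",
    "get_most_freq_hosts",
    "get_most_freq_urls",
    "get_top_urls_recv_err",
    "get_num_of_req_recv_404",
    "get_num_of_unique_hosts_daily",
    "get_avg_num_of_req_per_host_daily",
]


def _to_builtin(value):
    """Fallback for json.dump: unwrap numpy-style scalars via .item()."""
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"cannot serialize {type(value).__name__}")


def write_answers_to_json(answers, path):
    """Write the Problem A answers to `path` using the expected key order."""
    missing = [key for key in EXPECTED_KEYS if key not in answers]
    if missing:
        raise KeyError(f"missing answers: {missing}")
    ordered = {key: answers[key] for key in EXPECTED_KEYS}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(ordered, handle, indent=2, default=_to_builtin)
```

Rejecting incomplete input up front is a design choice of this sketch; an assignment grader may instead expect the function to compute the answers itself from the DataFrame.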