Question
Currently the script associated with Example 5 in Week 5 lists the unique IP addresses present in the log. However, such a list may not be sufficient without a count showing how often each address appears, which helps reveal unexpected accesses to a particular resource. For this assignment you are tasked with modifying the script from Week 5 (shown below) so that it reports the following:
A list of IP addresses in ascending order,
How many times an IP address is reported in the log, and
Whether an IP address is in a list of known IP addresses.
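The three requirements above can be sketched on a small in-memory sample. This is only an illustration under stated assumptions, not the expected solution: the function name `summarize_ips` and the sample addresses are made up, and "ascending order" is assumed to mean numeric order by octet rather than plain string order.

```python
from collections import Counter


def summarize_ips(found_ips, known_ips):
    """Return (ip, count, expected) rows sorted in ascending numeric order."""
    counts = Counter(found_ips)  # number of occurrences of each address
    known = set(known_ips)       # set gives fast membership checks

    # Sort by the integer value of each octet so that 10.0.0.9 comes
    # before 10.0.0.10 (plain string sorting would reverse them).
    ordered = sorted(counts, key=lambda ip: tuple(int(part) for part in ip.split(".")))

    return [(ip, counts[ip], "Yes" if ip in known else "No") for ip in ordered]


# Made-up sample data, NOT taken from HDFS_2k.log
sample_matches = ["10.0.0.10", "10.0.0.9", "10.0.0.10", "192.168.1.5"]
sample_known = ["10.0.0.10", "192.168.1.5"]

rows = summarize_ips(sample_matches, sample_known)
for ip, count, expected in rows:
    print(f"{ip} | {count} | {expected}")
```

Note that counting requires recording every match, so a script that only stores an address the first time it is seen would need to collect all matches instead.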
Script Example From Week 5 Below
import os
from pathlib import Path
import re

# Getting the directory that contains the script, so all file operations will take place in that directory
script_home_dir = os.path.dirname(os.path.abspath(__file__))
sample_file = 'HDFS_2k.log'

# This list will host the content of the file
file_content = []

# Reading the file
with open(Path(script_home_dir, sample_file), 'r') as my_file:
    file_content = my_file.readlines()

# Trying to match only valid IP addresses (0-255) - Source: https://ihateregex.io/expr/ip/
digits_pattern = re.compile(r'(\b25[0-5]|\b2[0-4][0-9]|\b[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}')
print(f" Search pattern: {digits_pattern}")

ip_addresses_in_log = []

# Extracting multiple IP addresses from each line
# Working on each line to match the pattern, using a different method that can get all matches
for i, line in enumerate(file_content):
    line = line.strip()
    search_outcome = digits_pattern.finditer(line)  # This is the new way of attempting to find matches within the line
    print(f'Line {i}: {line}')
    if search_outcome is not None:
        # print(search_outcome)
        for n, ip in enumerate(search_outcome):
            ip_addr = ip.group()
            print(f'--- Match {n}: {ip_addr}')
            if ip_addr not in ip_addresses_in_log:
                ip_addresses_in_log.append(ip_addr)
    # Limiting the loop to the first few lines - Remove the if block below to run through the entire log file
    if i > 100:
        break

# ----------------------------------------
print(" Distinct IP addresses in the log:")
for ip_addr in ip_addresses_in_log:
    print(f'IP: {ip_addr}')
An example of the output is the following:
IP Address | Count | Expected |
10.50.100.150 | 72 | Yes |
10.50.100.152 | 100 | Yes |
10.50.100.155 | 46 | No |
Please note that the table above is just a depiction and I am not expecting an actual table.
The source of the analysis should be the HDFS_2k.log file, which was used as Data File 2 in Week 5 and is attached to this assignment. The list of known IP addresses is also attached to this assignment.
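The format of the attached list of known IP addresses is not shown here, so the following is only a sketch under an assumption: that the list is a plain text file with one address per line. The file name `known_ips.txt` and the helper `load_known_ips` are hypothetical, and the demonstration writes a throwaway file with made-up addresses rather than reading the real attachment.

```python
import tempfile
from pathlib import Path


def load_known_ips(path):
    """Read one IP address per line, skipping blank lines and stray whitespace."""
    with open(path, "r") as known_file:
        return {line.strip() for line in known_file if line.strip()}


# Demonstration with a temporary file and made-up addresses
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir, "known_ips.txt")  # hypothetical file name
    demo_file.write_text("10.0.0.10\n\n192.168.1.5\n")
    known = load_known_ips(demo_file)

print(sorted(known))
```

Returning a set keeps the later "is this address known?" check simple and fast. If the attachment uses a different layout (for example comma-separated values), the parsing line would need to change accordingly.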
Notes
Your scripts should be fully commented, including the author, the purpose of the script, and the date.