Question


NEEDS TO BE CODED IN Python 3

Need both code and algorithm:

Deliverables :

You must use functions to modularize your work in a logical way. You should use exception handling where necessary as well. All submitted work must be your own.

Email Scraper

How do spammers get your email addresses? Marketers use many methods to build their collections of email addresses. Sometimes the websites you sign up at sell your information, including your email. They also have bots that scour the internet and scrape email addresses off of web pages. We are going to write our own simple bot. Your program will ask the user for a file that contains URLs (web sites). It will load each one and search for email addresses to scrape and use. Python has a module that will help us pull data off of websites. We can pull it down just like text. It obviously won't look like the web page, but will contain the HTML markup. HTML stands for Hypertext Markup Language. You can view the page source in virtually any browser. Right-click on the page whose source you want, and you'll likely see a menu with an option that says "View Page Source". This is what HTML code looks like, and by scouring the text we can find email addresses in the pages.

urllib module:

urllib is the Python library that helps with URLs. You should view the documentation for the module, since you never know when you'll find something useful, but for what we need, it is straightforward.

import urllib.request  # Import should be done at the top of your program

request = urllib.request.Request("http://cnn.com")  # First create a request object

response = urllib.request.urlopen(request)  # Create a response object after we open the request

page_data = response.read()  # page_data has the text as bytes

page_str = page_data.decode("utf-8")  # Convert the byte text to a utf-8 string

response.close()  # Remember to close the response

What happens when you pass a bad URL to the request? If it creates an error, you probably want to use our error-handling powers to solve that issue.
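The fetch steps above can be wrapped in one function with exception handling. This is a minimal sketch, not a required design: the name `fetch_page`, the `timeout` parameter, and the choice to return `None` on failure are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Return the page text for url, or None if it cannot be loaded.

    fetch_page and timeout are hypothetical names, not part of the assignment.
    """
    try:
        # Request() raises ValueError for strings with no scheme,
        # such as "invalid_url" or "sce.umkc.edu".
        request = urllib.request.Request(url)
        # The with-block closes the response even if read() fails.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            page_data = response.read()
    except (ValueError, urllib.error.URLError, OSError):
        # URLError (and HTTPError) cover unreachable hosts and HTTP errors;
        # OSError also covers timeouts and other socket failures.
        return None
    # errors="replace" keeps going when a page is not valid UTF-8.
    return page_data.decode("utf-8", errors="replace")
```

A caller can then print a message such as "sce.umkc.edu does not seem to be a valid url" whenever the function returns `None`.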

How can we tell what an email is?

We're going to be looking for portions of text that start with mailto: (including the colon). The email address follows that. How do we know where the email address stops? It stops when you reach any character that is not one of .@&#; or a digit 0-9 or a letter a-z, upper or lower case. This isn't the most resilient way, but it will give you some good practice working with strings. The strings you are going to get from websites will be extremely large, so making a function that you can pass smaller strings to and experiment with is crucial to debugging and finding errors in a timely and efficient manner.
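The scanning rule above can be tried out on small strings first. The following is a sketch under that rule only, using plain string methods; `find_emails`, `ALLOWED`, and `MARKER` are hypothetical names chosen for this example.

```python
# Characters that may appear in an address, per the rule above.
ALLOWED = set(".@&#;0123456789"
              "abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
MARKER = "mailto:"


def find_emails(text):
    """Return every run of allowed characters that follows 'mailto:'."""
    emails = []
    start = text.find(MARKER)
    while start != -1:
        begin = start + len(MARKER)
        end = begin
        # Advance until the first character that cannot be part of an address.
        while end < len(text) and text[end] in ALLOWED:
            end += 1
        if end > begin:
            emails.append(text[begin:end])
        # Resume the search after the address just consumed.
        start = text.find(MARKER, end)
    return emails


sample = '<a href="mailto:sce@umkc.edu">Contact</a> mailto:webmaster@umkc.edu?subject=hi'
print(find_emails(sample))
```

The first address ends at the closing quote and the second at the question mark, since neither character is in the allowed set.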

Encoded Emails

Some email addresses are encoded. If you get an email address that is encoded, then you'll want to parse it and create the real email address.

webmaster@umk&#099;.edu

This email address is HTML encoded. We've already seen that characters are represented by decimal numbers.

chr(119) # Returns w

chr(101) # Returns e

Decoding the entire string this way results in an email address of webmaster@umkc.edu. Clearly, when you have an email address that looks like this, you are looking for sections that start with &#, have a number of up to 3 digits, and end with a semicolon. Again, you may find it easier to write a function that does this one thing by itself and returns a decoded string.
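One possible shape for such a decoding function, written with string operations and `chr` only, is sketched below. The name `decode_entities` is an assumption, and the sketch leaves any sequence that does not match the pattern (`&#`, one to three digits, then a semicolon) untouched.

```python
def decode_entities(text):
    """Replace each &#NNN; (1 to 3 digits) with the character chr(NNN)."""
    result = []
    i = 0
    while i < len(text):
        if text.startswith("&#", i):
            j = i + 2
            # Collect at most three digits after the "&#".
            while j < len(text) and j < i + 5 and text[j] in "0123456789":
                j += 1
            # Accept only if there was at least one digit and a closing ";".
            if j > i + 2 and j < len(text) and text[j] == ";":
                result.append(chr(int(text[i + 2:j])))
                i = j + 1
                continue
        # Not an encoded character: copy it through unchanged.
        result.append(text[i])
        i += 1
    return "".join(result)


print(decode_entities("webmaster@umk&#099;.edu"))  # webmaster@umkc.edu
```

Because it takes and returns a plain string, the function can be tested on short inputs before it is run on a full page.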

Our program's goals

We want to write a program that asks the user for a file that has URLs in it, one URL on each line. If the user gives us a file that doesn't exist or can't be opened, then you must be able to handle those errors. Once you have a file, open each URL, get the contents, and find all the email addresses. Once you are done, eliminate the duplicates and ask the user for a file to write the email addresses to.
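The file-handling parts of these goals might be split into small functions like the ones below. This is a sketch under stated assumptions: the names `load_urls`, `remove_duplicates`, and `save_emails` are hypothetical, and returning an error message instead of printing it is just one design choice.

```python
def load_urls(filename):
    """Return (urls, None) on success or (None, message) on failure."""
    try:
        with open(filename, "r") as url_file:
            # One URL per line; skip blank lines.
            urls = [line.strip() for line in url_file if line.strip()]
    except FileNotFoundError:
        return None, "Could not open the file " + filename + ". It doesn't exist."
    except OSError:
        # Covers other failures, such as being handed a directory name.
        return None, "Could not open the file " + filename + ". There was an IOError"
    return urls, None


def remove_duplicates(emails):
    """Drop repeated addresses while keeping first-seen order."""
    seen = set()
    unique = []
    for email in emails:
        if email not in seen:
            seen.add(email)
            unique.append(email)
    return unique


def save_emails(filename, emails):
    """Write one address per line; return True if the write succeeded."""
    try:
        with open(filename, "w") as out_file:
            for email in emails:
                out_file.write(email + "\n")
    except OSError:
        return False
    return True
```

`FileNotFoundError` is caught before `OSError` because it is a subclass of it; with the order reversed, the more specific message would never be shown.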

Program Specifications

The requirements are below, but an additional requirement has to be observed. There are many tools that can make much of this easier to do. In fact, many of them make it trivially easy. This isn't a course about finding and using libraries and modules, so you'll be stuck using strings and your wits (besides urllib, of course). However, once you've solved it, you may be interested in looking at third-party modules like BeautifulSoup (terrible name). It helps in parsing and working with HTML and XML. Another built-in module that is quite useful is re, or regular expressions. Spending some time learning regular expressions at some point will pay off for you. Regular expressions are extremely powerful, flexible, and useful for validating data and finding matching strings. Another built-in tool that is useful for unescaping encoded email addresses is html.unescape, from the html module. Learning to do these things by hand will pay off later when you don't have a tool that can do it for you. These are the skills that will allow you to build your own solutions.

In summary, you are not allowed to use:

Beautiful Soup

re ( regular expressions )

cgi

Any imported module other than urllib

Sample Program

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32

Type "copyright", "credits" or "license()" for more information.

>>> ================================ RESTART ================================

>>>

Welcome to email scraper!

Enter the filename containing URLs to read ==> invalid.txt

Could not open the file invalid.txt. It doesn't exist.

Enter the filename containing URLs to read ==> subdir

Could not open the file subdir. There was an IOError

Enter the filename containing URLs to read ==> emails.txt

Enter a file to save the emails to ==> output.txt

Do you want to run this application again? Y/YES/N/NO ==> e

You must enter only Y/YES/N or NO only.

Do you want to run this application again? Y/YES/N/NO ==> y

Welcome to email scraper!

Enter the filename containing URLs to read ==> emails2.txt

We did not find any emails in the provided urls to save

Do you want to run this application again? Y/YES/N/NO ==> y

Welcome to email scraper!

Enter the filename containing URLs to read ==> emails3.txt

sce.umkc.edu does not seem to be a valid url

invalid_url does not seem to be a valid url

We did not find any emails in the provided urls to save

Do you want to run this application again? Y/YES/N/NO ==> n

Sample output.txt

sce@umkc.edu

webmaster@umkc.edu
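The run-again prompt in the sample session accepts Y, YES, N, or NO in any letter case and repeats the question otherwise. One way to sketch that behavior is to keep the parsing separate from the input loop so it can be tested without typing; `parse_yes_no` and `ask_run_again` are hypothetical names.

```python
def parse_yes_no(answer):
    """Return True for Y/YES, False for N/NO, and None for anything else."""
    cleaned = answer.strip().upper()
    if cleaned in ("Y", "YES"):
        return True
    if cleaned in ("N", "NO"):
        return False
    return None


def ask_run_again():
    """Keep prompting until the user gives a recognized answer."""
    prompt = "Do you want to run this application again? Y/YES/N/NO ==> "
    while True:
        choice = parse_yes_no(input(prompt))
        if choice is not None:
            return choice
        print("You must enter only Y/YES/N or NO only.")
```

The main program could then loop with something like `while True: ... if not ask_run_again(): break`.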
