NEEDS TO BE CODED IN Python 3
Need both code and algorithm:
Deliverables :
You must use functions to modularize your work in a logical way. You should use exception handling where necessary as well. All submitted work must be your own.
Email Scraper
How do spammers get your email addresses? There are a lot of methods that are used to create the collections of email addresses that marketers use. Sometimes, the websites you sign up at sell your information, including your email. Spammers also have bots that scour the internet and scrape email addresses off of web pages. We are going to write our own simple bot. Your program will ask the user for a file that contains URLs (websites). It will load each one and search it for email addresses to scrape.
Python has a module that will help us pull data off of websites. We can pull it down just like text. It obviously won't look like the web page, but it will contain the HTML markup. HTML stands for Hypertext Markup Language. You can view the page source in virtually any browser. Right click on the page whose source you want, and you'll likely see a menu with an option that says View Page Source. This is what HTML code looks like, and by scouring the text we can find email addresses in the pages.
urllib module:
urllib is the Python library that helps with URLs. You should view the documentation for the module, since you never know when you'll find something useful, but for what we need, it is straightforward.
import urllib.request  # Imports should be done at the top of your program
request = urllib.request.Request("http://cnn.com")  # First create a request object
response = urllib.request.urlopen(request)  # Create a response object after we open the request
page_data = response.read()  # page_data has the page contents as bytes
page_str = page_data.decode("utf-8")  # Convert the bytes to a utf-8 string
response.close()  # Remember to close the response
What happens when you pass a bad URL to the request? If it creates an error, you probably want to use our error handling powers to solve that issue.
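One possible way to wrap the steps above in a function with error handling is sketched below. The function name fetch_page, the timeout value, and the choice to return None on failure are assumptions made for illustration, not part of the assignment. The sketch only covers the two most common failures: a malformed URL and an unreachable or failing server.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Return the page at url as a string, or None if it cannot be loaded.

    fetch_page and its timeout parameter are hypothetical names chosen
    for this sketch.
    """
    try:
        request = urllib.request.Request(url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            page_data = response.read()
    except ValueError:
        # Raised for malformed URLs, such as one missing the "http://" prefix.
        return None
    except urllib.error.URLError:
        # Raised when the server cannot be reached or answers with an HTTP error.
        return None
    # Assumption: replace undecodable bytes instead of failing on pages
    # that are not valid utf-8.
    return page_data.decode("utf-8", errors="replace")
```

A caller can check for None and print a message such as "sce.umkc.edu does not seem to be a valid url" before moving on to the next URL.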
How can we tell what an email is?
We're going to be looking for portions of text that start with mailto: (including the colon). The email address follows that. How do we know where the email address stops? It stops when you reach any character that is not a period, an @ sign, a digit 0-9, or a letter a-z in upper or lower case. This isn't the most resilient way, but it will give you some good practice working with strings. The strings you are going to get from websites will be extremely large, so making a function that you can pass smaller strings to and experiment with is crucial to debugging and finding errors in a timely and efficient manner.
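The scanning rule described above might be sketched as follows, using only string methods. The names extract_emails, ALLOWED, and MARKER are hypothetical and were picked for this example. Because it takes a plain string, the function can be tried on small inputs before being run on a whole page.

```python
# Characters that may appear in an address under the rule described above.
ALLOWED = ("abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789.@")
MARKER = "mailto:"


def extract_emails(text):
    """Return every address that follows a mailto: marker in text."""
    emails = []
    start = text.find(MARKER)
    while start != -1:
        pos = start + len(MARKER)      # first character after the colon
        end = pos
        # Advance until a character outside the allowed set is reached.
        while end < len(text) and text[end] in ALLOWED:
            end += 1
        if end > pos:                  # skip a marker with nothing after it
            emails.append(text[pos:end])
        start = text.find(MARKER, end)  # look for the next marker
    return emails


print(extract_emails('<a href="mailto:sce@umkc.edu">Contact</a>'))
```

Duplicates are kept on purpose here, so that removing them can stay a separate step.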
Encoded Emails
Some email addresses are encoded. If you get an email address that is encoded, then you'll want to parse it and create the real email address.
webmaster@umk&#099;.edu
This email address is HTML encoded. We've already seen that each character corresponds to a decimal number.
chr(119) # Returns 'w'
chr(101) # Returns 'e'
Decoding the entire string like this would result in an email address of webmaster@umkc.edu. Clearly, when you have an email address that looks like this, you are looking for sections that start with &#, have a number of up to 3 digits, and end with a semicolon. Again, you may find it easier to write a function that does this one thing by itself and returns an unencoded string.
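A minimal sketch of such a decoding function is shown below. It assumes the encoded sections are standard HTML decimal character references: an ampersand and a pound sign, one to three digits, then a semicolon. The name decode_entities is hypothetical. Anything that does not match that exact shape is copied through unchanged.

```python
DIGITS = "0123456789"


def decode_entities(text):
    """Replace decimal references such as &#099; with their characters."""
    result = []
    i = 0
    while i < len(text):
        if text.startswith("&#", i):
            first_digit = i + 2
            j = first_digit
            # Collect at most three digits after the "&#" prefix.
            while j < len(text) and j - first_digit < 3 and text[j] in DIGITS:
                j += 1
            # Only decode when digits were found and a semicolon follows.
            if j > first_digit and j < len(text) and text[j] == ";":
                result.append(chr(int(text[first_digit:j])))
                i = j + 1
                continue
        result.append(text[i])
        i += 1
    return "".join(result)


print(decode_entities("webmaster@umk&#099;.edu"))
```

Since the ampersand is not one of the characters allowed in an address, it is probably simplest to decode the page text first and look for mailto: afterwards.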
Our program's goals
We want to write a program that asks the user for a file that has URLs in it, one URL on each line. If the user gives us a file that doesn't exist or can't be opened, then you must be able to handle those errors. Once you have a file, open each URL, get the contents, and find all the email addresses. Once you are done, eliminate the duplicates and ask the user for a file to write the email addresses out to.
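The file handling and duplicate removal described above could be sketched like this. The function names read_urls, remove_duplicates, and write_emails are hypothetical, and returning an error message instead of printing it is a design assumption that keeps the functions easy to test. The messages mirror the wording in the sample session.

```python
def read_urls(filename):
    """Return (urls, error_message); exactly one of the two is None."""
    try:
        with open(filename, "r") as url_file:
            # One URL per line; blank lines are skipped.
            urls = [line.strip() for line in url_file if line.strip()]
    except FileNotFoundError:
        return None, "Could not open the file " + filename + ". It doesn't exist."
    except OSError:
        # Covers other failures, such as a directory name or a permission problem.
        return None, "Could not open the file " + filename + ". There was an IOError"
    return urls, None


def remove_duplicates(items):
    """Return items without repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def write_emails(filename, emails):
    """Write one email address per line to filename."""
    with open(filename, "w") as out_file:
        for email in emails:
            out_file.write(email + "\n")
```

FileNotFoundError has to be listed before OSError because it is a subclass of it. A calling loop can keep asking for a filename until read_urls returns a list.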
Program Specifications
The requirements are below, but an additional requirement has to be observed. There are many tools that can make much of this easier to do. In fact, many of them make it trivially easy. This isn't a course about finding and using libraries and modules, so you'll be stuck using strings and your wits (besides urllib, of course). However, once you've solved it, you may be interested in looking at 3rd-party modules like BeautifulSoup (terrible name). It helps in parsing and working with HTML and XML. Another built-in module that is quite useful is re, or regular expressions. Spending some time learning regular expressions at some point will pay off for you. Regular expressions are extremely powerful, flexible, and useful for validating data and finding matching strings. Another built-in tool that is useful for unescaping encoded email addresses is cgi.html.unescape. Learning to do these things by hand will pay off later when you don't have a tool that can do it for you. These are the skills that will allow you to build your own solutions.
In summary, you are not allowed to use:
Beautiful Soup
re ( regular expressions )
cgi
Any imported module other than urllib
Sample Program
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
Welcome to email scraper!
Enter the filename containting URLs to read ==> invalid.txt
Could not open the file invalid.txt. It doesn't exist.
Enter the filename containting URLs to read ==> subdir
Could not open the file subdir. There was an IOError
Enter the filename containting URLs to read ==> emails.txt
Enter a file to save the emails to ==> output.txt
Do you want to run this application again? Y/YES/N/NO ==> e
You must enter only Y/YES/N or NO only.
Do you want to run this application again? Y/YES/N/NO ==> y
Welcome to email scraper!
Enter the filename containting URLs to read ==> emails2.txt
We did not find any emails in the provided urls to save
Do you want to run this application again? Y/YES/N/NO ==> y
Welcome to email scraper!
Enter the filename containting URLs to read ==> emails3.txt
sce.umkc.edu does not seem to be a valid url
invalid_url does not seem to be a valid url
We did not find any emails in the provided urls to save
Do you want to run this application again? Y/YES/N/NO ==> n
Sample output.txt
sce@umkc.edu
webmaster@umkc.edu
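The run-again prompt in the sample session accepts a lowercase y and rejects e. One way to get that behavior is sketched below. The names parse_yes_no and ask_run_again are hypothetical, and the read_input parameter is an assumption added so the loop can be exercised without typing at the keyboard.

```python
def parse_yes_no(answer):
    """Return True for Y/YES, False for N/NO, and None for anything else."""
    cleaned = answer.strip().upper()
    if cleaned in ("Y", "YES"):
        return True
    if cleaned in ("N", "NO"):
        return False
    return None


def ask_run_again(read_input=input):
    """Keep prompting until the user gives a valid yes or no answer."""
    prompt = "Do you want to run this application again? Y/YES/N/NO ==> "
    while True:
        choice = parse_yes_no(read_input(prompt))
        if choice is not None:
            return choice
        print("You must enter only Y/YES/N or NO only.")
```

Keeping the validation in parse_yes_no, apart from the input loop, makes it possible to test the rule on its own.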