
Question


I want to add another step to the following code: remove duplicate URLs.

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# opening up connection, grabbing the page
page = urlopen(my_url)

# html parsing
soup = BeautifulSoup(page, 'html.parser')

# save as csv file
with open('index.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        url = link.get('href')
        url = urljoin(my_url, url)
        print(url)
        writer.writerow([url])

I am trying to add this part:

# remove duplicate links
file = open('index.csv', 'w')
links = {}
for link in soup.find_all('a', href=True):
    url = link.get('href')
    url = urljoin(my_url, url)
    if url not in links:
        file.write("%s " % url)
        links[url] = True
file.close()

This doesn't seem to be working. I want to find all the links on the page, convert relative links to absolute URLs, remove duplicate links, and save the result as a CSV file.

Step by Step Solution
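
The second snippet fails for two reasons. First, it reopens index.csv in write mode ('w'), which truncates everything the first loop just wrote. Second, it writes space-separated text with file.write() instead of CSV rows, so the output is no longer a CSV file. All three requirements (absolute URLs, no duplicates, CSV output) can be met in a single pass by tracking already-written URLs inside the original csv-writing loop. A minimal sketch of that approach, using the same page URL and output filename as the question:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# open the connection and grab the page
page = urlopen(my_url)

# parse the html
soup = BeautifulSoup(page, 'html.parser')

# write each absolute URL to the CSV once, skipping duplicates
seen = set()
with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        url = urljoin(my_url, link.get('href'))  # relative -> absolute
        if url not in seen:        # only write URLs not seen before
            seen.add(url)
            print(url)
            writer.writerow([url])

A set gives constant-time membership checks and is the most direct container for a "seen before?" test; the dict used in the question (links[url] = True) would work equally well if the deduplication is done inside this single loop rather than in a second pass that reopens the file.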

