Question
I want to add another step to the following code: remove duplicate URLs.
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# opening up the connection, grabbing the page
page = urlopen(my_url)

# HTML parsing
soup = BeautifulSoup(page, 'html.parser')

# save as a CSV file
with open('index.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        url = link.get('href')
        url = urljoin(my_url, url)
        print(url)
        writer.writerow([url])
I am trying to add this part:
# remove duplicate links
file = open('index.csv', 'w')
links = {}
for link in soup.find_all('a', href=True):
    url = link.get('href')
    url = urljoin(my_url, url)
    if url not in links:
        file.write("%s " % url)
        links[url] = True
file.close()
This doesn't seem to work. I want to find all links on the page, convert the relative links to absolute URLs, remove duplicates, and save the result as a CSV file.
Step by Step Solution
There are 3 steps involved:
Step: 1
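Diagnose why the second snippet fails. It opens index.csv in 'w' mode a second time, so it competes with the first block: whichever runs last truncates the file and overwrites the other's output. It also writes the URLs with file.write("%s " % url), which produces one long space-separated line rather than CSV rows. Finally, the dict works as a seen-check, but a set is the idiomatic container for membership tests. The fix is to do the duplicate check inside the original csv.writer loop rather than in a separate pass over the file.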
Step: 2
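Keep a set of URLs already written, and check it before calling writer.writerow. Below is a minimal sketch of the changed loop only; it assumes my_url, soup, and writer are already set up exactly as in the question's code.

seen = set()
for link in soup.find_all('a', href=True):
    # make relative links absolute
    url = urljoin(my_url, link['href'])
    # only write URLs we have not seen before
    if url not in seen:
        seen.add(url)
        writer.writerow([url])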
Step: 3
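Put it together. Here is one complete version of the script with deduplication folded into the original loop; the newline='' argument is an addition, not from the question's code, and prevents csv.writer from producing blank rows on Windows.

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# open the connection and grab the page
page = urlopen(my_url)

# parse the HTML
soup = BeautifulSoup(page, 'html.parser')

# track URLs that have already been written
seen = set()

# save as a CSV file; newline='' prevents blank rows on Windows
with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        # make relative links absolute
        url = urljoin(my_url, link['href'])
        # skip duplicates
        if url not in seen:
            seen.add(url)
            print(url)
            writer.writerow([url])

If you would rather collect everything first and deduplicate afterwards, list(dict.fromkeys(urls)) removes duplicates while preserving first-seen order.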