Saturday, December 4, 2021

[FIXED] I'm trying to deduplicate weblinks scraped using Python & BeautifulSoup but it's not working


Issue

I'm trying to scrape a website in Python. I can get the links to print, but when I put them in a set to deduplicate them, there are still duplicates. Does anyone have any advice on what I'm doing wrong? Thanks in advance!

Edit: I tried what John suggested, but my CSV output is now a cascading list of links spread across the Excel sheet. I've posted the changes below the original code:

import requests
from bs4 import BeautifulSoup
page = "https://www.census.gov/programs-surveys/popest.html"
r   = requests.get(page)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
mylink = soup.find_all('a')
print ('The number of links to start with are: ', len(mylink) )    
#output = The number of links to start with are: 254
import csv
with open('census_links.csv', 'w', newline='') as f: 
weblinks = str(mylink)
writer = csv.writer(f, delimiter = ' ', lineterminator = '\r')
for link in mylink:
    hrefs = str(link.get('href'))
    if hrefs.startswith("None"):
        ''
    elif hrefs.startswith('http'):
        MySet = set()
        MySet.add(hrefs)
    elif hrefs.startswith('#'):
        ''
    elif hrefs.startswith(' '):
        ''
    print(set(MySet))
    file.write(str(MySet)+'\n')
    file.close


#Edited code:
import requests
from bs4 import BeautifulSoup
page = "https://www.census.gov/programs-surveys/popest.html"
r   = requests.get(page)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
mylink = soup.find_all('a')
print ('The number of links to start with are: ', len(mylink))
# The number of links to start with are:  254
import csv
with open('census_links.csv', 'w', newline='') as f:
    weblinks = str(mylink)
    writer = csv.writer(f, delimiter = ',', lineterminator = '\r')
    MySet = set()
for link in mylink:
    hrefs = str(link.get('href'))
    if hrefs.startswith("None"):
        continue
    elif hrefs.startswith('#'):
        continue
    elif hrefs.startswith(' '):
        continue
    elif hrefs.startswith('http'):
        MySet.add(hrefs)
        file.write(str(MySet)+'\n')
        file.close
print(str(MySet) +'\n')

Solution

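To get unique links, create MySet once, before the loop, and add every link to it. The first version in the question creates a new set on each iteration, so it never holds more than one link. You can also check membership explicitly with hrefs not in MySet, although a set already ignores values it contains.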
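As a rough illustration of why the placement of the set matters, the sketch below compares the two patterns on a small made-up list of links (the example.com style URLs and the variable names are placeholders, not taken from the census page):

```python
# Hypothetical sample data standing in for the scraped href values
hrefs = ['https://a.example', 'https://b.example', 'https://a.example']

# Pattern from the question: a new set on every iteration,
# so only the link from the current iteration survives
for href in hrefs:
    per_iteration = set()
    per_iteration.add(href)

# Creating the set once, before the loop, lets it accumulate links
unique = set()
for href in hrefs:
    unique.add(href)  # adding a value that is already present does nothing

print(per_iteration)  # holds only the last link processed
print(len(unique))    # number of distinct links
```

The explicit membership test is therefore a matter of taste; the important part is that the set outlives the loop.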

For a simple operation like this you don't need csv. To write the links in a single column (one link per line), use

"\n".join(MySet)

and to write them in a single row (comma-separated), use

",".join(MySet)

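To make the effect of the separator concrete, here is a small sketch with two placeholder links (not from the scraped page). It sorts the set first, on the assumption that reproducible output is wanted, because a set has no defined order:

```python
# Hypothetical links used only for illustration
links = {'https://example.com/a', 'https://example.com/b'}

# Sort so the output order is the same on every run
ordered = sorted(links)

one_per_line = "\n".join(ordered)  # one link per line: a single column in a spreadsheet
one_line = ",".join(ordered)       # all links on one line: a single row of cells

print(one_per_line)
print(one_line)
```
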
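MySet = set()  # create the set once, before the loop
for link in mylink:
    hrefs = link.get('href')
    # skip tags without an href and same-page anchors
    if not hrefs or hrefs.startswith('#'):
        continue
    # normalize site-relative links to absolute URLs
    if hrefs.startswith('/'):
        hrefs = 'https://www.census.gov' + hrefs

    # check if link already in MySet (optional: set.add() ignores duplicates)
    if hrefs not in MySet:
        MySet.add(hrefs)

# write one link per line, after the loop has finished
with open('census_links.csv', 'w', newline='') as f:
    f.write("\n".join(MySet))

print("\n".join(MySet))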
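If you would rather not hard-code the site prefix, one possible variation is sketched below. It is not part of the original answer: the helper names unique_links and write_links_csv are made up for this example, and it swaps the string concatenation for urllib.parse.urljoin from the standard library. It also assumes you only want http and https links, so mailto: and similar schemes are dropped.

```python
import csv
from urllib.parse import urljoin, urlparse


def unique_links(hrefs, base_url):
    """Return the sorted, de-duplicated absolute http(s) links from raw href values."""
    seen = set()
    for href in hrefs:
        # link.get('href') returns None when the <a> tag has no href attribute
        if not href:
            continue
        href = href.strip()
        # Skip values that were only whitespace and same-page anchors such as "#top"
        if not href or href.startswith('#'):
            continue
        # Resolve relative paths ("/data.html", "about.html") against the page URL
        absolute = urljoin(base_url, href)
        # Keep only web links; this drops mailto:, javascript:, tel: and so on
        if urlparse(absolute).scheme in ('http', 'https'):
            seen.add(absolute)  # a set ignores values it already holds
    return sorted(seen)


def write_links_csv(path, links):
    """Write one link per row, letting the csv module handle quoting."""
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows([link] for link in links)


if __name__ == '__main__':
    base = 'https://www.census.gov/programs-surveys/popest.html'
    sample = [None, '#content', '/data.html',
              'https://www.census.gov/data.html', ' /data.html ']
    print(unique_links(sample, base))
```

With the variables from the question, this could be called as unique_links((a.get('href') for a in mylink), page) and the result passed to write_links_csv('census_links.csv', ...). Sorting is a design choice here: it makes the file stable between runs, at the cost of losing the order in which links appear on the page.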

Answered By - uingtea

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
