Sunday, October 16, 2022

[FIXED] Looping through HTML & following links

Labels: beautifulsoup, for-loop, html, loops, python

Issue

I am writing code that is supposed to open a URL, find the 3rd link on the page, and then repeat this process 3 more times (each time starting from the new URL).

I wrote a loop (below), but each time it seems to start over with the original URL.

Can someone help me fix my code?

import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#blank list
l = []

#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

#loop 
for _ in range(4):
    html = urllib.request.urlopen(url).read()    #open url
    soup = BeautifulSoup(html, 'html.parser')    #parse through BeautifulSoup
    tags = soup('a')    #extract tags
    
    for tag in tags:
        url = tag.get('href', None)    #extract links from tags
        l.append(url)    #add the links to a list
        url = l[2:3]    #slice the list to extract the 3rd url
        url = ' '.join(str(e) for e in url)    #change the type to string
    print(url)

Current Output: 
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html

Desired output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html

Solution

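You need to create the empty list inside the outer loop. In the original code, l is created once, before the loop, so it still holds the links from the first page on every later pass. New links are appended to the end, which means l[2:3] always returns the third link of the first page, no matter which page was just parsed. Resetting l at the start of each iteration makes the slice pick the third link of the current page. The following code works: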

import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# starting URL
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

# follow the 3rd link four times
for _ in range(4):
    l = []    # start with an empty list for every page
    html = urllib.request.urlopen(url).read()    # open the current URL
    soup = BeautifulSoup(html, 'html.parser')    # parse the page with BeautifulSoup
    tags = soup('a')    # extract all anchor tags
    
    for tag in tags:
        url = tag.get('href', None)    # extract the link from the tag
        l.append(url)    # add the link to this page's list
        url = l[2:3]    # slice the list to extract the 3rd URL
        url = ' '.join(str(e) for e in url)    # convert the one-item slice to a string
    print(url)

Result in terminal:

http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
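
The same idea can be sketched without a shared list at all. The version below is only an illustration, not the answerer's code: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so it runs without third-party packages, and the names LinkCollector, nth_link, fetch_page and follow_links are hypothetical helpers introduced here. Each page gets a fresh collector, so links from earlier pages cannot leak into the next lookup. With BeautifulSoup, indexing the per-page result of soup('a') directly should have the same effect.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def nth_link(html, position=3):
    """Return the href of the position-th link (1-based), or None if absent."""
    collector = LinkCollector()  # fresh collector per page: no stale links
    collector.feed(html)
    if len(collector.links) < position:
        return None
    return collector.links[position - 1]


def fetch_page(url):
    """Download a page and decode it as text (assumes UTF-8 is acceptable)."""
    return urlopen(url).read().decode("utf-8", errors="replace")


def follow_links(start_url, hops=4, position=3, fetch=fetch_page):
    """Follow the position-th link `hops` times; return the URLs visited."""
    url = start_url
    visited = []
    for _ in range(hops):
        href = nth_link(fetch(url), position)
        if href is None:  # page has too few links, stop early
            break
        url = urljoin(url, href)  # also resolves relative links
        visited.append(url)
    return visited
```

Calling follow_links('http://py4e-data.dr-chuck.net/known_by_Fikret.html') and printing each returned URL would mirror the loop above. The fetch parameter exists only so the function can be tried against in-memory pages instead of the network.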

Answered By - Barry the Platipus

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
