Issue
I am taking a course that requires me to parse this page using BeautifulSoup: http://python-data.dr-chuck.net/known_by_Fikret.html
The instructions are: Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
This is the code I have so far:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
urllist = list()
taglist = list()

tags = soup('a')

for i in range(count):
    for tag in tags:
        taglist.append(tag)
    url = taglist[pos].get('href', None)
    print('Retrieving: ', url)
    urllist.append(url)

print('Last URL: ', urllist[-1])
This is my output:
Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Last URL: http://python-data.dr-chuck.net/known_by_Montgomery.html
This is the output that I am supposed to get:
Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://python-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://python-data.dr-chuck.net/known_by_Anayah.html
Last URL: http://python-data.dr-chuck.net/known_by_Anayah.html
I've been working on this for a while but I still have not been able to get the code to loop correctly. I am new to coding and I'm just looking for some help to point me in the right direction. Thanks.
Solution
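Your loop builds tags once, from the first page, and never fetches the page behind the link it just picked, so taglist[pos] returns the same link on every pass. Fetch and parse the current URL inside the loop:

import urllib.request
from bs4 import BeautifulSoup

def get_html(url):
    # Download the page at url and return it as a parsed BeautifulSoup object
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return soup

url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1  # positions are 1-based, list indexes are 0-based
urllist = list()

for i in range(count):
    taglist = list()  # start from an empty list for every page
    for tag in get_html(url)('a'):  # Needed to update your variable to new url html
        taglist.append(tag)
    url = taglist[pos].get('href', None)  # You grabbed url but never updated your tags variable.
    print('Retrieving: ', url)
    urllist.append(url)

print('Last URL: ', urllist[-1])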
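If you want to check the loop logic without hitting the network, here is a minimal sketch of the same idea. It is not the code from the answer: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs, and the names LinkCollector, follow_links, fetch and PAGES are made up for illustration. PAGES is a hypothetical in-memory stand-in for the course pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def follow_links(start_url, count, position, fetch):
    """Follow the link at 1-based `position`, `count` times.

    `fetch` is any callable that maps a URL to its HTML text
    (an assumption of this sketch, so the loop can be tested offline).
    Returns the list of URLs visited, starting with `start_url`.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        parser = LinkCollector()          # fresh parser for every page
        parser.feed(fetch(url))           # re-fetch and re-parse the current url
        url = parser.links[position - 1]  # convert 1-based position to an index
        visited.append(url)
    return visited


# Hypothetical in-memory "site" standing in for the course pages.
PAGES = {
    "a.html": '<a href="x.html">X</a><a href="b.html">B</a>',
    "b.html": '<a href="y.html">Y</a><a href="c.html">C</a>',
    "c.html": '<a href="z.html">Z</a><a href="d.html">D</a>',
}

visited = follow_links("a.html", count=3, position=2, fetch=PAGES.__getitem__)
print(visited)
```

The important part is the same as in the answer: the page is fetched and parsed inside the loop, using the url picked on the previous pass. To run it against real pages you would pass something like lambda u: urllib.request.urlopen(u).read().decode() as fetch.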
Answered By - Shaun Baker