Issue
I'm trying to get all the book categories from this website: http://books.toscrape.com/
When I inspect the element, I see that the categories are in a list towards the top of the HTML, inside <div class="side_categories">.
My code:
from bs4 import BeautifulSoup
import requests
url = "http://books.toscrape.com/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
categories = soup.find_all(class_="side_categories")
This returns:
[<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
Books
</a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
Travel
</a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
Mystery
</a>
</li>...#the rest of the categories.
Now I'm a bit stuck, as I can't iterate through this like I would a list. Beautiful Soup's documentation has an example that returns a list: https://beautiful-soup-4.readthedocs.io/en/latest/#find-all
Their example returns this:
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Mine doesn't have that structure. What am I doing wrong? These are the packages in my Python environment:
beautifulsoup4==4.12.3
bs4==0.0.2
certifi==2023.11.17
charset-normalizer==3.3.2
idna==3.6
numpy==1.26.3
pandas==2.2.0
python-dateutil==2.8.2
pytz==2023.4
requests==2.31.0
six==1.16.0
soupsieve==2.5
tzdata==2023.4
urllib3==2.1.0
Solution
The issue is that you selected the outer element that holds all the links, which results in a ResultSet with just a single element.
Select your elements more specifically; then you can iterate over them to extract the information you need:
for link in soup.select('.side_categories li a'):
    print(url + link.get('href'))
I used CSS selectors
here to chain what I want to select.
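For example, to pair each category name with its full URL, you can combine get_text(strip=True) with the href attribute. Here is a sketch that runs against a trimmed copy of the sidebar markup shown in the question, so it works without a network request:

```python
from bs4 import BeautifulSoup

# A trimmed excerpt of the sidebar markup from http://books.toscrape.com/
html = """
<div class="side_categories">
  <ul class="nav nav-list">
    <li><a href="catalogue/category/books_1/index.html">Books</a>
      <ul>
        <li><a href="catalogue/category/books/travel_2/index.html">Travel</a></li>
        <li><a href="catalogue/category/books/mystery_3/index.html">Mystery</a></li>
      </ul>
    </li>
  </ul>
</div>
"""

url = "http://books.toscrape.com/"
soup = BeautifulSoup(html, "html.parser")

# Map each category name to its absolute URL
categories = {
    link.get_text(strip=True): url + link.get("href")
    for link in soup.select(".side_categories li a")
}

print(categories["Travel"])
# http://books.toscrape.com/catalogue/category/books/travel_2/index.html
```

On the live page you would parse requests.get(url).text instead of the hardcoded snippet.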
Based on your example:
for link in soup.find(class_='side_categories').find_all('a'):
    print(url + link.get('href'))
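One note on building the URLs: plain string concatenation only works here because url ends with a slash and the hrefs are relative. The standard library's urllib.parse.urljoin handles these edge cases for you, so it is a more robust choice (a side suggestion, not part of the original answer):

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/"
full = urljoin(base, "catalogue/category/books/travel_2/index.html")
print(full)
# http://books.toscrape.com/catalogue/category/books/travel_2/index.html
```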
Answered By - HedgeHog