Issue
I am trying to scrape this particular page:
getting started with python perlego
For a particular book, I want to look at the table of contents and return every heading, from "Title Page" to "Other books you may like".
Using Inspect Element, I found the tag and classes the table uses.
However, with the following code:
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.perlego.com/book/921329/getting-started-with-python-understand-key-data-structures-and-use-python-in-objectoriented-programming-pdf?queryID=9315f2c9285af80efdc99eaa9c5621bc&index=prod_BOOKS&gridPosition=2"
r = requests.get(url)
print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
# the trailing number in the class name (sc-b81fc1ca-0, sc-b81fc1ca-1, ...) goes up by one for the next element
print(soup.find_all(attrs={'class': 'sc-b81fc1ca-0'}))
What this prints is:
[<div class="sc-b81fc1ca-0 eqkOXa" data-testid="table-of-contents"><h2 class="sc-b81fc1ca-1 OnMGm">Table of contents</h2></div>]
whereas I would like all the tags with the class sc-b81fc1ca-2. I have tried searching for them with find_all, but it only returns an empty list.
Solution
The content you're looking for only loads after some JavaScript runs on the page, so it is not in the HTML that requests downloads. This tutorial should help you get that JavaScript to run before performing your scraping:
https://pythonprogramming.net/javascript-dynamic-scraping-parsing-beautiful-soup-tutorial/
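As a rough illustration of the idea, here is a minimal sketch that lets a real browser run the page's JavaScript and then hands the rendered HTML to BeautifulSoup. It uses Selenium with headless Chrome, which is one common way to do this and not necessarily the approach the linked tutorial takes. The helper names fetch_rendered_html and toc_headings are made up for this example, and the class sc-b81fc1ca-2 is taken from the question; class names like this are generated and may change when the site is rebuilt, so treat the selector as an assumption.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=15):
    """Load the page in headless Chrome and return the HTML after scripts have run."""
    # Selenium is imported here so that toc_headings() below works without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the entries carry the class seen in the question (sc-b81fc1ca-2).
        # Wait until at least one of them exists before reading the page source.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".sc-b81fc1ca-2"))
        )
        return driver.page_source
    finally:
        driver.quit()


def toc_headings(html):
    """Return the text of every table-of-contents entry found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    toc = soup.find(attrs={"data-testid": "table-of-contents"})
    if toc is None:
        return []
    # class_ matches elements that have this class among their classes.
    return [el.get_text(strip=True) for el in toc.find_all(class_="sc-b81fc1ca-2")]


# Usage (needs Chrome and the selenium package installed):
# html = fetch_rendered_html(url)
# print(toc_headings(html))
```

Run against the HTML that requests returns, toc_headings gives an empty list, which matches what the question describes: the container and its h2 are there, but the entries have not been added yet.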
Answered By - Michael Jones