Issue
I would like to get all the hrefs inside the li elements of this ul.
So far I have written this code:
import bs4, requests, re

product_pages = []

def get_product_pages(openurl):
    global product_pages
    url = 'https://www.ah.nl/producten/aardappel-groente-fruit'
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for li in soup.findAll('li', attrs={'class': 'taxonomy-sub-selector_root__3rtWx'}):
        for a in li.findAll('a', href=True):
            print(a.attrs['href'])

get_product_pages('')
But it only gives me the hrefs from the first three li elements. Why only the first three, and how can I get all eight?
The page has a scroll bar; could that be causing the problem?
Solution
The taxonomies and all other page data are stored inside a <script> tag in the page, so they are not available as li/a elements that BeautifulSoup can find. To get all child taxonomies of the current category, you can use the following example (parsing the <script> tag with re/json):
import re
import json
import requests

base_url = "https://www.ah.nl/producten"
url = base_url + "/aardappel-groente-fruit/fruit"

html_doc = requests.get(url).text
data = re.search(r"window\.__INITIAL_STATE__= ({.*})", html_doc)
data = data.group(1).replace("undefined", "null")
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

taxonomies = {t["id"]: t for t in data["taxonomy"]["topLevel"]}
for t in data["taxonomy"]["taxonomies"]:
    taxonomies[t["id"]] = t

def get_taxonomy(t, current, dupl=None):
    if dupl is None:
        dupl = set()
    tmp = current + "/" + t["slugifiedName"]
    yield tmp
    for c in t["children"]:
        if c in taxonomies and c not in dupl:
            dupl.add(c)
            yield from get_taxonomy(taxonomies[c], tmp, dupl)

for t in taxonomies.values():
    if t["parents"] == [0]:
        for t in get_taxonomy(t, base_url):
            if url in t:  # print only URL from current category
                print(t)
Prints:
https://www.ah.nl/producten/aardappel-groente-fruit/fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels/groente-en-fruitbox
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bananen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/sinaasappels-mandarijnen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/peren
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/ananas-mango-kiwi
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/aardbeien-frambozen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/druiven-kersen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bramen-bessen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen/exotisch-fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/perziken-nectarines
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/meloen-kokosnoot
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/grapefruit-minneola
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/citroen-limoen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruit-spread
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/vijgen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/kaki-papaya-cherimoya
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/granaatappel-passiefruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruitsalade-mix
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/gedroogd-fruit
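To try the same idea without hitting the live site, here is a minimal offline sketch. It assumes the page embeds its state as `window.__INITIAL_STATE__= {...}` on a single line, as in the code above. The sample HTML, the `image` field, the `example.invalid` base URL, and the helper names `extract_state` and `taxonomy_paths` are all made up for illustration; the real data has many more fields and may change at any time.

```python
import json
import re

# Hypothetical, heavily simplified stand-in for the real page state.
# The "image" key is invented so the sample contains a JavaScript `undefined`.
_state = {
    "taxonomy": {
        "topLevel": [
            {"id": 1, "slugifiedName": "fruit", "parents": [0], "children": [2, 3]},
        ],
        "taxonomies": [
            {"id": 2, "slugifiedName": "appels", "parents": [1], "children": [], "image": None},
            {"id": 3, "slugifiedName": "bananen", "parents": [1], "children": []},
        ],
    }
}

# Build a fake page: the state is a JavaScript literal, so `undefined` may appear.
SAMPLE_HTML = (
    "<html><body><script>window.__INITIAL_STATE__= "
    + json.dumps(_state).replace("null", "undefined")
    + "</script></body></html>"
)


def extract_state(html_doc):
    """Pull the embedded state object out of the HTML and parse it as JSON."""
    match = re.search(r"window\.__INITIAL_STATE__= ({.*})", html_doc)
    if match is None:
        raise ValueError("embedded state not found; the page layout may have changed")
    # `undefined` is not valid JSON. Note this blunt replace would also
    # alter any string value that happens to contain the word "undefined".
    return json.loads(match.group(1).replace("undefined", "null"))


def taxonomy_paths(state, base):
    """Yield one URL path per taxonomy, walking parent -> child links."""
    nodes = {t["id"]: t for t in state["taxonomy"]["topLevel"]}
    for t in state["taxonomy"]["taxonomies"]:
        nodes[t["id"]] = t

    seen = set()  # guards against visiting the same child twice

    def walk(node, prefix):
        path = prefix + "/" + node["slugifiedName"]
        yield path
        for child_id in node["children"]:
            if child_id in nodes and child_id not in seen:
                seen.add(child_id)
                yield from walk(nodes[child_id], path)

    for node in list(nodes.values()):
        if node["parents"] == [0]:  # top-level categories only
            yield from walk(node, base)


PATHS = list(taxonomy_paths(extract_state(SAMPLE_HTML), "https://example.invalid/producten"))
for p in PATHS:
    print(p)
```

Raising an error when the regular expression finds nothing makes a layout change on the site fail loudly instead of producing an `AttributeError` on `None`.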
Answered By - Andrej Kesely