Monday, December 4, 2023

[FIXED] BeautifulSoup ingests all data but .findAll() returns only the links of one parent

December 04, 2023 beautifulsoup, html, python, python-3.x, web-scraping No comments

Issue

I am trying to scrape a website using BeautifulSoup in Python. All data is ingested, including all the links I am trying to get to. However, when I use the .findAll() function it returns only a part of the links I am looking for. That is to say only the links in the following xpath are returned

/html/body/div[1]/div/div[2]/div/div[2]/div[1]

This ignores the links in /html/body/div[1]/div/div[2]/div/div[2]/div[2] /html/body/div[1]/div/div[2]/div/div[2]/div[3] etc.

import requests
from bs4 import BeautifulSoup

url = "https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/"
response = requests.get(url)
content = BeautifulSoup(response.content, "html.parser")
mp_pages = []
mps = content.findAll(attrs = {'class': 'sc-907102a3-0 sc-e6d2fd61-0 gOAsvA jBTDjv'})
for x in mps:
    mp_pages.append(x.get('href'))

print(mp_pages)

I expected all links to be appended in the mp_pages list, but it only went down one parent (those starting with A), seemingly stopping at the last child, not continuing.

I have seen similar questions where the answer was using selenium due to javascript, but since all the links are in content, that makes no sense.

Solution

The data you see on the page is stored inside <script> element in Json form. To parse it you can use next example:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = json.loads(soup.select_one('#__NEXT_DATA__').text)

# print(json.dumps(data, indent=4))

all_data = []
for c in data['props']['pageProps']['contentApiData']['commissioners']:
    all_data.append((f'{c["callingName"]} {c["surname"]}', c['url']))

df = pd.DataFrame(all_data, columns=['Name', 'URL'])
print(df)

Prints:

                              Name                                                                                                                            URL
0                 Fredrik Ahlstedt               https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/fredrik-ahlstedt_8403346f-0f0c-4d48-bbd0-f6b43b368873/
1             Emma Ahlström Köster           https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/emma-ahlstrom-koster_e09d9076-28c7-4583-a17f-7a776de7f01f/
2                  Alireza Akhondi                https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/alireza-akhondi_4099ff9c-5d27-4605-b018-98fb229d94fa/
3                  Anders Alftberg                https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/anders-alftberg_f0d945f3-9449-458e-ba40-1a0da1a72303/
4                   Leila Ali Elmi                 https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/leila-ali-elmi_5997ba96-4f01-46f4-8bd8-e1411a9d503b/
5               Janine Alm Ericson             https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/janine-alm-ericson_7e408079-a5cd-432a-a30e-fd61fd15c65a/
6                    Ann-Sofie Alm                  https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/ann-sofie-alm_f91f6a86-591c-449c-b3dd-1fdaa86338cd/
7                      Sofia Amloh                    https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/sofia-amloh_359e75f3-519e-49d7-b155-ada488e621ea/
8             Andrea Andersson Tay           https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/andrea-andersson-tay_352b875d-e44d-43f5-bf93-e507770c12de/

...and so on.

Answered By - Andrej Kesely

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, December 4, 2023

[FIXED] BeautifulSoup ingests all data but .findAll() returns only the links of one parent

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels