Issue
I am trying to scrape a website using BeautifulSoup in Python. The whole page is fetched, including all the links I am trying to get to. However, when I call .findAll(), it returns only some of the links I am looking for. Specifically, only the links under the following XPath are returned:
/html/body/div[1]/div/div[2]/div/div[2]/div[1]
This ignores the links in /html/body/div[1]/div/div[2]/div/div[2]/div[2], /html/body/div[1]/div/div[2]/div/div[2]/div[3], etc.
import requests
from bs4 import BeautifulSoup
url = "https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/"
response = requests.get(url)
content = BeautifulSoup(response.content, "html.parser")
mp_pages = []
mps = content.findAll(attrs={'class': 'sc-907102a3-0 sc-e6d2fd61-0 gOAsvA jBTDjv'})
for x in mps:
    mp_pages.append(x.get('href'))
print(mp_pages)
I expected all links to be appended to the mp_pages list, but only the links in the first group (names starting with A) were returned. The search seems to stop at the last child of that first parent instead of continuing with the next one.
I have seen similar questions where the answer was to use Selenium because the page relies on JavaScript, but since all the links appear to be in content, that does not seem to apply here.
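One way to test that assumption is to compare how many <a> tags the parser actually finds with how often a link path occurs anywhere in the raw text, including inside <script> blocks. The sketch below runs on a small made-up HTML string; the markup, the /ledamot/ paths and the helper name count_links are illustrative assumptions, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a page that renders one link as HTML
# and keeps the remaining ones only inside an embedded JSON <script> block.
SAMPLE_HTML = """
<html><body>
<div><a class="mp" href="/ledamot/a-one/">A One</a></div>
<script id="__NEXT_DATA__" type="application/json">
{"items": [{"url": "/ledamot/a-one/"}, {"url": "/ledamot/b-two/"}, {"url": "/ledamot/c-three/"}]}
</script>
</body></html>
"""


def count_links(html, needle):
    """Return (number of <a> tags whose href contains needle,
    number of raw occurrences of needle in the whole document)."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = [a for a in soup.find_all("a", href=True) if needle in a["href"]]
    return len(anchors), html.count(needle)


anchor_count, raw_count = count_links(SAMPLE_HTML, "/ledamot/")
print(anchor_count, raw_count)
```

If the raw count is much larger than the anchor count, most of the links are present only as text inside a script or data block, so a tag search such as .findAll() cannot return them.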
Solution
The data you see on the page is stored as JSON inside a <script> element. To parse it, you can use the following example:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.riksdagen.se/sv/ledamoter-och-partier/ledamoterna/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
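# The page data is embedded as JSON in the element with id "__NEXT_DATA__"
data = json.loads(soup.select_one('#__NEXT_DATA__').text)
# print(json.dumps(data, indent=4))  # uncomment to inspect the full structure
# Collect a (name, URL) pair for every member
all_data = []
for c in data['props']['pageProps']['contentApiData']['commissioners']:
    all_data.append((f'{c["callingName"]} {c["surname"]}', c['url']))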
df = pd.DataFrame(all_data, columns=['Name', 'URL'])
print(df)
Prints:
Name URL
0 Fredrik Ahlstedt https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/fredrik-ahlstedt_8403346f-0f0c-4d48-bbd0-f6b43b368873/
1 Emma Ahlström Köster https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/emma-ahlstrom-koster_e09d9076-28c7-4583-a17f-7a776de7f01f/
2 Alireza Akhondi https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/alireza-akhondi_4099ff9c-5d27-4605-b018-98fb229d94fa/
3 Anders Alftberg https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/anders-alftberg_f0d945f3-9449-458e-ba40-1a0da1a72303/
4 Leila Ali Elmi https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/leila-ali-elmi_5997ba96-4f01-46f4-8bd8-e1411a9d503b/
5 Janine Alm Ericson https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/janine-alm-ericson_7e408079-a5cd-432a-a30e-fd61fd15c65a/
6 Ann-Sofie Alm https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/ann-sofie-alm_f91f6a86-591c-449c-b3dd-1fdaa86338cd/
7 Sofia Amloh https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/sofia-amloh_359e75f3-519e-49d7-b155-ada488e621ea/
8 Andrea Andersson Tay https://www.riksdagen.se/sv/ledamoter-och-partier/ledamot/andrea-andersson-tay_352b875d-e44d-43f5-bf93-e507770c12de/
...and so on.
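The same extraction can be tried offline, without fetching the page. This is a minimal sketch that assumes the JSON nesting used in the code above; the sample HTML, the names, the example.org URLs and the helper extract_members are made up for illustration. It also checks each key on the way down, since the embedded structure may change without notice:

```python
import json

from bs4 import BeautifulSoup

# Made-up sample that mimics the nesting used in the code above.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"contentApiData": {"commissioners": [
  {"callingName": "Anna", "surname": "Example", "url": "https://example.org/anna/"},
  {"callingName": "Bo", "surname": "Sample", "url": "https://example.org/bo/"}
]}}}}
</script>
</body></html>
"""

# Keys to follow from the JSON root down to the list of members.
KEY_PATH = ("props", "pageProps", "contentApiData", "commissioners")


def extract_members(html):
    """Return a list of (full name, URL) tuples from the embedded JSON."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one("#__NEXT_DATA__")
    if script is None:
        raise ValueError("no element with id __NEXT_DATA__ found")
    node = json.loads(script.get_text())
    # Walk down the nested dictionaries, failing with a clear message
    # if the structure differs from what is assumed here.
    for key in KEY_PATH:
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"missing key {key!r} in embedded JSON")
        node = node[key]
    return [(f'{c["callingName"]} {c["surname"]}', c["url"]) for c in node]


members = extract_members(SAMPLE_HTML)
print(members)
```

The returned list can be passed to pd.DataFrame(members, columns=['Name', 'URL']) exactly as in the code above.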
Answered By - Andrej Kesely