Issue
This is a follow-up to a question I asked earlier. I got a very good answer, but I did not fully understand the code. Please help me scrape information from the following websites.
- https://premieragile.com/csm-training/
- https://www.simplilearn.com/agile-and-scrum/csm-certification-training
I want all the information shown in each card. Below is the program I am using, which I got from Stack Overflow.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://premieragile.com/csm-training/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for row in soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4"):
    date = row.findAll(".d-flex align-items-center pb-4 h6").text.strip()
    # year = row.select_one(".li .batchDetails .date-details .date span").text.strip()
    # rating = row.select_one(".imdbRating").text.strip()
    # ...other variables
    all_data.append([date])
df = pd.DataFrame(all_data, columns=["date"])
print(df.head().to_markdown(index=False))
Please explain how I should add the div class in the for loop, and what the hierarchy of the following elements should be:
- div
- li
- h
- ul
- li
Please help me understand this. I get the general idea that we create an empty list and append data to it using the BeautifulSoup object. I am utterly confused about how to study the website I want to scrape, and therefore how to add columns to each row in the program.
P.S. I am getting blank output.
Solution
The content is loaded dynamically from another resource. It is not contained in your soup, which is why you get an empty output.
Load it directly from this resource instead, and adjust the query parameters to your needs: https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin
url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
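To make "adjust the parameters" concrete, here is a minimal sketch that builds the same URL from a dictionary using only the standard library. The helper name `build_schedule_url` is hypothetical, and whether the server accepts other values for `page`, `countryCode` or `timezone` is an assumption; the remaining parameters are copied unchanged from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://premieragile.com/csm-training/"


def build_schedule_url(page=1, country_code="DE", timezone="Europe/Berlin"):
    """Return the schedule URL for one page (hypothetical helper)."""
    params = {
        "page": page,
        "id": "ol",                # copied as-is from the URL above
        "city": "",
        "countryCode": country_code,
        "trainerid": "undefined",  # copied as-is from the URL above
        "timezone": timezone,
    }
    # safe="/" keeps the slash in "Europe/Berlin" unescaped, as in the URL above
    return BASE_URL + "?" + urlencode(params, safe="/")


print(build_schedule_url())
print(build_schedule_url(page=2))
```

The same dictionary could also be passed to `requests.get(BASE_URL, params=params, ...)`, which encodes the query string for you.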
The HTML is wrapped in a JSON structure, so you have to specify the key from which the BeautifulSoup object should be created (here 'html'):
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
# The endpoint returns JSON; the rendered cards are stored under the 'html' key
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
soup = BeautifulSoup(r, 'html.parser')

all_data = []
# Iterate over every element with class "loop"
for e in soup.select('.loop'):
    all_data.append({
        'trainer': e.h6.text.strip(),
        # collapse the multi-line date text into a single line
        'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
    })

df = pd.DataFrame(all_data)
# to_markdown() needs the optional 'tabulate' package
print(df.head().to_markdown(index=False))
Output
| trainer | date |
|---|---|
| Daniel James Gullo | 08 Jul - 08 Jul - 2022 |
| Raj Kasturi | 11 Jul - 13 Jul - 2022 |
| Michel Goldenberg | 11 Jul - 12 Jul - 2022 |
| Valerio Zanini | 12 Jul - 14 Jul - 2022 |
| Michael Franken | 13 Jul - 15 Jul - 2022 |
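On the selector question from the issue: apart from the dynamic loading, the selector in the original loop would likely match nothing anyway. In a CSS selector, words separated by spaces are read as nested tag names, so several classes on one element have to be chained with dots and no spaces. Also, `findAll` expects a tag name rather than a CSS selector and returns a list, which has no `.text`. The following is a minimal offline sketch; the HTML snippet is hypothetical, loosely modelled on the class names in the question, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the real site.
html = """
<div class="row">
  <div class="schedules-courses br-10 h-100 p-3 p-sm-4">
    <div class="d-flex align-items-center pb-4"><h6>Trainer A</h6></div>
    <ul><li>08 Jul - 08 Jul - 2022</li></ul>
  </div>
  <div class="schedules-courses br-10 h-100 p-3 p-sm-4">
    <div class="d-flex align-items-center pb-4"><h6>Trainer B</h6></div>
    <ul><li>11 Jul - 13 Jul - 2022</li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Space-separated words are treated as nested tag names, so nothing matches.
broken = soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4")

# One dot per class, no spaces between classes of the same element.
cards = soup.select(".row > .schedules-courses.br-10")

rows = []
for card in cards:
    # select_one returns a single element or None; select returns a list
    trainer = card.select_one(".d-flex.align-items-center h6")
    date = card.select_one("ul > li")
    rows.append({
        "trainer": trainer.get_text(strip=True) if trainer else None,
        "date": date.get_text(strip=True) if date else None,
    })

print(len(broken))
print(rows)
```

Reading the hierarchy from the outside in (card, then the heading or list item inside it) and writing one `select_one` per column is one way to extend the loop with further fields.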
Answered By - HedgeHog