Thursday, July 14, 2022

[FIXED] Unable to get data from <li> _data_ </li> while making a web scraper using Python

Labels: beautifulsoup, html, pandas, python, web-scraping

Issue

This is a follow-up to a question I asked earlier. I got a very good answer, but I did not fully understand the code. Please help me scrape information from the following website.

Here I want all the information given in each card. I am also adding the program I am using, which I got from Stack Overflow.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4"):
    date = row.findAll(".d-flex align-items-center pb-4 h6").text.strip()
#     year = row.select_one(".li .batchDetails .date-details .date span").text.strip()
#     rating = row.select_one(".imdbRating").text.strip()
    # ...other variables

    all_data.append([date])


df = pd.DataFrame(all_data, columns=["date"])
print(df.head().to_markdown(index=False))

Here, please explain how I should add the div class in the 'for loop', and also what the hierarchy will be of the

Please help me understand this. I got the general idea that we are creating an empty list and adding data to it using the BeautifulSoup object. I am utterly confused about how I should study the website I want to scrape, and thus how to add a column to each row in the program.

P.S. I am getting blank output.

Solution

The content is loaded dynamically from another resource. It is not contained in your soup, which is why you get empty output.

Simply load it from this resource: https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin and adjust the parameters to your needs.

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
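As a rough sketch of adjusting those parameters, the query string can be built from a dictionary instead of being edited by hand. The helper name build_schedule_url is hypothetical, and the meaning of each parameter is inferred only from its name in the URL above; how the endpoint responds to other values has not been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://premieragile.com/csm-training/"


def build_schedule_url(page=1, country_code="DE", timezone="Europe/Berlin", city=""):
    """Assemble the request URL from its query parameters (hypothetical helper)."""
    params = {
        "page": page,
        "id": "ol",
        "city": city,
        "countryCode": country_code,
        "trainerid": "undefined",
        "timezone": timezone,
    }
    # safe="/" keeps the slash in "Europe/Berlin" unescaped, matching the URL above
    return BASE_URL + "?" + urlencode(params, safe="/")


print(build_schedule_url(page=1))
```

With the defaults this reproduces the URL from the answer; passing a different page number is presumably how further result pages would be requested.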

The HTML is wrapped in a JSON structure, so you have to specify the key from which the BeautifulSoup object should be created.

r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
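To illustrate the shape being described, here is a minimal sketch that works on a made-up response body. Only the 'html' key and the 'loop' class come from the answer; the sample markup and the helper name extract_html are assumptions for illustration.

```python
import json

# Hypothetical response body: a JSON object whose "html" value holds markup.
sample_body = '{"html": "<div class=\\"loop\\"><h6>Trainer Name</h6></div>"}'


def extract_html(body):
    """Return the HTML fragment stored under the "html" key of a JSON body."""
    payload = json.loads(body)
    if "html" not in payload:
        raise KeyError("response JSON has no 'html' key")
    return payload["html"]


fragment = extract_html(sample_body)
print(fragment)
```

The resulting string is what gets handed to BeautifulSoup; the JSON wrapper itself is never parsed as HTML.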

Example

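import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"

# The endpoint returns JSON; the card markup is stored under the 'html' key
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(r, 'html.parser')

all_data = []
for e in soup.select('.loop'):  # one element per course card
    all_data.append({
        'trainer': e.h6.text.strip(),
        # collapse the multi-line date text into a single line
        'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
    })

df = pd.DataFrame(all_data)
# to_markdown() requires the optional 'tabulate' package
print(df.head().to_markdown(index=False))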

Output

| trainer            | date                   |
|:-------------------|:-----------------------|
| Daniel James Gullo | 08 Jul - 08 Jul - 2022 |
| Raj Kasturi        | 11 Jul - 13 Jul - 2022 |
| Michel Goldenberg  | 11 Jul - 12 Jul - 2022 |
| Valerio Zanini     | 12 Jul - 14 Jul - 2022 |
| Michael Franken    | 13 Jul - 15 Jul - 2022 |
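
A side note on the selectors in the question: even on static markup they would not match. In a CSS selector, several class names on one element are chained with dots, whereas a space means "descendant element". Also, findAll expects tag names or attribute filters rather than a CSS selector, and it returns a list of results, which has no .text attribute. The sketch below uses hand-written markup loosely modelled on the class names in the question; it is an assumption for illustration, not the site's actual HTML.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in markup; NOT the site's real HTML.
html = """
<div class="row">
  <div class="schedules-courses br-10 h-100 p-3 p-sm-4">
    <div class="d-flex align-items-center pb-4"><h6>Sample Trainer</h6></div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Chain class names with dots; a space would mean "descendant element".
cards = soup.select(".row > .schedules-courses.br-10.h-100.p-3.p-sm-4")

names = []
for card in cards:
    # select_one takes a CSS selector and returns one tag (or None)
    heading = card.select_one(".d-flex.align-items-center.pb-4 h6")
    if heading is not None:
        names.append(heading.get_text(strip=True))

print(names)
```

Checking for None before reading the text avoids an AttributeError when a card lacks the expected heading.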

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
