Thursday, February 1, 2024

[FIXED] Python 3 and Requests-Html: Trying to scrape a website - not getting the "real" html code back

Tags: python-3.x, python-requests-html, web-scraping

Issue

I'm trying to scrape a website, but the HTML I get back doesn't contain the content I want to parse.

I am using Python 3.12 and the Requests-HTML module to scrape several websites. For some of them it works without problems, but for "https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html" it doesn't, even though I call Requests-HTML's render function to execute the JavaScript on the page. From inspecting the website, I know that the information I am looking for is in td tags with the attribute data-label="Künstler" (German for "artist"). But the HTML obtained by scraping and rendering doesn't contain a single such tag.

I don't know what to do. Can someone help me and point me in the right direction?

from requests_html import HTMLSession


charts = {
    'ODC50': {
        'name': 'ODC50',
        'anz': 50,
        'url': 'https://www.mix1.de/charts/dance50.htm',
        'entry': 'div.charts-main-block',
        'date': '#mix1_content div.mybox_content'
    },
    'DDPHot50': {
        'name': 'DDP Hot50',
        'anz': 50,
        'url': 'https://www.deutsche-dj-playlist.de/hot-50/dance',
        'entry': 'div.list div.entry',
        'date': 'div.header div.title'
    },
    'Ostseewelle': {
        'name': 'Ostseewelle',
        'anz': 20,
        'url': 'https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html',
        'entry': 'section',
        'date': 'h3.text-center.titel1'
    }
}

choice = 'Ostseewelle'


chart_site = charts.get(choice).get('url')
session = HTMLSession()
r = session.get(chart_site)
r.html.render(sleep=2, keep_page=True, scrolldown=5, timeout=30)

print(r.status_code)

html = r.html

#print(html.html)

tds = html.xpath('//td[@data-label="Künstler"]')
print(f'Entries found: {len(tds)}')


print('Program finished')

The HTML I get back is missing the elements I expect, so there is nothing to parse.

Solution

The chart data you see on the page is loaded from an external URL, which is why it does not appear in the HTML of the page you requested. To get the artist information, request that URL directly, for example:

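import requests
from bs4 import BeautifulSoup

# The chart table is served from this URL, not from the ostseewelle.de page itself.
url = "https://enricoostendorf.de/top20/top20eo.php"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Each matching cell is expected to hold two text fragments: artist, then title.
for cell in soup.select('[data-label="Künstler"]'):
    artist, title = cell.get_text(strip=True, separator="|||").split("|||")
    print(artist)
    print(title)
    print("-" * 80)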

Prints:

...

--------------------------------------------------------------------------------
Loi
"Am I Enough"
--------------------------------------------------------------------------------
Nico Santos & Fast Boy
"Where You Are"
--------------------------------------------------------------------------------
Ofenbach
"Overdrive" (feat. Norma Jean Martine)
--------------------------------------------------------------------------------
Robin Schulz, Rita Ora, Tiago PZK
"I'll Be There"
--------------------------------------------------------------------------------
Tate McRae
"greedy"
--------------------------------------------------------------------------------
Dua Lipa
"Houdini"
--------------------------------------------------------------------------------
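
The answer does not say how the external URL was found. One possible way, sketched below, is to list the URLs that the original page pulls in through embedding tags and then inspect each candidate. This is only an illustration: it assumes the chart is embedded via a tag such as iframe, embed, object or script, which the answer does not confirm, and the names EmbeddedSourceFinder, find_embedded_sources and the sample HTML are hypothetical. The browser's developer tools (Network tab) are an alternative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedSourceFinder(HTMLParser):
    """Collect URLs that a page pulls in through embedding tags."""

    # Tag name -> attribute that holds the external URL.
    URL_ATTRIBUTES = {"iframe": "src", "embed": "src", "script": "src", "object": "data"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the URL of the page.
                self.sources.append((tag, urljoin(self.base_url, value)))


def find_embedded_sources(html_text, base_url):
    """Return a list of (tag, absolute_url) pairs found in html_text."""
    finder = EmbeddedSourceFinder(base_url)
    finder.feed(html_text)
    finder.close()
    return finder.sources


# Hypothetical page fragment for illustration; the real markup may differ.
sample_html = """
<html><body>
  <h3 class="text-center titel1">Charts</h3>
  <iframe src="https://example.org/charts/top20.php"></iframe>
  <script src="/static/app.js"></script>
</body></html>
"""

for tag, url in find_embedded_sources(sample_html, "https://example.com/page.html"):
    print(tag, url)
```

For a real page, you would pass requests.get(page_url).text and page_url instead of the sample values, then open each reported URL to see whether it contains the data-label="Künstler" cells.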

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
