Monday, January 15, 2024

[FIXED] Can't get text with Beautiful Soup from between <p> </p>

Labels: beautifulsoup, html, parsing, python, python-requests

Issue

import requests
from bs4 import BeautifulSoup

URL = "https://habr.com/ru/hubs/gamedev/articles/"  # URL of the website

page = requests.get(URL).content
soup = BeautifulSoup(page, "html.parser")
post = soup.find("article", class_="tm-articles-list__item")  # Last post, the one I need to parse

description = post.find_all('p')
for post_text in description:       # Trying to separate the text
    text = post_text.get_text()

print(text)

I get this error:

  File "d:\CODING\Projects\net N FV.py", line 14, in <module>
    print(text)
          ^^^^
NameError: name 'text' is not defined

Or I get text that I don't need.

On the website, the HTML of the post I'm parsing looks like this:

<div class="article-formatted-body article-formatted-body article-formatted-body_version-2"> 
<p> 
"Сегодня первой игре из серии DOOM исполняется ровно 30 лет! Мы не могли обойти стороной это событие и в честь этого решили посмотреть, как же выглядит код этой легендарной игры спустя годы."
 </p>
<p></p> 
::after
</div>
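The NameError can be reproduced offline. In the code above, text is assigned only inside the for loop, so when find_all('p') returns an empty list the loop body never runs and print(text) fails. When there are several <p> tags, only the last one is kept. Below is a minimal sketch that collects every non-empty paragraph instead. The HTML snippet (SAMPLE_HTML) is made up for illustration, and paragraph_texts is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page structure may differ.
SAMPLE_HTML = """
<article class="tm-articles-list__item">
  <div class="article-formatted-body">
    <p>First paragraph.</p>
    <p></p>
  </div>
</article>
<article class="tm-articles-list__item"><div></div></article>
"""


def paragraph_texts(article):
    """Return the stripped text of every non-empty <p> inside `article`."""
    texts = []
    for p in article.find_all("p"):
        text = p.get_text(strip=True)
        if text:  # skip empty <p></p> placeholders
            texts.append(text)
    return texts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
articles = soup.find_all("article", class_="tm-articles-list__item")

with_text = paragraph_texts(articles[0])  # article that contains a filled <p>
without_p = paragraph_texts(articles[1])  # article with no <p> at all

print(with_text)
print(without_p)
```

Because the helper always returns a list, the caller can check whether it is empty instead of relying on a variable that may never have been assigned.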

Solution

The text you see on the page is stored inside a <script> element, in the JSON assigned to window.__INITIAL_STATE__. To parse it, you can use the following example:

import re
import json

import requests
from bs4 import BeautifulSoup

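URL = "https://habr.com/ru/hubs/gamedev/articles/"  # URL of the website

page = requests.get(URL).text

# The article data is embedded in a <script> as window.__INITIAL_STATE__={...};
match = re.search(r"window\.__INITIAL_STATE__=(.*}});", page)
if match is None:
    raise RuntimeError("window.__INITIAL_STATE__ not found in the page")

data = json.loads(match.group(1))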

for a in sorted(
    data["articlesList"]["articlesList"].values(),
    key=lambda k: k["timePublished"],
    reverse=True,
):
    print(a["titleHtml"])
    print(BeautifulSoup(a["leadData"]["textHtml"], "html.parser").text)

    # we want just first article
    break

Prints:

30 лет DOOM: новый код — новые баги
Сегодня первой игре из серии DOOM исполняется ровно 30 лет! Мы не могли обойти стороной это событие и в честь этого решили посмотреть, как же выглядит код этой легендарной игры спустя годы.
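The same idea can be tried without a network request. The sketch below swaps the regular expression for json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it, so it does not depend on how the script ends. This is an alternative, not the method used in the answer. The page source (SAMPLE_PAGE), its timestamps and titles, and the helper name extract_initial_state are all made up for illustration. Only the window.__INITIAL_STATE__ marker and the key names come from the example above.

```python
import json

MARKER = "window.__INITIAL_STATE__="

# Made-up page source; only the marker and the key names mirror the real example.
SAMPLE_PAGE = (
    "<html><body><div id='app'></div><script>"
    + MARKER
    + json.dumps(
        {
            "articlesList": {
                "articlesList": {
                    "1": {
                        "timePublished": "2023-12-09T10:00:00+00:00",
                        "titleHtml": "Older post",
                    },
                    "2": {
                        "timePublished": "2023-12-10T10:00:00+00:00",
                        "titleHtml": "Newer post",
                    },
                }
            }
        }
    )
    + ";(function(){})();</script></body></html>"
)


def extract_initial_state(page):
    """Return the JSON object assigned to window.__INITIAL_STATE__ in `page`."""
    start = page.find(MARKER)
    if start == -1:
        raise ValueError("window.__INITIAL_STATE__ not found")
    # raw_decode parses a single JSON value and reports where it ended,
    # so the trailing ";" and any further script code are ignored.
    state, _end = json.JSONDecoder().raw_decode(page[start + len(MARKER):])
    return state


state = extract_initial_state(SAMPLE_PAGE)
articles = state["articlesList"]["articlesList"].values()

# Equivalent to sorting by timePublished in reverse and taking the first item.
latest = max(articles, key=lambda a: a["timePublished"])
print(latest["titleHtml"])
```

On the real page, the HTML in a field such as leadData.textHtml could then be passed through BeautifulSoup(...).text, as the answer does.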

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
