Thursday, December 16, 2021

[FIXED] Parsing a Reddit search result with BeautifulSoup and Python


Issue

Using Python and BeautifulSoup, I'm trying to get the post title and URL from every result returned by a Reddit search.

Below is part of my code that retrieves all Reddit search results.

url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('a', attrs={'data-click-id':'body'})
for result in results:
    print(result.prettify())
    title_post = result.find('h3').text
    url_post = result.find('a')['href']

soup.find_all('a', attrs={'data-click-id':'body'}) appears to return a list of all search results, which is what I expect.

Printing each result confirms that it contains what I need. Below is the output of print(result.prettify()) for one result:

<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>

title_post = result.find('h3').text extracts the title of the post. This works as expected.

The problem that I have is with retrieving the address of the post (see href=):

<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">

The line url_post = result.find('a')['href'] raises TypeError: 'NoneType' object is not subscriptable.

If I could use the "result" as a string, then I could just look for href within it. Something like:

loc = result.text.find('href=')
print(result.text[loc:])

This doesn't work, because result.text returns only the text content ("Match Thread: 3rd Test - Australia v India, Day 5"), not the HTML.

Question 1: Is there a way to return only the href="" component?

Question 2: Is there a way to convert the soup object "result" into a string that keeps the HTML markup? If that were possible, I'd have an easy workaround.

Solution

The href is already in the .attrs of result:

>>> for result in results:
...     print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...

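Each result is already the <a> tag, so result.find('a') searches its descendants for another <a>, finds none, and returns None. Don't call .find(); instead, access the href value with [key] notation, as you would with a dictionary.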
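To see why the original call fails, here is a minimal, self-contained sketch. It assumes a trimmed copy of the markup from the question (class names dropped) rather than a live Reddit response, and the "title" attribute looked up at the end is purely hypothetical, used only to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup printed in the question (class names omitted).
html = """
<a data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
  <h3><span>Match Thread: 3rd <em>Test</em> - Australia v India, Day 5</span></h3>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
result = soup.find("a", attrs={"data-click-id": "body"})

# find() searches only the descendants of `result`.
# There is no <a> nested inside this <a>, so it returns None,
# and None['href'] is what raises the TypeError.
nested = result.find("a")
print(nested)

# The href attribute belongs to `result` itself.
href = result["href"]
print(href)

# .get() returns None instead of raising KeyError when an attribute is absent.
# "title" is a hypothetical attribute that this tag does not have.
missing = result.get("title")
print(missing)
```

Using result.get("href") instead of result["href"] may be the safer choice in a loop if some matched tags could lack an href.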

In your example:

for result in results:
    url_post = result["href"]
    print(url_post)

Output:

/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
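On Question 2: calling str() on a tag returns its HTML markup as a string, so a string-based workaround would also be possible, although attribute access is simpler. The sketch below also joins the relative href with a base URL. It assumes a shortened copy of the question's markup (the title text is abbreviated) and takes https://www.reddit.com as the base, inferred from the search URL in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Shortened copy of one search result from the question (title abbreviated).
html = (
    '<a data-click-id="body" '
    'href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">'
    "<h3>Match Thread</h3></a>"
)
result = BeautifulSoup(html, "html.parser").find("a")

# str() serialises the tag, including its attributes, back to HTML.
markup = str(result)
has_href = 'href="' in markup
print(has_href)

# The href is site-relative; join it with the base URL to get a full link.
# The base URL is an assumption taken from the question's search URL.
full_url = urljoin("https://www.reddit.com", result["href"])
print(full_url)
```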

Answered By - MendelG

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
