Issue
Using Python/BeautifulSoup, I'm trying to get the post title and URL from every result returned on Reddit.
Below is part of my code that retrieves all Reddit search results.
url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('a', attrs={'data-click-id':'body'})
for result in results:
print(result.prettify())
title_post = result.find('h3').text
url_post = result.find('a')['href']
soup.find_all('a', attrs={'data-click-id':'body'})
appears to return a list of all search results. This is working as I'm expecting / hoping.
by doing print(result)
, I can validate that it is returning what I need. Below is the result of print(result.prettify())
:
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>
title_post = result.find('h3').text
extracts the title associated with the comment or post. It is working as expected / hoped.
The problem that I have is with retrieving the address of the post (see href=):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
The line url_post = result.find('a')['href']
returns an error TypeError: 'NoneType' object is not subscriptable
.
If I could use the "result" as a string, then I could just look for href within it. Something like:
loc = result.text.find('href=')
print(result.text[loc:])
Obviously, this won't work:
result.text
does not return the HTML code, but just the string "Match Thread: 3rd Test - Australia v India, Day 5"
Question 1: Is there a way to return only the href="" component?
Question 2: Is there a way to convert the soup object "result" into plain text while keeping the HTML components? If it was possible, then I'd have an easy workaround.
Solution
The href
is already in the .attrs
of result
:
>>> for result in results:
... print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...
so don't call the .find()
method, instead access the href
value using the [key]
notation (like a dictionary).
In your example:
for result in results:
url_post = result["href"]
print(url_post)
Output:
/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
Answered By - MendelG
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.