Issue
I'm teaching myself BeautifulSoup and trying to scrape some Reddit post titles. However, the resulting list contains only 8 titles, even though the page contains many more (I checked by saving it). What am I doing wrong, and how can I scrape the whole page?
This is my code:
from bs4 import BeautifulSoup as bs
import requests

page = requests.get("https://www.reddit.com/r/RedditWritesSeinfeld/search/?q=flair%3Aprompt&restrict_sr=1&sr_nsfw=&t=all&sort=top")
soup = bs(page.content, 'html.parser')
soupbody = soup.select("div h3")  # Selects one element lists of all reddit titles

def listreddittitles(l):  # returns a list of all reddit post titles as strings
    temp = []
    for i in l:
        temp.append(i.contents[0])
    return temp

reddittitles = listreddittitles(soupbody)
print(len(reddittitles))
input()
Solution
The data is most probably loaded dynamically via JavaScript, so the HTML that requests receives contains only the first few posts. To get the rest, you need to simulate the Ajax calls with the requests module.
Or:
You can append .json to the URL and receive the data from the server in JSON format:
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0"
}

params = {
    "q": "flair:prompt",
    "restrict_sr": "1",
    "sr_nsfw": "",
    "t": "all",
    "sort": "top",
}

data = requests.get(
    "https://www.reddit.com/r/RedditWritesSeinfeld/search/.json",
    params=params,
    headers=headers,
).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for i, d in enumerate(data["data"]["children"], 1):
    print("{:<3} {}".format(i, d["data"]["title"][:50]))
Prints:
1 Jerry’s Australian girlfriend buys him a birthday
2 George impresses his girlfriend with his generosit
3 George accidentally eats a pot brownie at a party.
4 George accidentally writes "Congrats! Way to Go!"
5 What would George’s social media profiles look lik
6 In a very special episode, George, at his father F
7 Elaine starts dating a guy named George, so the re
8 There is a big protest in the city. George is inte
9 Jerry, horrified to find a mouse in his apartment,
10 George tries to find out what Doctor his Doctor se
11 George throws a tantrum on a date when they go to
12 Jerry dates a marketing exec he met through Elaine
13 A true crime podcast accuses a notorious killer of
14 Kramer notices that a lot of runners are using the
15 “The George” - The name “George” becomes a viral t
16 Jerry dates a beautiful woman who has a quirk that
17 "She's a floor-sleeper!"
18 Jerry’s new girlfriend starts saying “y’all” even
19 George developes an app that rates public restroom
20 Elaine's boyfriend's parrot says "Julia" over and
21 The Gauntlet - George tries to wipe out half of al
22 George’s gf’s weighted blanket is too heavy for in
23 [Prompt] George saves a pregnant woman who is so g
24 Jerry's Italian girlfriend calls George "Giorgio;"
25 Frank might be deported to Italy due to an old pap
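The request above returns a single page of 25 results. As a rough sketch of how to get more, assuming the search endpoint follows Reddit's usual listing format (an "after" token in the response and a "limit" parameter, which as far as I know is capped at 100), you could keep requesting pages until the token runs out. The function names and the max_pages guard below are made up for illustration, not part of any library:

```python
def titles_from_listing(listing):
    """Return the post titles contained in one Reddit listing dict."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


def next_page_token(listing):
    """Return the listing's 'after' token, or None when there is no next page."""
    return listing["data"].get("after")


def fetch_all_titles(url, params, headers, max_pages=10):
    """Follow the 'after' token until it runs out or max_pages is reached.

    Assumption: the endpoint honours the 'limit' and 'after' query parameters.
    """
    import requests  # imported here so the two helpers above work without it

    titles = []
    params = dict(params, limit=100)  # copy, so the caller's dict is untouched
    for _ in range(max_pages):
        response = requests.get(url, params=params, headers=headers, timeout=10)
        response.raise_for_status()
        listing = response.json()
        titles.extend(titles_from_listing(listing))
        after = next_page_token(listing)
        if after is None:
            break
        params["after"] = after  # ask for the page that follows this one
    return titles


# Hypothetical usage, reusing params and headers from the snippet above:
# all_titles = fetch_all_titles(
#     "https://www.reddit.com/r/RedditWritesSeinfeld/search/.json", params, headers
# )
```

The max_pages limit is only there so that an unexpected response cannot make the loop run forever; adjust or remove it as needed.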
Or:
Use the URL in the form "https://old.reddit.com/r/RedditWritesSeinfeld/search/" (note the old. at the beginning) and parse it with the BeautifulSoup library.
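A minimal sketch of the old.reddit.com route might look like the following. It assumes that each search result title on the old layout is an a element with the class search-title; that selector comes from inspecting the page rather than from the answer, so check it in your browser first. The function name parse_search_titles is hypothetical.

```python
from bs4 import BeautifulSoup


def parse_search_titles(html, selector="a.search-title"):
    """Return the text of every element matching selector.

    Assumption: on old.reddit.com search pages each result title is an
    <a class="search-title ..."> element. Pass another selector if not.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select(selector)]


# Hypothetical usage, reusing params and headers from the JSON example:
# import requests
# html = requests.get(
#     "https://old.reddit.com/r/RedditWritesSeinfeld/search/",
#     params=params,
#     headers=headers,
# ).text
# print(len(parse_search_titles(html)))
```

Keeping the parsing in its own function means you can test it on a saved copy of the page without hitting Reddit on every run.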
Answered By - Andrej Kesely