Issue
I want to scrape the headlines and paragraph texts from the Google News search results page for a given search term, and I want to do that for the first n pages.
I have written a piece of code that scrapes the first page only, but I do not know how to modify my URL
so that I can go to the other pages (page 2, 3, ...). That is my first problem.
My second problem is that I do not know how to scrape the headlines: every selector I have tried returns an empty list. (I do not think the page is dynamic.)
On the other hand, scraping the paragraph text below each headline works perfectly. Can you tell me how to fix these two problems?
This is my code:
from bs4 import BeautifulSoup
import requests
term = 'cocacola'
# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# I think this does not depend on JavaScript; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
# headline_results = soup.find_all('h3', class_="r dO0Ag")  # also does not work
print(headline_results)  # empty list, I don't know why
paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works
Solution
Problem One: Flipping the page.
To move to the next page, include the start query parameter in your formatted URL string. It is the zero-based offset of the first result, so with 10 results per page, page 2 begins at start=10:
term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
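To cover the first n pages, the same offset rule can be applied in a loop. The following is a minimal sketch, not a definitive implementation: the helper name build_page_urls and its results_per_page parameter are hypothetical, and it assumes Google keeps serving 10 results per page as in the snippet above. It only builds the URLs, so it runs without network access.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.google.com/search'


def build_page_urls(term, n_pages, results_per_page=10):
    """Return the search URLs for pages 1..n_pages (hypothetical helper)."""
    urls = []
    for page in range(1, n_pages + 1):
        params = {
            'q': term,
            'source': 'lnms',
            'tbm': 'nws',
            # Zero-based offset of the first result on this page.
            # Assumes 10 results per page, as in the answer above.
            'start': (page - 1) * results_per_page,
        }
        # urlencode also escapes spaces and special characters in the term.
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls


for url in build_page_urls('cocacola', 3):
    print(url)
```

Each URL can then be passed to requests.get and parsed exactly as in the question. Building the query string with urlencode rather than str.format keeps search terms that contain spaces or symbols valid.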
Problem Two: Scraping the headlines.
Google regenerates the class names, ids, and other attributes of its DOM elements, so selectors that depend on them, such as class_="l lLrAF", are likely to break whenever you retrieve new, uncached content.
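One way to be less sensitive to regenerated class names is to select by tag structure instead. The sketch below is an illustration under assumptions, not a confirmed fix: it assumes each headline sits inside an h3 element, as the selectors in the question suggest, and the sample markup and the HeadingParser class are made up. It uses only the standard library so it runs offline; with BeautifulSoup the same idea would be soup.find_all('h3') followed by get_text() on each result.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text of every <h3>, ignoring class names (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False  # True while inside an <h3>
        self.fragments = []      # text pieces of the heading being read
        self.headlines = []      # finished headline strings

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_heading = True
            self.fragments = []

    def handle_endtag(self, tag):
        if tag == 'h3' and self.in_heading:
            self.in_heading = False
            # Join the pieces and collapse runs of whitespace.
            self.headlines.append(' '.join(''.join(self.fragments).split()))

    def handle_data(self, data):
        if self.in_heading:
            self.fragments.append(data)


# Made-up markup for illustration only; real Google output will differ.
sample_html = '''
<div class="g"><h3 class="r aB12c"><a class="l zQ9xY" href="https://example.com/a">First headline</a></h3>
<div class="st">First snippet</div></div>
<div class="g"><h3 class="r aB12c"><a class="l zQ9xY" href="https://example.com/b">Second <em>headline</em></a></h3>
<div class="st">Second snippet</div></div>
'''

parser = HeadingParser()
parser.feed(sample_html)
print(parser.headlines)
```

Because the match depends only on the tag name, it keeps working if the class attribute changes, but it will still break if Google moves the headline out of an h3 element.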
Answered By - Arn