Issue
I would like to scrape data (e.g., market capitalization, P/E ratio) from Google Finance using Python's BeautifulSoup library. However, when I try to extract elements such as "div", "tr", or "td" from the HTML of the corresponding Google Finance page with the "find_all" function, I always get an empty list (i.e., the "base" object in the code below is empty).
While debugging, I printed the "soup" object and compared it with the HTML shown in the browser. To my surprise, the two differ. I would expect them to match for the extraction to succeed.
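from bs4 import BeautifulSoup
import urllib.request

# Send the request with a minimal User-Agent header.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('https://www.google.com/search?q=NASDAQ:GOOGL')

# Parse the response and look for the <div> elements that should hold the data.
soup = BeautifulSoup(response, 'html.parser')
base = soup.find_all('div', {'class': 'ZSM8k'})

print(soup)  # differs from the HTML shown in the browser
print(base)  # empty list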
Solution
The server decides what content it sends you, and it may send different HTML depending on the request headers. The best you can do is make your request look as much as possible like the one your browser sends. In your case, this might mean sending a full browser User-Agent string:
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36')]
If I am not mistaken, this gives you what you want. If you like, you can then remove parts of the string by trial and error to find out which ones are actually needed.
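Putting the pieces together, a minimal sketch of the whole flow might look like the following. The helper names `build_browser_like_opener` and `fetch_divs` are made up for illustration. The class name 'ZSM8k' is taken from the question, and it is an assumption that the page still uses it. Whether the server returns the same HTML as the browser with this header alone is also not guaranteed.

```python
import urllib.request

# User-Agent string copied from the answer above; any current desktop
# browser string could be substituted.
BROWSER_UA = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/75.0.3770.90 Safari/537.36'
)


def build_browser_like_opener(user_agent=BROWSER_UA):
    """Return an opener whose requests carry a browser-like User-Agent.

    Hypothetical helper: it only wraps the two lines from the answer.
    """
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


def fetch_divs(url, css_class, opener=None):
    """Fetch `url` and return every <div> that has the given class.

    Hypothetical helper; performs a real network request when called.
    """
    # Imported here so the opener helper can be used without bs4 installed.
    from bs4 import BeautifulSoup

    if opener is None:
        opener = build_browser_like_opener()
    with opener.open(url) as response:
        soup = BeautifulSoup(response.read(), 'html.parser')
    return soup.find_all('div', {'class': css_class})
```

Calling `fetch_divs('https://www.google.com/search?q=NASDAQ:GOOGL', 'ZSM8k')` would then stand in for the `base = ...` line in the question. If the result is still empty, printing the parsed page and searching it for the expected class name is a quick way to check whether the server sent that markup at all.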
Answered By - Imperishable Night