Issue
I am trying to scrape a simple Yahoo Finance page. My code looks like this:
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/AMZN"
headers={'USER-AGENT': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,'lxml')
print(soup.prettify())
I expected an HTML document to be printed; instead, the output looks like random numbers and letters.
While debugging, I can see that response.status_code is 200 and that the soup contains the expected HTML document.
Even without going through BeautifulSoup, and just using:
url = "https://finance.yahoo.com/quote/AMZN"
headers={'USER-AGENT': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
response = requests.get(url,headers=headers)
print(response.text)
I am getting the same result.
Any idea what I am doing wrong?
I expected an HTML document, but I am getting random numbers and letters.
Solution
I think your code is actually working fine.
If you inspect the source of that page, there is an embedded JavaScript block that contains a large chunk of encoded text as the value of root.App.main.context.dispatcher.stores. That block looks very much like what you're showing in your question. There's around 1.5MB of data contained in that script.
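One way to check this yourself is to measure how much of the page sits inside script tags. The following is only a rough sketch: the helper name largest_scripts and the sample_html string are made up for illustration, and for the real page you would pass response.text instead.

```python
from bs4 import BeautifulSoup


def largest_scripts(html, top=3):
    """Return the character counts of the largest inline <script> bodies, biggest first."""
    soup = BeautifulSoup(html, "html.parser")
    # script.string is None for scripts with no inline body (e.g. src= only)
    sizes = [len(script.string or "") for script in soup.find_all("script")]
    return sorted(sizes, reverse=True)[:top]


# Hypothetical stand-in document; replace with response.text for the real page.
sample_html = (
    "<html><body><p>Quote</p>"
    "<script>var a = 1;</script>"
    "<script>root.App.main = '" + "x" * 500 + "';</script>"
    "</body></html>"
)
print(largest_scripts(sample_html))
```

If one entry dwarfs the others, that script is what fills your terminal.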
That means that any attempt to print the source of that page is going to generate too much output to be useful, but it doesn't prevent you from using BeautifulSoup to perform queries on the page:
>>> import bs4
>>> import requests
>>> res = requests.get('https://finance.yahoo.com/quote/AMZN')
>>> soup = bs4.BeautifulSoup(res.text, 'lxml')
>>> x = soup.find('span', string='Previous Close')
>>> x.find_parent().find_next_sibling().text
'147.42'
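If you still want to print the page and read it, one option is to drop the script and style tags before calling prettify(). This is a minimal sketch, not something from the original answer: readable_html and sample_html are hypothetical names, and the sample markup is a simplified stand-in rather than Yahoo's actual structure.

```python
from bs4 import BeautifulSoup


def readable_html(html):
    """Return prettified HTML with <script> and <style> elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()  # remove the tag and its contents from the tree
    return soup.prettify()


# Hypothetical stand-in document; replace with response.text for the real page.
sample_html = (
    "<html><head><style>td { color: red; }</style></head><body>"
    "<table><tr><td><span>Previous Close</span></td><td>147.42</td></tr></table>"
    "<script>root.App.main = '" + "x" * 500 + "';</script>"
    "</body></html>"
)
print(readable_html(sample_html))
```

The visible markup is left in place, so queries such as soup.find('span', string='Previous Close') behave the same way on the stripped document.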
Answered By - larsks