Issue
I want to scrape data off of this website:
https://www.gurufocus.com/stock/AAPL/
This is the current version of the webscraper:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
ls = ['Ticker', 'PE Ratio', 'Gross Margin %', 'Debt-to-EBITDA', 'GF Value Rank', 'Financial Strength', 'Profitability Rank']
symbols = ['AAPL', 'TSLA']
df = pd.DataFrame(columns=ls)
for t in symbols:
    req = requests.get("https://www.gurufocus.com/stock/" + t)
    if req.status_code != 200:
        continue
    soup = BeautifulSoup(req.content, 'html.parser')
    scores = [t]
    for val in ls[1:]:
        scores.append(soup.find('a', string=re.compile(val)).find_next('td').text)
    df.loc[len(df)] = scores
df
And this is the output that I get:
The ordinary ratios are obtained correctly, but the GF Value Rank, the Financial Strength and the Profitability Rank could not be obtained with the program above.
I inspected the page source of the GuruFocus website and came across div sections with id="financial-strength" and id="profitability", but I'm not sure how to extract the scores from them.
As for the GF Value Rank, I only found a span element that contains the score, but no td entry or anything similar.
How do I need to change my code to obtain the last three scores in my table?
Solution
This is a fairly complex page, so you should use CSS selectors (or, probably even better, XPath with lxml, though I haven't tried that on this page).
So try it this way:
import pandas as pd
import requests
from bs4 import BeautifulSoup
# values are appended in page order below, so ls lists the columns in that same order
ls = ['Ticker', 'Debt-to-EBITDA', 'Gross Margin %', 'PE Ratio', 'Financial Strength', 'Profitability Rank', 'GF Value Rank']
symbols = ['AAPL', 'TSLA']
rows = []
for t in symbols:
    req = requests.get("https://www.gurufocus.com/stock/" + t)
    if req.status_code != 200:
        continue
    soup = BeautifulSoup(req.content, 'html.parser')
    scores = [t]
    # this is where the CSS selectors come in: you have to use 4 of them
    # because this is how the data is distributed in the page
    measures = [mea.text.strip() for mea in soup.select('td.t-caption > a')]
    vals = [va.text.strip() for va in soup.select('td.t-caption span.p-l-sm')]
    measures2 = [mea.text.strip() for mea in soup.select('h2.t-h6 > a')]
    vals2 = [va.text.strip() for va in soup.select('div.flex.flex-center span.t-body-sm.m-l-md')]
    all_meas = measures + measures2
    all_vals = vals + vals2
    # keep only the measures we asked for, pairing each label with its value
    for m, v in zip(all_meas, all_vals):
        if m in ls:
            scores.append(v)
    rows.append(scores)
df = pd.DataFrame(rows, columns=ls)
df
Output:
   Ticker  Debt-to-EBITDA  Gross Margin %  PE Ratio  Financial Strength  Profitability Rank  GF Value Rank
0    AAPL            0.91           43.31     27.93                7/10               10/10           6/10
1    TSLA            0.47            27.1    106.39                8/10                5/10           9/10
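One caveat with the loop above: the values are appended in whatever order the page lists them, so the columns only line up if ls follows that same order, and a missing measure makes the row too short for the DataFrame. A minimal sketch of a more order-independent variant is to look each column up by name. The helper name build_row is hypothetical (it is not part of the answer above), and the sample lists only stand in for all_meas and all_vals:

```python
def build_row(ticker, measures, values, columns):
    """Return one table row ordered like `columns` (first column is the ticker)."""
    # Keep the first value seen for each label, in case a label repeats on the page.
    found = {}
    for label, value in zip(measures, values):
        found.setdefault(label, value)
    # A label that was not found becomes None instead of shortening the row.
    return [ticker] + [found.get(col) for col in columns[1:]]


columns = ['Ticker', 'PE Ratio', 'Gross Margin %', 'GF Value Rank']
# Illustrative lists standing in for all_meas / all_vals scraped from the page.
measures = ['Gross Margin %', 'Some Other Metric', 'PE Ratio']
values = ['43.31', '1.23', '27.93']
print(build_row('AAPL', measures, values, columns))
# ['AAPL', '27.93', '43.31', None]
```

In the scraper, this would replace the zip loop with something like rows.append(build_row(t, all_meas, all_vals, ls)), assuming the labels on the page match the names in ls exactly.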
Answered By - Jack Fleeting