Friday, August 19, 2022

[FIXED] Webscraper Only Obtains Some Data

beautifulsoup, pandas, python, web-scraping

Issue

I want to scrape data off of this website:

https://www.gurufocus.com/stock/AAPL/

This is the current version of the webscraper:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

ls=['Ticker','PE Ratio','Gross Margin %','Debt-to-EBITDA','GF Value Rank','Financial Strength','Profitability Rank']
symbols = ['AAPL', 'TSLA']
df = pd.DataFrame(columns=ls)

for t in symbols:
    req = requests.get("https://www.gurufocus.com/stock/"+t)
    if req.status_code !=200:
        continue
    soup = BeautifulSoup(req.content, 'html.parser')
    scores = [t]
    for val in ls[1:]:
        scores.append(soup.find('a', string=re.compile(val)).find_next('td').text)
    df.loc[len(df)] = scores
df

And this is the output that I get:

The plain ratios are obtained correctly, but the GF Value Rank, Financial Strength and Profitability Rank could not be obtained with the program above.

I inspected the page source of the GuruFocus website and came across div sections with id="financial-strength" and id="profitability", but I'm not sure how to extract the scores from them.

As for the GF Value Rank, I only found a span element that contains the score, but nothing like a td entry or anything similar.

How do I need to change my code to obtain the last three scores in my table?

Solution

This is a fairly complex page, so you should use CSS selectors (or, probably even better, XPath with lxml, though I haven't tried that here).
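To illustrate how such selectors behave, here is a minimal sketch on a small static fragment. The HTML is hypothetical: it only borrows the class names t-caption and p-l-sm from the selectors used in this answer and is not a copy of the real GuruFocus markup. It shows the difference between the child combinator (>) and the descendant combinator (a space).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; not the real page markup.
html = """
<table>
  <tr>
    <td class="t-caption"><a href="#">PE Ratio</a></td>
    <td class="t-caption"><span class="p-l-sm">27.93</span></td>
  </tr>
  <tr>
    <td class="t-caption"><a href="#">Gross Margin %</a></td>
    <td class="t-caption"><div><span class="p-l-sm">43.31</span></div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "td.t-caption > a": only <a> tags that are direct children of the cell.
names = [a.text.strip() for a in soup.select("td.t-caption > a")]

# "td.t-caption span.p-l-sm": spans at any depth inside the cell.
values = [s.text.strip() for s in soup.select("td.t-caption span.p-l-sm")]

# Same spans, but restricted to direct children: the nested one is skipped.
child_only = [s.text.strip() for s in soup.select("td.t-caption > span.p-l-sm")]

print(names)
print(values)
print(child_only)
```

The descendant form is the more forgiving choice when the value may be wrapped in extra elements, which is presumably why it is used for the value spans here.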

So try it this way:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Values are appended in the order they are found on the page,
# so the column order in ls has to match that order.
ls = ['Ticker', 'Debt-to-EBITDA', 'Gross Margin %', 'PE Ratio', 'Financial Strength', 'Profitability Rank', 'GF Value Rank']
symbols = ['AAPL', 'TSLA']
rows = []
for t in symbols:
    req = requests.get("https://www.gurufocus.com/stock/" + t)
    # skip tickers whose page could not be fetched
    if req.status_code != 200:
        continue
    soup = BeautifulSoup(req.content, 'html.parser')
    scores = [t]

    # this is where the css selectors come in - you have to use 4 of these
    # because this is how the data is distributed in the page

    # ratio names and their values
    measures = [mea.text.strip() for mea in soup.select('td.t-caption > a')]
    vals = [va.text.strip() for va in soup.select('td.t-caption span.p-l-sm')]
    # rank names and their x/10 scores
    measures2 = [mea.text.strip() for mea in soup.select('h2.t-h6 > a')]
    vals2 = [va.text.strip() for va in soup.select('div.flex.flex-center span.t-body-sm.m-l-md')]

    all_meas = measures + measures2
    all_vals = vals + vals2

    # keep only the measures that are wanted as columns
    for m, v in zip(all_meas, all_vals):
        if m in ls:
            scores.append(v)
    rows.append(scores)
df = pd.DataFrame(rows, columns=ls)
df

Output:

   Ticker  Debt-to-EBITDA  Gross Margin %  PE Ratio  Financial Strength  Profitability Rank  GF Value Rank
0  AAPL    0.91            43.31           27.93     7/10                10/10               6/10
1  TSLA    0.47            27.1            106.39    8/10                5/10                9/10
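Because the loop above appends values in the order they appear on the page, a metric that is missing or moved on some ticker's page may leave a row too short or put values under the wrong header. One possible way to guard against that, sketched here with plain Python and made-up sample data, is to collect the name/value pairs in a dict and then read them out in the column order you want. The helper name build_row is hypothetical and not part of the original answer.

```python
def build_row(ticker, names, values, columns):
    """Return one row ordered like `columns`, with None for missing metrics."""
    found = {}
    for name, value in zip(names, values):
        # keep the first occurrence if a name shows up more than once
        found.setdefault(name, value)
    # columns[0] is the ticker column; look up the rest by name
    return [ticker] + [found.get(col) for col in columns[1:]]


# Made-up sample data: the page order differs from the column order,
# one unwanted metric is present and one wanted metric is missing.
columns = ['Ticker', 'PE Ratio', 'Gross Margin %', 'GF Value Rank']
names = ['Gross Margin %', 'PE Ratio', 'Some Other Metric']
values = ['43.31', '27.93', '1.23']

row = build_row('AAPL', names, values, columns)
print(row)  # ['AAPL', '27.93', '43.31', None]
```

With this approach the column list can be written in any order, and a metric that is not found shows up as an empty cell instead of shifting the other values.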

Answered By - Jack Fleeting

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
