Sunday, April 10, 2022

[FIXED] Trying to extract a table from webpage using BeautifulSoup (table inconsistent with real data)

beautifulsoup, dataframe, html, pandas, python

Issue

So far I have loaded the page in my notebook and parsed it with Beautiful Soup:

html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')

Then I tried to build a table containing only the revenue (Tesla Quarterly Revenue), omitting NaN values:

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.find('table', attrs={'class': 'historical_data_table table'})
for result in table:
    if table.find('th').getText().startswith("Tesla Quarterly Revenue"):
        for row in result.find_all('tbody').find_all("tr"):
            col = row.find("td")
            if len(col) != 2: continue
            Date = col[0].text
            Revenue = col[1].text
            tesla_revenue = tesla_revenue.append({"Date":Date,  "Revenue":Revenue}, ignore_index=True)

tesla_revenue = tesla_revenue.apply (pd.to_numeric, errors='coerce')
tesla_revenue = tesla_revenue.dropna()

Then, when I tried to print the tail of the table, I just got this:

| Date | Revenue |

(only the headers)

I think I might have done something wrong when I built my table, but I can't be sure. Any help would be appreciated.

Solution

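There are a few mistakes in this code, but the main problem is that the HTML contains 4 tables and you use find('table', ...) instead of find_all('table', ...). find() returns only the first matching table, while the quarterly revenue is in another one (probably the second). Also, row.find("td") returns a single cell rather than a list of cells, so it has to be row.find_all("td") before you can check len(col) or index col[0] and col[1].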
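To see the difference between find() and find_all() without hitting the network, here is a minimal sketch on made-up HTML. The SAMPLE_HTML string and its values are invented for illustration and are not the real page content. It uses the built-in 'html.parser' and matches on a single class name, both of which are my own simplifications.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page: two tables share the same classes.
SAMPLE_HTML = """
<table class="historical_data_table table">
  <thead><tr><th colspan="2">Tesla Annual Revenue</th></tr></thead>
  <tbody><tr><td>2021</td><td>$100</td></tr></tbody>
</table>
<table class="historical_data_table table">
  <thead><tr><th colspan="2">Tesla Quarterly Revenue</th></tr></thead>
  <tbody>
    <tr><td>2021-12-31</td><td>$30</td></tr>
    <tr><td>2021-09-30</td><td>$25</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match, which here is the annual table.
first = soup.find("table", class_="historical_data_table")
first_title = first.find("th").get_text(strip=True)

# find_all() returns every match, so the quarterly table can be picked by its header.
all_tables = soup.find_all("table", class_="historical_data_table")
titles = [t.find("th").get_text(strip=True) for t in all_tables]

quarterly = next(
    t for t in all_tables
    if t.find("th").get_text(strip=True).startswith("Tesla Quarterly Revenue")
)

# Keep only rows with exactly two <td> cells; the header row has none.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in quarterly.find_all("tr")]
rows = [r for r in rows if len(r) == 2]

print(first_title)
print(titles)
print(rows)
```

The full script for the actual page follows.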

import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(response.text, 'lxml')

# find_all() returns every matching table; find() would return only the first one
all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})

# Collect rows in a list first; DataFrame.append() was removed in pandas 2.0
rows = []

for table in all_tables:
    header = table.find('th')
    if header and header.getText().startswith("Tesla Quarterly Revenue"):
        for row in table.find_all("tr"):
            col = row.find_all("td")
            if len(col) == 2:  # skips the header row, which has no <td> cells
                date = col[0].text
                revenue = col[1].text.replace('$', '').replace(',', '')
                rows.append({"Date": date, "Revenue": revenue})

tesla_revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])

# Left commented out: pd.to_numeric on every column would turn the Date strings
# into NaN, and dropna() would then remove every row
#tesla_revenue = tesla_revenue.apply(pd.to_numeric, errors='coerce')
#tesla_revenue = tesla_revenue.dropna()

print(tesla_revenue)
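The two commented-out lines at the end of the script are worth a note. If numeric revenue values are still wanted, one option is to convert only the Revenue column instead of the whole DataFrame. The sketch below uses made-up rows in the shape the loop produces (the values and the empty cell are assumptions for illustration, not data from the page).

```python
import pandas as pd

# Made-up rows shaped like the scraped ones: strings, with one empty revenue cell.
rows = [
    {"Date": "2021-12-31", "Revenue": "30"},
    {"Date": "2021-09-30", "Revenue": "25"},
    {"Date": "2009-06-30", "Revenue": ""},
]
df = pd.DataFrame(rows, columns=["Date", "Revenue"])

# Applying pd.to_numeric to every column coerces the date strings to NaN,
# so dropna() then removes every row and only the headers are left.
everything = df.apply(pd.to_numeric, errors="coerce").dropna()

# Converting only the Revenue column keeps the dates intact;
# dropna(subset=...) drops just the rows with no usable revenue.
cleaned = df.copy()
cleaned["Revenue"] = pd.to_numeric(cleaned["Revenue"], errors="coerce")
cleaned = cleaned.dropna(subset=["Revenue"])

print(everything)
print(cleaned)
```

This would also explain the empty table in the question: once the Date column is coerced to NaN, dropna() has nothing left to keep.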

Answered By - furas

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
