Issue
I have tried to scrape a table with BeautifulSoup.
Out of my 4 attempts, the first 3 are not working and I don't know why!
In the fourth approach I tried pandas,
but the results are not specific anymore.
import requests
import bs4
res = requests.get(
"https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html")
soup = bs4.BeautifulSoup(res.text, 'lxml')
# 1st try: a CSS selector copied from inspect element
# (find_all does not accept CSS selectors; soup.select is needed here)
table = soup.select(
    '#mc_content > section > section > div.clearfix.stat_container > div.columnst.FR.wbg.brdwht > div > div.bsr_table.hist_tbl_hm.PR.Ohidden')
print(table)
# 2nd try: matching the class attribute directly
# (class names are space-separated, not dot-separated)
table = soup.find_all(
    'div', attrs={'class': 'bsr_table hist_tbl_hm PR Ohidden'})
print(table)
# 3rd try: conventional row-by-row extraction
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
import pandas as pd
# 4th try: pandas
dfs = pd.read_html(
'https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html')
for df in dfs:
print(df)
Output I got:
0 Hindustan Aeron Add to Watchlist | Portfolio... ... 627.53
1 5-Day ... NaN
2 10-Day ... NaN
3 30-Day ... NaN
4 3-Day ... NaN
5 5-Day ... NaN
6 8-Day ... NaN
7 TAAL Enterprise Add to Watchlist | Portfolio... ... 135.34
8 5-Day ... NaN
9 10-Day ... NaN
10 30-Day ... NaN
11 3-Day ... NaN
12 5-Day ... NaN
13 8-Day ... NaN
14 Taneja Aerospac Add to Watchlist | Portfolio... ... 21.76
15 5-Day ... NaN
16 10-Day ... NaN
17 30-Day ... NaN
18 3-Day ... NaN
19 5-Day ... NaN
20 8-Day ... NaN
But the output I want
is a DataFrame with the columns 1) Open, 2) High, 3) Low, 4) Price, 5) Current Price, 6) Percent Change, 7) Sector: aerospace-defence.
Thanks for answering this doubt and contributing to it.
Solution
Basically, the page is loaded via JavaScript,
so the requests module alone won't work:
it cannot execute the JS that renders
the table dynamically once the page loads.
You can use selenium
for such a task. Otherwise, you could use HTMLSession
from the requests_html
module, which renders the JavaScript
on the fly.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import pandas as pd

options = Options()
options.add_argument('--headless')  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html")
# the DOM is now fully rendered, so pandas can parse the table from it
df = pd.read_html(driver.page_source)[0]
print(df)
df.to_csv("result.csv", index=False)
driver.quit()
Output:
Company Name ... 5 Day Performance Volume Lower Circuit Upper Circuit VWAP SMA Deliverables P/E P/B
0 Hindustan Aeron Add to Watchlist | Portfolio... ... 02-Mar-20 666.10 -18.45 (-2.7%) 03-Mar-20 6...
1 TAAL Enterprise Add to Watchlist | Portfolio... ... 02-Mar-20 160.00 -0.3 (-0.19%) 03-Mar-20 15...
2 Taneja Aerospac Add to Watchlist | Portfolio... ... 02-Mar-20 24.90 0.55 (2.26%) 03-Mar-20 24.4...
[3 rows x 9 columns]
Note: pay attention that
pd.read_html
actually returns a list, so you don't need to loop over it; you can just index it, e.g. dfs[0].
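To illustrate, here is a tiny self-contained example (the HTML snippet is made up) showing that pd.read_html always returns a list of DataFrames, which you index rather than loop over:

```python
import io
import pandas as pd

# a minimal, made-up HTML table just to demonstrate the return type
html = """
<table>
  <tr><th>Open</th><th>High</th><th>Low</th></tr>
  <tr><td>660.0</td><td>670.5</td><td>655.2</td></tr>
</table>
"""

dfs = pd.read_html(io.StringIO(html))  # always a *list* of DataFrames
df = dfs[0]                            # index it instead of looping
print(type(dfs).__name__)  # list
print(df)
```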
Answered By - αԋɱҽԃ αмєяιcαη