Issue
I have tried to scrape a table with BeautifulSoup.
Out of my 4 attempts, the first 3 are not working and I don't know why!
In the fourth approach I tried pandas,
but the results are not specific anymore.
import requests
import bs4
res = requests.get(
"https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html")
soup = bs4.BeautifulSoup(res.text, 'lxml')
# 1st try: a CSS selector copied from inspect element
# (find_all does not accept CSS selectors; soup.select is needed here)
table = soup.select(
    '#mc_content > section > section > div.clearfix.stat_container > div.columnst.FR.wbg.brdwht > div > div.bsr_table.hist_tbl_hm.PR.Ohidden')
print(table)
# 2nd try: matching the class attribute directly
# (class names are space-separated, not dot-separated)
table = soup.find_all(
    'div', attrs={'class': 'bsr_table hist_tbl_hm PR Ohidden'})
print(table)
# 3rd try: conventional row-by-row extraction
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
import pandas as pd
# 4th try: pandas
dfs = pd.read_html(
'https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html')
for df in dfs:
print(df)
Output I got:
0 Hindustan Aeron Add to Watchlist | Portfolio... ... 627.53
1 5-Day ... NaN
2 10-Day ... NaN
3 30-Day ... NaN
4 3-Day ... NaN
5 5-Day ... NaN
6 8-Day ... NaN
7 TAAL Enterprise Add to Watchlist | Portfolio... ... 135.34
8 5-Day ... NaN
9 10-Day ... NaN
10 30-Day ... NaN
11 3-Day ... NaN
12 5-Day ... NaN
13 8-Day ... NaN
14 Taneja Aerospac Add to Watchlist | Portfolio... ... 21.76
15 5-Day ... NaN
16 10-Day ... NaN
17 30-Day ... NaN
18 3-Day ... NaN
19 5-Day ... NaN
20 8-Day ... NaN
But the output I want
is a DataFrame with the columns 1) Open, 2) High, 3) Low, 4) Price, 5) Current Price, 6) Percent Change, 7) Sector: aerospace-defence.
Thanks for answering this doubt and contributing to it.
Solution
Basically, the page is loaded via JavaScript,
so the requests module alone won't work:
it cannot execute the JS that renders
the table dynamically once the page loads.
You can use selenium
for such a task. Otherwise, you could use HTMLSession
from the requests_html
module, which renders the JavaScript
on the fly.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import pandas as pd

options = Options()
options.add_argument('--headless')  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://www.moneycontrol.com/stocks/marketstats/industry-classification/bse/aerospace-defence.html")
# the DOM is now fully rendered, so pandas can parse the table from it
df = pd.read_html(driver.page_source)[0]
print(df)
df.to_csv("result.csv", index=False)
driver.quit()
Output:
Company Name ... 5 Day Performance Volume Lower Circuit Upper Circuit VWAP SMA Deliverables P/E P/B
0 Hindustan Aeron Add to Watchlist | Portfolio... ... 02-Mar-20 666.10 -18.45 (-2.7%) 03-Mar-20 6...
1 TAAL Enterprise Add to Watchlist | Portfolio... ... 02-Mar-20 160.00 -0.3 (-0.19%) 03-Mar-20 15...
2 Taneja Aerospac Add to Watchlist | Portfolio... ... 02-Mar-20 24.90 0.55 (2.26%) 03-Mar-20 24.4...
[3 rows x 9 columns]
Note: pay attention that
pd.read_html
actually returns a list, so you don't need to loop over it; you can just index it, e.g. dfs[0].
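To illustrate, here is a tiny self-contained example (the HTML snippet is made up) showing that pd.read_html always returns a list of DataFrames, which you index rather than loop over:

```python
import io
import pandas as pd

# a minimal, made-up HTML table just to demonstrate the return type
html = """
<table>
  <tr><th>Open</th><th>High</th><th>Low</th></tr>
  <tr><td>660.0</td><td>670.5</td><td>655.2</td></tr>
</table>
"""

dfs = pd.read_html(io.StringIO(html))  # always a *list* of DataFrames
df = dfs[0]                            # index it instead of looping
print(type(dfs).__name__)  # list
print(df)
```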
Answered By - αԋɱҽԃ αмєяιcαη