Issue
I want to scrape the table from this website "" and, since it is updated hourly, I also want to track changes. I tried scraping the data with Selenium, but everything ended up in a single column with no table structure. How can I use pandas and Beautiful Soup to scrape the table in a structured format and also track changes? This is the code I'm trying to figure out.
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'id': 'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    # avoid reusing the loop variable name `tr` inside the comprehension
    row = [cell.text.strip() for cell in td if cell.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Notice No", "Subject", "Segment Name", "Category Name", "Department", "PDF"])
print(df)
It would be a great help if you could show me how to get the data and how to keep track of new data whenever I run the script again.
Solution
Note that you don't actually need to include params, since the desired information is presented on the main page; I've left them in in case you want to scrape a different id.

Also note that I skipped the PDF column, as it would only show NaN values: the PDF links are not hyperlinks, just a logo icon stored on the server. Clicking the PDF logo fires a POST request to the target to download the file. Without more specific requirements from you, here's an answer that covers what you asked for.
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

params = {
    'id': 0,
    'txtscripcd': '',
    'pagecont': '',
    'subject': ''
}


def main(url):
    r = requests.get(url, params=params, headers=headers)
    # read_html parses every table on the page; the notices table is the last one,
    # and iloc[:, :-1] drops the final column (the PDF logo, which would be all NaN)
    df = pd.read_html(r.content)[-1].iloc[:, :-1]
    print(df)


main("https://www.bseindia.com/markets/MarketInfo/NoticesCirculars.aspx")
Output:
Notice No Subject Segment Name Category Name Department
0 20200923-2 Offer to Buy – Acquisition Window (Delisting) ... Equity Trading Trading Operations
1 20200923-1 Change in Name of the Company. Debt Company related Listing Operations
Answered By - αԋɱҽԃ αмєяιcαη