Tuesday, October 4, 2022

[FIXED] Beautiful soup fining elements of a table that are of different sizes

October 04, 2022 beautifulsoup, python No comments

Issue

I have a page that has Auction details I like to extract. The data from these tables changes in size because there are times that not all data points are available. Is there a way to create missing columns even if they are missing in order to keep the table consistent.

This is an example of a table that has some data points missing

<div class="AUCTION_DETAILS">
    <table class="ad_tab" tabindex="0">
        <tbody>
            <tr>
                <th class="AD_LBL" scope="row">Auction Type:</th>
                <td class="AD_DTA">FORECLOSURE</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row" aria-label="Case Number">Case #:</th>
                <td class="AD_DTA"><a
                        href="/index.cfm?zaction=auction&amp;zmethod=details&amp;AID=1173868&amp;bypassPage=1">2009-020878-CA-01</a>
                </td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Final Judgment Amount:</th>
                <td class="AD_DTA">$183,284.23</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Parcel ID:</th>
                <td class="AD_DTA"></td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Plaintiff Max Bid:</th>
                <td class="AD_DTA ASTAT_MSGPB">Hidden</td>
            </tr>
        </tbody>
    </table>
</div>

This is an example of a table that has all the data points

<div class="AUCTION_DETAILS">
    <table class="ad_tab" tabindex="0">
        <tbody>
            <tr>
                <th class="AD_LBL" scope="row">Auction Type:</th>
                <td class="AD_DTA">FORECLOSURE</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row" aria-label="Case Number">Case #:</th>
                <td class="AD_DTA"><a
                        href="/index.cfm?zaction=auction&amp;zmethod=details&amp;AID=1312124&amp;bypassPage=1">2013-037297-CA-01</a>
                </td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Final Judgment Amount:</th>
                <td class="AD_DTA">$242,181.24</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Parcel ID:</th>
                <td class="AD_DTA">04-3106-030-0070</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Property Address:</th>
                <td class="AD_DTA">131 W 41 ST</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row"></th>
                <td class="AD_DTA">HIALEAH, FL- 33012</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Assessed Value:</th>
                <td class="AD_DTA">$201,701.00</td>
            </tr>
            <tr>
                <th class="AD_LBL" scope="row">Plaintiff Max Bid:</th>
                <td class="AD_DTA ASTAT_MSGPB">Hidden</td>
            </tr>
        </tbody>
    </table>
</div>

The short table is missing all the fields below in this example. But that may vary PropertyAddress, PropertyAddress1, AssessedValue, PlaintiffMaxBid

How would I extract existing columns and create NaN columns for the missing columns so table size is consistent.

Below are 2 samples one that has all data points and an other that has missing data points

No_Data_Missing

Missing Data Points

Solution

EDIT

BeautifulSoup - Scrape your tables, store results in a list of dicts and create a dataframe - pandas will fill values from non existing keys with NaN - See also full example.

for table in soup.select('table.ad_tab'):
    data.append(dict([(row.th.text,row.td.text) for row in table.select('tr') ]))

Pandas - Iterate list of urls, pd.read_html() your tables, pd.tranpose() and finally concate them:

import pandas as pd
urls = ['https://projectcodesamples.s3.amazonaws.com/CompleteFile.html','https://projectcodesamples.s3.amazonaws.com/shortFile.html']
pd.concat([df.set_index(0).T for url in urls for df in pd.read_html(url, attrs={'class':'ad_tab'})],ignore_index=True)

Example

from bs4 import BeautifulSoup
import pandas as pd
import requests

urls = ['https://projectcodesamples.s3.amazonaws.com/CompleteFile.html','https://projectcodesamples.s3.amazonaws.com/shortFile.html']

data = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    for table in soup.select('table.ad_tab'):
            data.append(dict([(row.th.text,row.td.text) for row in table.select('tr') ]))


pd.DataFrame(data)

	Auction Type:	Case #:	Final Judgment Amount:	Parcel ID:	Plaintiff Max Bid:	Property Address:		Assessed Value:
0	FORECLOSURE	2009-020878-CA-01	$183,284.23		Hidden	nan	nan	nan
1	FORECLOSURE	2013-037297-CA-01	$242,181.24	04-3106-030-0070	Hidden	131 W 41 ST	HIALEAH, FL- 33012	$201,701.00
...	...	...	...	...	...	...	...	...
38	FORECLOSURE	2015-011730-CA-01	$558,875.05		nan	nan	nan	Hidden
39	FORECLOSURE	2015-012472-CA-01	$595,150.12		nan	nan	nan	Hidden

It is also possible to append() a new dict as row to your existing dataframe:

df.append({'Auction Type:': 'some value',
  'Case #:': 'some value',
  'Final Judgment Amount:': 'some value',
  'Parcel ID:': 'some value',
  'Plaintiff Max Bid:': ''}, ignore_index=True)

	Auction Type:	Case #:	Final Judgment Amount:	Parcel ID:	Plaintiff Max Bid:	Property Address:		Assessed Value:
0	FORECLOSURE	2009-020878-CA-01	$183,284.23		Hidden	nan	nan	nan
1	FORECLOSURE	2013-037297-CA-01	$242,181.24	04-3106-030-0070	Hidden	131 W 41 ST	HIALEAH, FL- 33012	$201,701.00
2	some value	some value	some value	some value		nan	nan	nan

Answered By - HedgeHog

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, October 4, 2022

[FIXED] Beautiful soup fining elements of a table that are of different sizes

Issue

Solution

EDIT

Example

0 comments:

Post a Comment

Popular Posts

Labels