Issue
I have a page that has Auction details I like to extract. The data from these tables changes in size because there are times that not all data points are available. Is there a way to create missing columns even if they are missing in order to keep the table consistent.
This is an example of a table that has some data points missing
<div class="AUCTION_DETAILS">
<table class="ad_tab" tabindex="0">
<tbody>
<tr>
<th class="AD_LBL" scope="row">Auction Type:</th>
<td class="AD_DTA">FORECLOSURE</td>
</tr>
<tr>
<th class="AD_LBL" scope="row" aria-label="Case Number">Case #:</th>
<td class="AD_DTA"><a
href="/index.cfm?zaction=auction&zmethod=details&AID=1173868&bypassPage=1">2009-020878-CA-01</a>
</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Final Judgment Amount:</th>
<td class="AD_DTA">$183,284.23</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Parcel ID:</th>
<td class="AD_DTA"></td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Plaintiff Max Bid:</th>
<td class="AD_DTA ASTAT_MSGPB">Hidden</td>
</tr>
</tbody>
</table>
</div>
This is an example of a table that has all the data points
<div class="AUCTION_DETAILS">
<table class="ad_tab" tabindex="0">
<tbody>
<tr>
<th class="AD_LBL" scope="row">Auction Type:</th>
<td class="AD_DTA">FORECLOSURE</td>
</tr>
<tr>
<th class="AD_LBL" scope="row" aria-label="Case Number">Case #:</th>
<td class="AD_DTA"><a
href="/index.cfm?zaction=auction&zmethod=details&AID=1312124&bypassPage=1">2013-037297-CA-01</a>
</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Final Judgment Amount:</th>
<td class="AD_DTA">$242,181.24</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Parcel ID:</th>
<td class="AD_DTA">04-3106-030-0070</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Property Address:</th>
<td class="AD_DTA">131 W 41 ST</td>
</tr>
<tr>
<th class="AD_LBL" scope="row"></th>
<td class="AD_DTA">HIALEAH, FL- 33012</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Assessed Value:</th>
<td class="AD_DTA">$201,701.00</td>
</tr>
<tr>
<th class="AD_LBL" scope="row">Plaintiff Max Bid:</th>
<td class="AD_DTA ASTAT_MSGPB">Hidden</td>
</tr>
</tbody>
</table>
</div>
The short table is missing all the fields below in this example. But that may vary PropertyAddress, PropertyAddress1, AssessedValue, PlaintiffMaxBid
How would I extract existing columns and create NaN columns for the missing columns so table size is consistent.
Below are 2 samples one that has all data points and an other that has missing data points
Solution
EDIT
BeautifulSoup
- Scrape your tables, store results in a list
of dicts
and create a dataframe
- pandas
will fill values from non existing keys with NaN
- See also full example.
for table in soup.select('table.ad_tab'):
data.append(dict([(row.th.text,row.td.text) for row in table.select('tr') ]))
Pandas
- Iterate list
of urls, pd.read_html()
your tables, pd.tranpose()
and finally concate
them:
import pandas as pd
urls = ['https://projectcodesamples.s3.amazonaws.com/CompleteFile.html','https://projectcodesamples.s3.amazonaws.com/shortFile.html']
pd.concat([df.set_index(0).T for url in urls for df in pd.read_html(url, attrs={'class':'ad_tab'})],ignore_index=True)
Example
from bs4 import BeautifulSoup
import pandas as pd
import requests
urls = ['https://projectcodesamples.s3.amazonaws.com/CompleteFile.html','https://projectcodesamples.s3.amazonaws.com/shortFile.html']
data = []
for url in urls:
response = requests.get(url)
soup = BeautifulSoup(response.text)
for table in soup.select('table.ad_tab'):
data.append(dict([(row.th.text,row.td.text) for row in table.select('tr') ]))
pd.DataFrame(data)
Auction Type: | Case #: | Final Judgment Amount: | Parcel ID: | Plaintiff Max Bid: | Property Address: | Assessed Value: | ||
---|---|---|---|---|---|---|---|---|
0 | FORECLOSURE | 2009-020878-CA-01 | $183,284.23 | Hidden | nan | nan | nan | |
1 | FORECLOSURE | 2013-037297-CA-01 | $242,181.24 | 04-3106-030-0070 | Hidden | 131 W 41 ST | HIALEAH, FL- 33012 | $201,701.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
38 | FORECLOSURE | 2015-011730-CA-01 | $558,875.05 | nan | nan | nan | Hidden | |
39 | FORECLOSURE | 2015-012472-CA-01 | $595,150.12 | nan | nan | nan | Hidden |
It is also possible to append()
a new dict
as row to your existing dataframe
:
df.append({'Auction Type:': 'some value',
'Case #:': 'some value',
'Final Judgment Amount:': 'some value',
'Parcel ID:': 'some value',
'Plaintiff Max Bid:': ''}, ignore_index=True)
Auction Type: | Case #: | Final Judgment Amount: | Parcel ID: | Plaintiff Max Bid: | Property Address: | Assessed Value: | ||
---|---|---|---|---|---|---|---|---|
0 | FORECLOSURE | 2009-020878-CA-01 | $183,284.23 | Hidden | nan | nan | nan | |
1 | FORECLOSURE | 2013-037297-CA-01 | $242,181.24 | 04-3106-030-0070 | Hidden | 131 W 41 ST | HIALEAH, FL- 33012 | $201,701.00 |
2 | some value | some value | some value | some value | nan | nan | nan |
Answered By - HedgeHog
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.