Issue
I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with Python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.
Currently, parse_thead does the following when the <thead> tag is present:
- In BeautifulSoup, I get table objects with doc.find_all('table'), and I can use table.find_all('thead').
- In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead').
parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.
I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".
I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either
- Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
- Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements.
(With either alternative, it seems like I really need to detect the head-body boundary in this way.)
I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.
(Edit: Examples with no header rows and multiple header rows. I can't assume it has only one header row.)
<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>
Solution
We can use <th> tags to detect headers, in case the table doesn't contain <thead> tags. If all columns of a row are <th> tags then we can assume that it is a header. Based on that, I created a function that identifies the header and body.
Code for BeautifulSoup:
def parse_table(table):
    head_body = {'head': [], 'body': []}
    for tr in table.select('tr'):
        # Look only at the row's direct child tags (its cells).
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body
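Code for lxml:
def parse_table(table):
    head_body = {'head': [], 'body': []}
    # cssselect() needs the separate cssselect package to be installed.
    for tr in table.cssselect('tr'):
        # Iterating an element yields its children (getchildren() is deprecated).
        if all(t.tag == 'th' for t in tr):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body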
The table parameter is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary that contains two lists of <tr> tags, the header and body rows.
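Note that parse_table puts every all-<th> row into the header wherever it occurs, and an empty <tr> also counts because all() of an empty sequence is True. The question asks for something stricter: only the leading run of header rows. A minimal sketch of that rule with Beautiful Soup could look like the following; the names is_header_row and split_leading_header are hypothetical, not part of the original answer.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def is_header_row(tr):
    """True if the row has at least one cell and every cell is a <th>."""
    cells = tr.find_all(['th', 'td'], recursive=False)
    return bool(cells) and all(cell.name == 'th' for cell in cells)


def split_leading_header(table):
    """Put leading all-<th> rows in 'head' and every later row in 'body'."""
    rows = table.find_all('tr')
    # takewhile stops at the first row that is not a header row.
    head = list(takewhile(is_header_row, rows))
    return {'head': head, 'body': rows[len(head):]}


html = """
<table>
  <tr><th>Rank</th><th>Score</th></tr>
  <tr><th>#</th><th>runs</th></tr>
  <tr><td>1</td><td>437</td></tr>
  <tr><th>Total</th><th>437</th></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')
parts = split_leading_header(table)
print(len(parts['head']), len(parts['body']))  # 2 2
```

Here the trailing "Total" row stays in the body even though all of its cells are <th>, and a table with no header rows simply yields an empty 'head' list.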
Usage example with the Beautiful Soup version of parse_table:
from bs4 import BeautifulSoup

html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)
print(table_rows)
#{'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
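The other alternative raised in the question, inserting <thead> and <tbody> tags so that existing code such as table.find_all('thead') keeps working, can be sketched in Beautiful Soup as well. This is only an illustration under stated assumptions: add_thead_tbody is a hypothetical helper, it handles only rows that are direct children of the table, and it leaves the table alone if a <thead> or <tbody> is already present.

```python
from bs4 import BeautifulSoup


def add_thead_tbody(soup, table):
    """Move leading all-<th> rows into a new <thead>, the rest into a <tbody>."""
    # Assumption: a table that already has sections should not be modified.
    if table.find(['thead', 'tbody'], recursive=False):
        return table
    thead = soup.new_tag('thead')
    tbody = soup.new_tag('tbody')
    in_header = True
    for tr in table.find_all('tr', recursive=False):
        cells = tr.find_all(['th', 'td'], recursive=False)
        if in_header and cells and all(c.name == 'th' for c in cells):
            thead.append(tr)  # append() moves the row out of the table
        else:
            in_header = False
            tbody.append(tr)
    # Only attach sections that actually received rows.
    if thead.contents:
        table.append(thead)
    if tbody.contents:
        table.append(tbody)
    return table


html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = add_thead_tbody(soup, soup.find('table'))
print(table)
# <table><thead><tr><th>header</th></tr></thead><tbody><tr><td>body</td></tr></tbody></table>
```

The trade-off is that this mutates the parsed tree, which is convenient when the rest of the code already expects <thead>, whereas parse_table leaves the document untouched.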
Answered By - t.m.adam