Monday, December 4, 2023

[FIXED] Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

December 04, 2023 beautifulsoup, lxml, python No comments

Issue

I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.

Currently, parse_thead does the following when the <thead> tag is present:

In BeautifulSoup, I get table objects with doc.find_all('table') and I can use table.find_all('thead')
In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead')

and parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.

I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".

I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either

Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)

I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.

(Edit: Examples with no header rows and multiple header rows. I can't assume it has only one header row.)

<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>

Solution

We can use <th> tags to detect headers, in case the table doesn't contain <thead> tags. If all columns of a row are <th> tags then we can assume that it is a header. Based on that I created a function that identifies the header and body.

Code for BeautifulSoup:

def parse_table(table): 
    head_body = {'head':[], 'body':[]}
    for tr in table.select('tr'): 
        if all(t.name == 'th' for t in tr.find_all(recursive=False)): 
            head_body['head'] += [tr]
        else: 
            head_body['body'] += [tr]
    return head_body

Code for lxml:

def parse_table(table): 
    head_body = {'head':[], 'body':[]}
    for tr in table.cssselect('tr'): 
        if all(t.tag == 'th' for t in tr.getchildren()): 
            head_body['head'] += [tr]
        else: 
            head_body['body'] += [tr]
    return head_body

The table parameter is either a Beautiful Soup Tag object or a lxml Element object. head_body is a dictionary that contains two lists of <tr> tags, the header and body rows.

Usage example:

html = '<table><tr><th>heade</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)

print(table_rows)
#{'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}

Answered By - t.m.adam

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, December 4, 2023

[FIXED] Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels