Sunday, April 10, 2022

[FIXED] Retrieving table values from HTML with the same tag names using Beautiful Soup in Python

Labels: beautifulsoup, html, python

Issue

I am trying to retrieve the text of every td in the table below using Beautiful Soup. Unfortunately, the cells all share the same tag name, so I either retrieve only the first element or get some elements printed repeatedly, and I am not sure how to proceed.

Below is the HTML table snippet:

<div>Table</div>
<table class="Auto" width="100%">
    <tr>
       <td class="Auto_head">Address</td>
       <td class="Auto_head">Name</td>
       <td class="Auto_head">Type</td>
       <td class="Auto_head">Value IN</td>
       <td class="Auto_head">AUTO Statement</td>
       <td class="Auto_head">Value OUT</td>
       <td class="Auto_head">RESULT</td>
       <td class="Auto_head"></td>
    </tr>
    <tr>
           <td class="Auto_body">1</td>
           <td class="Auto_body">abc</td>
           <td class="Auto_body">yes</td>
           <td class="Auto_body">abc123</td>
           <td class="Auto_body">jar</td>
           <td class="Auto_body">123abc</td>
           <td class="Auto_body">PASS</td>
           <td class="Auto_body">na</td>
    </tr>

What I want is all the text content inside these tags, paired by position: the first Auto_head cell corresponds to the first Auto_body cell (i.e. Address = 1), and the remaining values should be retrieved the same way.

I have tried find, findAll, findNext, and next_sibling, but with no luck. Here is my current code in Python:

self.table = self.soup_file.findAll(class_="Table")
self.headers = [tab.find(class_="Auto_head").findNext('td',class_="Auto_head").contents[0] for tab in self.table]
self.data = [data.find(class_="Auto_body").findNext('td').contents[0] for data in self.table]

Solution

Get the headers from the first row, then use zip(...) to pair them with the values in each remaining row:

from bs4 import BeautifulSoup

data = '''\
<table class="Auto" width="100%">
    <tr>
       <td class="Auto_head">Address</td>
       <td class="Auto_head">Name</td>
       <td class="Auto_head">Type</td>
    </tr>
    <tr>
           <td class="Auto_body">1</td>
           <td class="Auto_body">abc</td>
           <td class="Auto_body">yes</td>
    </tr>
    <tr>
           <td class="Auto_body">2</td>
           <td class="Auto_body">def</td>
           <td class="Auto_body">no</td>
    </tr>
    <tr>
           <td class="Auto_body">3</td>
           <td class="Auto_body">ghi</td>
           <td class="Auto_body">maybe</td>
    </tr>
</table>
'''

soup = BeautifulSoup(data, 'html.parser')

for table in soup.select('table.Auto'):
    # get rows
    rows = table.select('tr')
    # get headers
    headers = [td.text for td in rows[0].select('td.Auto_head')]
    # get details
    for row in rows[1:]:
        values = [td.text for td in row.select('td.Auto_body')]
        print(dict(zip(headers, values)))

My output:

{'Address': '1', 'Name': 'abc', 'Type': 'yes'}
{'Address': '2', 'Name': 'def', 'Type': 'no'}
{'Address': '3', 'Name': 'ghi', 'Type': 'maybe'}
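
The same idea can be applied to the eight-column table from the question. The following is only a sketch under a few assumptions: the closing </table> tag is added so the snippet is complete, the table has a single body row as shown in the question, and the placeholder key column_N for the unnamed last header is a hypothetical naming choice. It also illustrates why find() on its own yields only the first cell, while find_all() returns every match.

```python
from bs4 import BeautifulSoup

# Header and body rows copied from the question; the closing </table>
# is an assumption added to make the snippet complete.
html = '''\
<table class="Auto" width="100%">
    <tr>
       <td class="Auto_head">Address</td>
       <td class="Auto_head">Name</td>
       <td class="Auto_head">Type</td>
       <td class="Auto_head">Value IN</td>
       <td class="Auto_head">AUTO Statement</td>
       <td class="Auto_head">Value OUT</td>
       <td class="Auto_head">RESULT</td>
       <td class="Auto_head"></td>
    </tr>
    <tr>
       <td class="Auto_body">1</td>
       <td class="Auto_body">abc</td>
       <td class="Auto_body">yes</td>
       <td class="Auto_body">abc123</td>
       <td class="Auto_body">jar</td>
       <td class="Auto_body">123abc</td>
       <td class="Auto_body">PASS</td>
       <td class="Auto_body">na</td>
    </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first matching tag, so it only ever sees "Address".
first_header = soup.find('td', class_='Auto_head').get_text(strip=True)

# find_all() returns every matching tag in document order.
headers = [td.get_text(strip=True) for td in soup.find_all('td', class_='Auto_head')]
values = [td.get_text(strip=True) for td in soup.find_all('td', class_='Auto_body')]

# Pair headers and values by position. The last header cell is empty, so it
# gets a placeholder key (the "column_N" name is a hypothetical choice).
record = {
    header or f'column_{index}': value
    for index, (header, value) in enumerate(zip(headers, values), start=1)
}
print(record)
```

Collecting all Auto_body cells in one find_all() call only works here because there is a single body row. With several body rows, iterating row by row as in the loop above keeps each row's values together.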

Answered By - Justin Ezequiel

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
