Issue
I have the Python script below, in which the table variable holds the entire HTML table and trfirst holds the first row element, which contains my headers. I want to remove trfirst from table and put the remaining rows into a new variable so that I can retrieve the row values. I am using BeautifulSoup4.
The HTML looks like this:
<tr>
<td>
<div> Header 1</div>
</td>
<td>
<div> Header 2</div>
</td>
<td>
<div> Header 3</div>
</td>
</tr>
<tr>
<td>
<div> Row 1</div>
</td>
<td>
<div> Row 1</div>
</td>
<td>
<div> Row 1</div>
</td>
</tr>
<tr>
<td>
<div> Row 2</div>
</td>
<td>
<div> Row 2</div>
</td>
<td>
<div> Row 2</div>
</td>
</tr>
Python:
from bs4 import BeautifulSoup

url = "C:/Test.html"
html = open(url, "r", encoding='utf-8').read()
soup = BeautifulSoup(html, features='lxml')
table = soup.select_one("table")
trfirst = table.find("tr")
# trrest = ??? (table - trfirst)
Solution
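You can use nth-child and nth-child ranges in CSS selectors to separate the headers from the body: tr:nth-child(1) matches the first row, and tr:nth-child(n+2) matches every row from the second onward.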
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')  # or 'html.parser'
headers = [i.text.strip() for i in soup.select('tr:nth-child(1) td div')]
print(headers)
data = [[j.text.strip() for j in r.select('td div')] for r in soup.select('tr:nth-child(n+2)')]
print(data)
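Another option, closer to the trfirst / trrest variables in the question, is to detach the header row from the tree with extract() and then collect whatever rows remain. The following is only a minimal sketch: the sample markup, the enclosing table tag, and the use of 'html.parser' are assumptions made so the example runs on its own.

from bs4 import BeautifulSoup

# Sample markup mirroring the question; the enclosing <table> is an assumption.
html = """
<table>
<tr><td><div> Header 1</div></td><td><div> Header 2</div></td><td><div> Header 3</div></td></tr>
<tr><td><div> Row 1</div></td><td><div> Row 1</div></td><td><div> Row 1</div></td></tr>
<tr><td><div> Row 2</div></td><td><div> Row 2</div></td><td><div> Row 2</div></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table")

# extract() removes the first row from the tree and returns it
trfirst = table.find("tr").extract()
# Only the data rows are left in the table now
trrest = table.find_all("tr")

headers = [td.get_text(strip=True) for td in trfirst.find_all("td")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in trrest]
print(headers)
print(rows)

Note that extract() modifies the parsed tree in place. If the table must stay intact, slicing table.find_all("tr")[1:] gives the same data rows without removing anything.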
Answered By - QHarr