Issue
I receive an HTML table that always has the same shape. Only the values differ each time.
html = '''
<table align="center">
<tr>
<th>Name</th>
<td>NAME A</td>
<th>Status</th>
<td class="IN PROGRESS">IN PROGRESS</td>
</tr>
<tr>
<th>Category</th>
<td COLSPAN="3">CATEGORY A</td>
</tr>
<tr>
<th>Creation date</th>
<td>13/01/23 23:00</td>
<th>End date</th>
<td></td>
</tr>
</table>
'''
I need to convert it to a dataframe, but pandas gives me a weird format:
print(pd.read_html(html)[0])
               0               1           2            3
0           Name          NAME A      Status  IN PROGRESS
1       Category      CATEGORY A  CATEGORY A   CATEGORY A
2  Creation date  13/01/23 23:00    End date          NaN
I feel like we need to use BeautifulSoup, but I'm not sure how:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
Can you help me with that?
My expected output is this dataframe:
     Name    Category   Status     Creation date  End date
0  NAME A  CATEGORY A  RUNNING  27/07/2023 11:43       NaN
Solution
Based on your example, you could iterate over the <td> elements, store each one's text in a dict keyed by the text of its previous sibling <th>, and create your dataframe from that dict:
{e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
Example
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<table align="center">
<tr>
<th>Name</th>
<td>NAME A</td>
<th>Status</th>
<td class="IN PROGRESS">IN PROGRESS</td>
</tr>
<tr>
<th>Category</th>
<td COLSPAN="3">CATEGORY A</td>
</tr>
<tr>
<th>Creation date</th>
<td>13/01/23 23:00</td>
<th>End date</th>
<td></td>
</tr>
</table>
'''
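# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')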
pd.DataFrame(
[
{e.find_previous_sibling('th').text:e.text for e in soup.select('table td')}
]
)
Result
 | Name | Status | Category | Creation date | End date
---|---|---|---|---|---
0 | NAME A | IN PROGRESS | CATEGORY A | 13/01/23 23:00 |
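Note that the result above keeps the dict's insertion order and leaves End date as an empty string, whereas the expected output in the question shows NaN and a different column order. A minimal sketch of one way to close that gap follows. The helper name table_to_row and its columns parameter are hypothetical, and the use of html.parser and the empty-string-to-NaN rule are assumptions rather than part of the original answer.

```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

html = '''
<table align="center">
<tr>
<th>Name</th>
<td>NAME A</td>
<th>Status</th>
<td class="IN PROGRESS">IN PROGRESS</td>
</tr>
<tr>
<th>Category</th>
<td COLSPAN="3">CATEGORY A</td>
</tr>
<tr>
<th>Creation date</th>
<td>13/01/23 23:00</td>
<th>End date</th>
<td></td>
</tr>
</table>
'''


def table_to_row(html, columns=None):
    """Turn a <th>/<td> key-value table into a one-row DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for td in soup.select("table td"):
        th = td.find_previous_sibling("th")
        if th is None:
            # Skip any value cell that has no label in front of it
            continue
        text = td.get_text(strip=True)
        # Assumption: an empty cell should become NaN, as in the expected output
        record[th.get_text(strip=True)] = text if text else np.nan
    df = pd.DataFrame([record])
    # Optionally reorder the columns; otherwise keep document order
    return df[columns] if columns else df


df = table_to_row(
    html,
    columns=["Name", "Category", "Status", "Creation date", "End date"],
)
print(df)
```

Passing columns is optional. Without it, the columns come out in the order the cells appear in the table, as in the original answer.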
Answered By - HedgeHog