Issue
This is a snippet from the page I am trying to parse with Pandas using Python:
<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>
</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff2</span> <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
<td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff</span> <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>
<tr class='odd'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
<td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span> <span class='gen' title='extra_info2'>stuff3</span> </td></tr>
</table>
Some cells in the table (under header6 and header9) carry hidden information that is only shown when you hover the mouse over them.
This is what I tried with Pandas:
import pandas as pd

with open("/root/Downloads/adad.html", "r") as content_file:
    f = content_file.read()
dfs = pd.read_html(f)
dfs
My wish is to obtain the following:
[ header1info header2info header3info header4info header5info header6info header7 header8 header9info
0 value1 stuff stuff stuff stuff stuff(extra_info) stuff link1(http://link1) stuff(extra_info) stuff2(extra_info2) out(http://out)
link2(http://link2)
link3(http://link3)
1 value2 stuff2 stuff2 stuff2 stuff2 stuff2 stuff2 link4(http://link4) stuff(extra_info) stuff(extra_info2) out2(http://out)
link5(http://link5)
2 value3 stuff3 stuff3 stuff3 stuff3 stuff3 stuff3 link6(http://link6) stuff3(extra_info) stuff3(extra_info2)]
Is this possible using Pandas? If yes, how can I achieve the desired output?
Sorry, I am not an expert when it comes to Pandas, and I am not sure whether there are other ways to parse this information. The only thing that comes to mind is splitting the lines and picking out the needed pieces by hand, but you can imagine how tedious that would be.
Solution
Short answer: no.
pd.read_html() only reads the text rendered in each cell, not the elements and their attributes (newer pandas versions can also return link targets through the extract_links argument, but not title attributes). To get what you want, use an HTML parser such as bs4 (BeautifulSoup) instead: find the table with class='gene', then iterate over the <tr> and <td> elements inside it. The code looks something like this:
import pandas as pd
from bs4 import BeautifulSoup
source = r"""<!DOCTYPE html><html><head><title>website</title><link rel='stylesheet' type='text/css' href='css/global.css'><META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=UTF-8'></head><body><script src="analyticstracking.js"></script>
</h3><table class='gene'><tr><th>header1<br>info</th>
<th><a href='useful.php#cods'>header2</a><br>info</th><th><a href='useful.php#cods'>header3</a><br>info</th><th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header4</a><br><span class='td'>info</span></th>
<th><a href='http://www.somelink' target=link onclick="trackOutboundLink('http://www.somelink'); return false;">header5</a><br><span class='td'>info</span></th>
<th>header6<br>info</th><th>header7</th><th>header8</th><th><a href='useful.php'>header9<br>info</a></th></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value1'><input type='submit' value='value1'></form></td>
<td>stuff</td><td>stuff</td><td>stuff</td><td>stuff</td><td class='center'><span class='dm' title='some extra info'>stuff</span> </td><td>stuff</td><td><a href='http://www.link1' target=ref onclick="trackOutboundLink('http://www.link1'); return false;">link1</a><br><span class='td'><a href='http://www.link2' target=ref onclick="trackOutboundLink('http://www.link2'); return false;">link2</span><br><span class='td'><a href='http://www.link3' target=ref onclick="trackOutboundLink('http://www.link3'); return false;">link3</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff2</span> <a href='http://www.out' target='out' title='Link to out' onclick="trackOutboundLink('http://www.out'); return false;"><span class='dbs'>out</span></a> </td></tr>
<tr class='even'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value2'><input type='submit' value='value2'></form></td>
<td>stuff2</td><td>stuff2</td><td>stuff2</td><td>stuff2</td><td class='center'><span class='dm' title='some extra info'>stuff2</span> </td><td>stuff2</td><td><a href='http://www.link4' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link4</a><br><span class='td'><a href='http://www.link5' target=ref onclick="trackOutboundLink('http://www.link5'); return false;">link5</span><br></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff</span> <span class='gen' title='extra_info2'>stuff</span> <a href='http://www.out2' target='out2' title='Link to out2' onclick="trackOutboundLink('http://www.out2'); return false;"><span class='dbs'>out2</span></a> </td></tr>
<tr class='odd'><td class='center'><form action='get.php' method='GET'>
<input type='hidden' name='acc' value='value3'><input type='submit' value='value3'></form></td>
<td>stuff3</td><td>stuff3</td><td>stuff3</td><td>stuff3</td><td class='center'><span class='dm' title='extrainfo'>stuff3</span> </td><td>stuff3</td><td><a href='http://www.link6' target=ref onclick="trackOutboundLink('http://www.link6'); return false;">link6</a></td><td style='white-space:nowrap;'> <span class='gen' title='extra_info'>stuff3</span> <span class='gen' title='extra_info2'>stuff3</span> </td></tr>
</table>"""
# Parse the page and locate the table with class='gene'.
soup = BeautifulSoup(source, 'html.parser')
table = soup.find_all("table", {"class": "gene"})
trs = table[0].find_all("tr")

# The first row holds the column headers.
headers = []
for th in trs[0].find_all("th"):
    headers.append(th.text)

# Every following row is a data row.
rows = []
for i in range(1, len(trs)):
    tds = []
    for td in trs[i].find_all("td"):
        a = td.find_all("a")
        spans = td.find_all("span")
        inputs = td.find_all("input")
        ret = ""
        if len(a) != 0 or len(spans) != 0 or len(inputs) != 0:
            # Links: keep the text together with its href.
            if len(a) != 0:
                for link in a:
                    ret += link.text + '(' + link['href'] + ') '
            # Spans: keep the text together with the hover title, if any.
            if len(spans) != 0:
                for span in spans:
                    if span.has_attr('title'):
                        ret += span.text + '(' + span['title'] + ') '
            # Forms: take the value of the hidden input.
            if len(inputs) != 0:
                for inp in inputs:
                    if inp.has_attr('value'):
                        if inp.has_attr('type'):
                            if inp['type'] == "hidden":
                                ret += inp['value']
        else:
            # Plain cell: use its text, or "NaN" when it is empty.
            ret = td.text if td.text != '' and td.text != '\n' else "NaN"
        tds.append(ret)
    rows.append(tds)

df = pd.DataFrame(rows, columns=headers)
df
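Note that the loop above collects all links first and then all spans, so in a cell that mixes both (like the last column) the pieces come out in a different order than they appear on the page. If the order matters, one possible variation is to walk the tags in document order. This is only a minimal sketch: describe_cell is a made-up helper name, and the one-cell cell_html string is a shortened stand-in for the real table.

```python
from bs4 import BeautifulSoup


def describe_cell(td):
    """Return the cell text annotated with href/title/hidden values, in document order.

    Hypothetical helper, not part of bs4 or pandas.
    """
    parts = []
    # find_all with a list of names returns matching tags in document order.
    for tag in td.find_all(["a", "span", "input"]):
        text = tag.get_text(strip=True)
        if tag.name == "a" and tag.has_attr("href"):
            parts.append(f"{text}({tag['href']})")
        elif tag.name == "span" and tag.has_attr("title"):
            parts.append(f"{text}({tag['title']})")
        elif (tag.name == "input" and tag.get("type") == "hidden"
              and tag.has_attr("value")):
            parts.append(tag["value"])
    if parts:
        return " ".join(parts)
    # Plain cell (or only unannotated tags): fall back to the visible text.
    return td.get_text(strip=True)


# Shortened stand-in for one cell of the last column (an assumption for the demo).
cell_html = (
    "<table><tr><td> <span class='gen' title='extra_info'>stuff</span> "
    "<a href='http://www.out' title='Link to out'>"
    "<span class='dbs'>out</span></a> </td></tr></table>"
)
td = BeautifulSoup(cell_html, "html.parser").find("td")
print(describe_cell(td))  # stuff(extra_info) out(http://www.out)
```

To use it with the code above, you could replace the body of the inner loop with tds.append(describe_cell(td)). Empty cells then come back as an empty string rather than "NaN", so adjust that if you rely on the marker.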
Answered By - Damzaky