Issue
I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>
When I convert this to pandas using pd.read_html(tbl), the output looks like this:
     0          1   2
0  265  JonesBlue  29
1  266      Smith  34
I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:
     0        1   2
0  265  jones03  29
1  266  smith01  34
I'm fine with various other outputs (for example, jones03 Jones would be even more helpful), but the unique ID is critical.
Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.
Is there a simple way of accessing this information?
Solution
Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.
Using lxml, you could extract the attribute values with XPath:
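from io import StringIO
import lxml.html as LH
import pandas as pd
content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''
table = LH.fromstring(content)
# recent pandas versions expect a file-like object rather than a literal HTML string
for df in pd.read_html(StringIO(content)):
    # one href per row is assumed, so the list lines up with the DataFrame rows
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep only the file name without its extension, e.g. 'jones03'
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)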
yields
     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01
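The pattern passed to str.extract above, ([^./]+)[.], captures the first run of characters that contains neither a dot nor a slash and is immediately followed by a dot. A minimal sketch of what that does on a single href, using only the standard library, is shown below. The helper names uid_from_regex and uid_from_path are made up for illustration, and the posixpath variant is an alternative I am suggesting, not part of the original answer.

```python
import posixpath
import re

# Same pattern as in the answer: a run of non-dot, non-slash characters
# followed by a literal dot.
HREF_ID = re.compile(r'([^./]+)[.]')


def uid_from_regex(href):
    """Return the first captured group, or None if the pattern does not match."""
    match = HREF_ID.search(href)
    return match.group(1) if match else None


def uid_from_path(href):
    """Hypothetical alternative: take the file name and drop its extension."""
    filename = posixpath.basename(href)
    return posixpath.splitext(filename)[0]


for href in ('/j/jones03.shtml', '/s/smith01.shtml'):
    print(href, uid_from_regex(href), uid_from_path(href))
```

Both helpers agree on the hrefs in the question. They differ if a directory name contains a dot: for a path such as /a.b/jones03.shtml the regex stops at the first dot and returns a, while the path-based version still returns jones03.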
The above may be useful since it requires only a few extra lines of code to add the refname column. But both LH.fromstring and pd.read_html parse the HTML, so the same work is done twice. Its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:
table = LH.fromstring(content)
# extract the text from `<td>` tags, stripping surrounding whitespace
data = [[elt.text_content().strip() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
# text_content() returns strings, so convert the numeric columns
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
print(df)
yields
    id       name  val  refname
0  265  JonesBlue   29  jones03
1  266      Smith   34  smith01
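Both versions above assign table.xpath('//tr/td/a/@href') to a column directly, which only lines up if every row contains exactly one link. A sketch of a per-row variant follows, assuming that some rows may have no link at all. It also keeps the link text, which gives the "jones03 Jones" style of output mentioned in the question. The third table row, the helper uid_from_href, and the link_text column are made up for illustration.

```python
import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
<tr>
<td>267</td>
<td>NoLink</td>
<td>41</td>
</tr>
</tbody>
</table>'''


def uid_from_href(href):
    """'/j/jones03.shtml' -> 'jones03' (hypothetical helper)."""
    return href.rsplit('/', 1)[-1].split('.', 1)[0]


table = LH.fromstring(content)
records = []
for tr in table.xpath('//tr'):
    # visible text of every cell in this row
    cells = [td.text_content().strip() for td in tr.xpath('td')]
    # first link in the row that actually has an href, if there is one
    links = tr.xpath('td/a[@href]')
    first = links[0] if links else None
    refname = uid_from_href(first.get('href')) if first is not None else None
    link_text = first.text if first is not None else None
    records.append(cells + [refname, link_text])

df = pd.DataFrame(records, columns=['id', 'name', 'val', 'refname', 'link_text'])
print(df)
```

Working row by row keeps the id and the cell text paired even when a link is missing, at the cost of a slightly longer loop.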
Answered By - unutbu