Issue
I have this HTML file which was obtained from a website that has financial data.
<table class="tableFile2" summary="Results">
<tr>
<td nowrap="nowrap">
13F-HR
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-05-15
</td>
<td nowrap="nowrap">
<a href="URL">
028-10098
</a>
<br/>
19827821
</td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">
13F-HR
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-14
</td>
<td nowrap="nowrap">
<a href="URL">
028-10098
</a>
<br/>
19606811
</td>
</tr>
<tr>
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
<tr>
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
</table>
I am trying to extract only the rows where one of the cells contains the text 13F. Once I have the correct rows, I want to save the date and the href into a list for later processing. So far my scraper successfully locates the specific table, but I am having trouble filtering rows by my criterion. When I add a conditional, it seems to be ignored and every row is still included.
r = requests.get(url)
soup = BeautifulSoup(open("data/testHTML.html"), 'html.parser')
table = soup.find('table', {"class": "tableFile2"})
rows = table.findChildren("tr")
for row in rows:
    cell = row.findNext("td")
    if cell.text.find('13F'):
        print(row)
Ideally I am trying to get an output similar to this
[13F-HR, URL, 2019-05-15],[13F-HR, URL, 2019-02-14]
Solution
Use a regular expression (Python's re module) to match the text of the cell.
from bs4 import BeautifulSoup
import re
data='''<table class="tableFile2" summary="Results">
<tr>
<td nowrap="nowrap">
13F-HR
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-05-15
</td>
<td nowrap="nowrap">
<a href="URL">
028-10098
</a>
<br/>
19827821
</td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">
13F-HR
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-14
</td>
<td nowrap="nowrap">
<a href="URL">
028-10098
</a>
<br/>
19606811
</td>
</tr>
<tr>
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
<tr>
<td nowrap="nowrap">
SC 13G/A
</td>
<td nowrap="nowrap">
<a href="URL" id="documentsbutton">
Documents
</a>
</td>
<td>
2019-02-13
</td>
<td>
</td>
</tr>
</table>'''
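soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', {"class": "tableFile2"})
rows = table.find_all('tr')
final_items = []
for row in rows:
    items = []
    # First <td> in this row whose text matches '13F', or None if there is none.
    # (Newer Beautiful Soup releases prefer the name string= for this argument.)
    cell = row.find('td', text=re.compile('13F'))
    if cell:
        items.append(cell.text.strip())            # filing type, e.g. 13F-HR
        items.append(cell.find_next('a')['href'])  # href of the Documents link
        # The date sits in the first <td> after the Documents link.
        items.append(cell.find_next('a').find_next('td').text.strip())
        final_items.append(items)
print(final_items)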
Output:
[['13F-HR', 'URL', '2019-05-15'], ['13F-HR', 'URL', '2019-02-14']]
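As for why the condition in the question let every row through: str.find returns the index of the match, or -1 when there is no match. -1 is truthy, so the test passes for rows without 13F as well, and it would only fail when the match sits at index 0. A minimal sketch of the difference, assuming the cell text looks like the sample rows above (compare_checks is a hypothetical helper used only for illustration):

```python
def compare_checks(text, needle="13F"):
    """Return (find() index, truthiness of that index, result of the `in` test)."""
    index = text.find(needle)  # position of the match, or -1 if absent
    return index, bool(index), needle in text


# Cell text roughly as Beautiful Soup returns it for the sample rows,
# plus one stripped value to show the index-0 case.
for text in ["\n13F-HR\n", "\nSC 13G/A\n", "13F-HR"]:
    print(repr(text), compare_checks(text))
```

So if you want to keep the plain string test instead of a regular expression, `if '13F' in cell.text:` (or `cell.text.find('13F') != -1`) is probably the check that was intended.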
Answered By - KunduK