Saturday, October 16, 2021

[FIXED] Extracting certain values from a row if a cell meets a certain condition

Labels: beautifulsoup, python

Issue

I have this HTML file which was obtained from a website that has financial data.

    <table class="tableFile2" summary="Results">
     <tr>
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-05-15
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19827821
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-14
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19606811
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
    </table>

I am trying to extract only the rows in which one of the cells contains the text 13F. Once I have the correct rows, I want to save the date and the href into a list for later processing. So far my scraper successfully locates the specific table, but I am having trouble filtering rows by my criterion. When I add a conditional, it seems to be ignored and every row is still included.

r = requests.get(url)
soup = BeautifulSoup(open("data/testHTML.html"), 'html.parser')

table = soup.find('table', {"class": "tableFile2"})
rows = table.findChildren("tr")
for row in rows:
    cell = row.findNext("td")
    if cell.text.find('13F'):
        print(row)

Ideally, I would like output similar to this:

[13F-HR, URL, 2019-05-15],[13F-HR, URL, 2019-02-14]

Solution
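
A likely reason the original condition never filters anything: str.find returns the index of the match, or -1 when there is no match, and both values are truthy unless the match sits at index 0. The sketch below illustrates this with two hand-written strings that mimic the first cell's text (the exact whitespace is an assumption), and shows that a membership test with in behaves as intended.

```python
# Stand-ins for cell.text of a 13F row and a non-13F row.
# The leading newline and spaces are assumed, mimicking the indented HTML.
matching = "\n" + " " * 7 + "13F-HR\n"
other = "\n" + " " * 7 + "SC 13G/A\n"

for text in (matching, other):
    index = text.find("13F")  # index of the match, or -1 if absent
    print(repr(text.strip()), index, bool(index), "13F" in text)

# '13F-HR' 8 True True
# 'SC 13G/A' -1 True False
```

Because bool(-1) is True, `if cell.text.find('13F'):` passes for every row here, whereas `if '13F' in cell.text:` only passes for the rows that contain the text.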

Use a regular expression (the re module) to match the text of the cell.

from bs4 import BeautifulSoup
import re
data='''<table class="tableFile2" summary="Results">
     <tr>
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-05-15
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19827821
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-14
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19606811
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
    </table>'''

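# Parse the markup and locate the results table by its class.
soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', {"class": "tableFile2"})
rows = table.find_all('tr')

final_items = []
for row in rows:
    items = []
    # Find a <td> in this row whose text contains '13F'.
    # string= is the current name of the older text= argument (bs4 4.4.0+).
    cell = row.find('td', string=re.compile('13F'))
    if cell:
        items.append(cell.text.strip())            # filing type, e.g. 13F-HR
        items.append(cell.find_next('a')['href'])  # href of the Documents link
        # The next <td> after that link holds the filing date.
        items.append(cell.find_next('a').find_next('td').text.strip())
        final_items.append(items)

print(final_items)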

Output:

 [['13F-HR', 'URL', '2019-05-15'], ['13F-HR', 'URL', '2019-02-14']]
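
As a variation, the same filtering can be sketched without a regular expression by indexing the cells of each row. This assumes the layout shown in the question (filing type in the first cell, Documents link in the second, date in the third). The function name extract_13f_rows and the shortened sample markup are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def extract_13f_rows(html):
    """Return [filing type, href, date] for rows whose first cell mentions 13F."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="tableFile2")
    if table is None:
        return []

    results = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Assumed column order: filing type, documents link, filing date.
        if len(cells) < 3 or "13F" not in cells[0].get_text():
            continue
        link = cells[1].find("a")
        results.append([
            cells[0].get_text(strip=True),
            link.get("href") if link else None,
            cells[2].get_text(strip=True),
        ])
    return results


# Shortened, hypothetical sample with the same structure as the question's table.
sample = """
<table class="tableFile2">
 <tr><td>13F-HR</td><td><a href="URL">Documents</a></td><td>2019-05-15</td><td></td></tr>
 <tr><td>SC 13G/A</td><td><a href="URL">Documents</a></td><td>2019-02-13</td><td></td></tr>
</table>
"""

print(extract_13f_rows(sample))
# [['13F-HR', 'URL', '2019-05-15']]
```

Indexing by position is simpler to read but breaks if the site reorders its columns. The find_next approach above depends on document order instead, so which one is more robust depends on how the page changes.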

Answered By - KunduK

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
