Monday, December 25, 2023

[FIXED] How to Scrape the Value from the <tr><td><font size="2"><strong> tag

beautifulsoup, html-lists, python, web-scraping

Issue

I'm new to web scraping with Python, having started just last week. I have a question about extracting information by HTML tag, particularly when the value sits inside nested tags such as <tr><td><font size="2"><strong>. On the page "https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620", I'd like to scrape the "Website" value located in the bottom-left table.

In my limited experience, I've tried various methods but haven't found a solution. I'm struggling to understand how to approach this. I need to perform this extraction dynamically for a total of 2115 pages. The URL I'm trying to extract is "http://www.abcusd.us" from the provided HTML content. I would greatly appreciate any suggestions you can provide. Thank you.

<tbody><tr>
    <td valign="top" width="220"><b><font size="2">District Name:</font></b><br>
        <font size="3">ABC Unified<br></font><font size="2"><a href="../schoolsearch/school_list.asp?Search=1&amp;DistrictID=0601620">schools for this district</a></font></td>
    <td valign="top" width="220">
        <b><font size="2">NCES District ID:</font></b><br><font size="3">0601620</font></td>
    <td valign="top"><b><font size="2">State District ID:</font></b><br><font size="3">
        CA-1964212</font></td>
    </tr>
    

    <tr>
    <td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    <td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    <td valign="top"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
    </tr>
    <tr>    
<td valign="top" width="220"><b><font size="2">Mailing Address:</font></b><br><font size="3">16700 Norwalk BLVD.<br>Cerritos,&nbsp;CA&nbsp;90703-1838</font></td><td valign="top" width="40%"><strong><font size="2">Physical Address:</font> <a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School &amp; District Navigator" target="_blank"><img style="height:20px;vertical-align:middle;margin-bottom:-20px;margin-top:-32px" src="/ccd/commonfiles/images/mapapp_icon.png" title="Map latest data in the School &amp; District Navigator"></a></strong><br><font size="3"><a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School &amp; District Navigator" target="_blank">16700 Norwalk BLVD.<br>Cerritos,&nbsp;CA&nbsp;90703-1838</a></font></td>

    <td valign="top"><b><font size="2">Phone:</font></b><br><font size="3">
        (562)926-5566</font>
    </td>
    </tr>
    <tr>
    <td valign="top">
        <p align=""><b><font size="2">Type: </font></b><br>
        <font size="3">Regular local school district</font>
    </p></td>
    <td valign="top">
        <p align=""><b><font size="2">Status:</font></b><br>
        <font size="3">Open</font>
    </p></td>
    <td valign="top">
        <p align=""><b><font size="2">Total Schools:</font></b><br>
        <font size="3">31</font>
    </p></td>
    </tr>

    <tr>
        <td valign="top">
            <b><font size="2">Supervisory Union #: </font></b><br>
            <font size="3">N/A</font>
        </td>
        <td valign="top" colspan="2">
            <b><font size="2">Grade Span: </font></b>
                <font size="2"> (grades KG - 12)</font>
            <br>
            
            <table><tbody><tr><td><table border="0" bordercolor="#134F8A" cellspacing="0" cellpadding="0" bgcolor="#134F8A"><tbody><tr><td width="100%" bordercolor="#134F8A"><table border="0" cellspacing="1" cellpadding="0"><tbody><tr><td width="16" align="center" bgcolor="#23619E"><font size="2">&nbsp;&nbsp;</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">KG</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">1</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">2</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">3</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">4</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">5</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">6</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">7</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">8</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">9</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">10</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">11</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">12</font></td></tr></tbody></table></td></tr></tbody></table></td></tr></tbody></table>
        </td>
    </tr>

<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td><td valign="top" width="40%"><strong><font size="2">District Demographics:</font><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><img style="height:16px;vertical-align:bottom;margin-bottom:4px;margin-left:2px" src="/ccd/commonfiles/images/ddg_icon.png" title="View data for your district in the School District Demographic Dashboard"></a></strong><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><br><font size="2">School District Demographic Dashboard</font></a></td></tr><tr></tr>

    </tbody>

import requests
from bs4 import BeautifulSoup

c_url = 'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'

response = requests.get(c_url)

if response.status_code == 200:
    n_soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all <strong> tags containing "Website:"
    website_strong_tags = n_soup.find_all('strong', text=lambda text: text and "Website:" in text)
    
    # Check if there are any matching <strong> tags
    if website_strong_tags:
        # Get the first matching <strong> tag
        first_website_strong_tag = website_strong_tags[0]
        
        # Extract the "Website" value
        website_value = first_website_strong_tag.next_sibling.strip()
        print("Website:", website_value)
        
        # Find the <a> tag within the same <td> containing the "Website:" text
        a_tag = first_website_strong_tag.find_next('a')
        
        # Check if the <a> tag exists before extracting the URL
        if a_tag:
            url = a_tag['href']
            print("URL:", url)
        else:
            print("No URL found for the 'Website:' link.")
    else:
        print("No 'Website:' value found on the page.")
else:
    print("Failed to retrieve data from the URL. Status code:", response.status_code)

Error:

TypeError                                 Traceback (most recent call last)
Cell In[240], line 20
     17 first_website_strong_tag = website_strong_tags[0]
     19 # Extract the "Website" value
---> 20 website_value = first_website_strong_tag.next_sibling.strip()
     21 print("Website:", website_value)
     23 # Find the <a> tag within the same <td> containing the "Website:" text

Expected: Website: http://www.abcusd.us

Solution

I thought XPath might be a nice approach, but I quickly read that BeautifulSoup does not support it natively:

Can we use XPath with BeautifulSoup? Technically, no. But we can use BeautifulSoup4 with lxml Python library to achieve that.

Source

So, if you are willing to try, something like this might work. I have tested my XPath against your website, but not the rest of the code:

import requests
from bs4 import BeautifulSoup
from lxml import etree

c_url = \
    'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'

response = requests.get(c_url)

if response.status_code == 200:

    soup = BeautifulSoup(response.content, 'html.parser')
    body = soup.find('body')

    dom = etree.HTML(str(body))  # Parse the HTML content of the page
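
    # The XPath which goes to the website URL
    xpath_str = \
        '//strong[text()="Website: "]/parent::font/following-sibling::font/a'
    links = dom.xpath(xpath_str)
    if links:  # xpath() returns a list, which is empty when nothing matches
        print('Website:', links[0].text)
    else:
        print("No 'Website:' link found on the page.")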
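
Since only the XPath was tested against the live page, it may help to check the expression offline against the "Website" row quoted in the question. The following is a minimal sketch, assuming lxml is installed. SAMPLE_HTML is that row wrapped in a <table> element, and website_from_html and XPATH_LOOSE are hypothetical names introduced here for illustration; the looser expression is only a guess at what might tolerate whitespace differences on other district pages.

```python
from lxml import etree

# The "Website" row from the question, wrapped in <table> so it parses as a table row.
SAMPLE_HTML = (
    '<table><tr><td><font size="2"><strong>Website: </strong><br></font>'
    '<font size="2"><a href="/transfer.asp?location=www.abcusd.us" '
    'target="_blank">http://www.abcusd.us</a></font></td></tr></table>'
)

# The XPath from the answer: it requires the label text to be exactly "Website: ".
XPATH_EXACT = '//strong[text()="Website: "]/parent::font/following-sibling::font/a'

# Hypothetical looser variant (an assumption, not tested against the live site):
# it matches any <strong> whose text contains "Website:".
XPATH_LOOSE = ('//strong[contains(normalize-space(.), "Website:")]'
               '/parent::font/following-sibling::font/a')


def website_from_html(html, xpath=XPATH_EXACT):
    """Return the text of the first link matched by xpath, or None if nothing matches."""
    dom = etree.HTML(html)
    links = dom.xpath(xpath)  # always a list; empty when there is no match
    return links[0].text if links else None


if __name__ == '__main__':
    print(website_from_html(SAMPLE_HTML))
    print(website_from_html(SAMPLE_HTML, XPATH_LOOSE))
```

Returning None instead of indexing the result directly avoids an IndexError on pages that have no "Website" row, which seems likely to matter when looping over 2115 pages.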

Answered By - BernardV
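
If adding lxml is not an option, the same value can probably be reached with BeautifulSoup alone. The traceback in the question is truncated, but the TypeError most likely comes from next_sibling: the node right after the <strong> tag is the <br> element, not a string, and on a BeautifulSoup Tag an unknown attribute such as .strip is treated as a child-tag lookup that yields None, so calling it fails. The sketch below skips next_sibling and uses find_next('a'), which the question's code already tried as a second step. It runs against the "Website" row copied from the question; website_from_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The "Website" row from the question, wrapped in <table>.
SAMPLE_HTML = (
    '<table><tr><td><font size="2"><strong>Website: </strong><br></font>'
    '<font size="2"><a href="/transfer.asp?location=www.abcusd.us" '
    'target="_blank">http://www.abcusd.us</a></font></td></tr></table>'
)


def website_from_soup(soup):
    """Return the link text that follows the 'Website:' label, or None."""
    label = soup.find('strong', string=lambda s: s and 'Website:' in s)
    if label is None:
        return None
    # find_next() walks forward in document order, so it skips the <br>
    # and crosses into the second <font> element.
    link = label.find_next('a')
    return link.get_text(strip=True) if link else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The node after <strong> is a Tag (<br>), which has no string methods.
sibling = soup.find('strong').next_sibling
print(isinstance(sibling, Tag), sibling.name)

print('Website:', website_from_soup(soup))
```

On a live page, the soup would be built from response.text as in the question, and the helper would be called once per page.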

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
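
Two details from the question may still need handling: the link's href is "/transfer.asp?location=www.abcusd.us" rather than the site address itself, and the extraction has to run over 2115 pages. The sketch below uses only the standard library. It rests on two assumptions: that the "location" query parameter always carries the target host, and that only the ID2 parameter changes between district pages. The helper names real_website and with_district_id, and the ID 0600001 in the usage lines, are hypothetical.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit


def real_website(href, default_scheme='http'):
    """Pull the target out of a '/transfer.asp?location=...' href, or return None."""
    locations = parse_qs(urlsplit(href).query).get('location')
    if not locations:
        return None
    target = locations[0]
    # The sample href carries a bare host name, so add a scheme when one is missing.
    if '://' not in target:
        target = f'{default_scheme}://{target}'
    return target


def with_district_id(url, district_id):
    """Return url with its ID2 query parameter replaced, keeping every other parameter."""
    parts = urlsplit(url)
    # parse_qsl keeps repeated keys such as DistrictType in their original order.
    pairs = [(key, district_id if key == 'ID2' else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == '__main__':
    print(real_website('/transfer.asp?location=www.abcusd.us'))
    base = ('https://nces.ed.gov/ccd/districtsearch/district_detail.asp'
            '?Search=1&details=1&DistrictType=1&DistrictType=2&ID2=0601620')
    print(with_district_id(base, '0600001'))  # 0600001 is a made-up district ID
```

For the full run, the list of district IDs would have to come from elsewhere, for example the search result pages, and a short pause between requests is probably a good idea.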
