Issue
I'm new to web scraping with Python, having started just last week. I have a question about extracting information by HTML tag, particularly when dealing with <strong> tags. On the page "https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620", I'd like to scrape the "Website" value located in the bottom-left table.
I've tried various methods without success, and I'm not sure how to approach this. I need to run the extraction for 2115 pages in total, so it has to work dynamically. The value I'm trying to extract from the HTML below is "http://www.abcusd.us". Any suggestions would be greatly appreciated. Thank you.
<tbody><tr>
<td valign="top" width="220"><b><font size="2">District Name:</font></b><br>
<font size="3">ABC Unified<br></font><font size="2"><a href="../schoolsearch/school_list.asp?Search=1&DistrictID=0601620">schools for this district</a></font></td>
<td valign="top" width="220">
<b><font size="2">NCES District ID:</font></b><br><font size="3">0601620</font></td>
<td valign="top"><b><font size="2">State District ID:</font></b><br><font size="3">
CA-1964212</font></td>
</tr>
<tr>
<td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
<td valign="top" width="220"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
<td valign="top"><img border="0" src="/ccd/commonfiles/images/spacer.gif" width="34" height="1"></td>
</tr>
<tr>
<td valign="top" width="220"><b><font size="2">Mailing Address:</font></b><br><font size="3">16700 Norwalk BLVD.<br>Cerritos, CA 90703-1838</font></td><td valign="top" width="40%"><strong><font size="2">Physical Address:</font> <a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School & District Navigator" target="_blank"><img style="height:20px;vertical-align:middle;margin-bottom:-20px;margin-top:-32px" src="/ccd/commonfiles/images/mapapp_icon.png" title="Map latest data in the School & District Navigator"></a></strong><br><font size="3"><a href="/ccd/schoolmap/#district_ids=0601620" title="Map latest data in the School & District Navigator" target="_blank">16700 Norwalk BLVD.<br>Cerritos, CA 90703-1838</a></font></td>
<td valign="top"><b><font size="2">Phone:</font></b><br><font size="3">
(562)926-5566</font>
</td>
</tr>
<tr>
<td valign="top">
<p align=""><b><font size="2">Type: </font></b><br>
<font size="3">Regular local school district</font>
</p></td>
<td valign="top">
<p align=""><b><font size="2">Status:</font></b><br>
<font size="3">Open</font>
</p></td>
<td valign="top">
<p align=""><b><font size="2">Total Schools:</font></b><br>
<font size="3">31</font>
</p></td>
</tr>
<tr>
<td valign="top">
<b><font size="2">Supervisory Union #: </font></b><br>
<font size="3">N/A</font>
</td>
<td valign="top" colspan="2">
<b><font size="2">Grade Span: </font></b>
<font size="2"> (grades KG - 12)</font>
<br>
<table><tbody><tr><td><table border="0" bordercolor="#134F8A" cellspacing="0" cellpadding="0" bgcolor="#134F8A"><tbody><tr><td width="100%" bordercolor="#134F8A"><table border="0" cellspacing="1" cellpadding="0"><tbody><tr><td width="16" align="center" bgcolor="#23619E"><font size="2"> </font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">KG</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">1</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">2</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">3</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">4</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">5</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">6</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">7</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">8</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">9</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">10</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">11</font></td><td width="16" align="center" bgcolor="#F5F196"><font size="2">12</font></td></tr></tbody></table></td></tr></tbody></table></td></tr></tbody></table>
</td>
</tr>
<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td><td valign="top" width="40%"><strong><font size="2">District Demographics:</font><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><img style="height:16px;vertical-align:bottom;margin-bottom:4px;margin-left:2px" src="/ccd/commonfiles/images/ddg_icon.png" title="View data for your district in the School District Demographic Dashboard"></a></strong><a href="/Programs/Edge/ACSDashboard/0601620" title="View data for your district in the School District Demographic Dashboard" target="_blank"><br><font size="2">School District Demographic Dashboard</font></a></td></tr><tr></tr>
</tbody>
import requests
from bs4 import BeautifulSoup
c_url = 'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'
response = requests.get(c_url)
if response.status_code == 200:
    n_soup = BeautifulSoup(response.text, 'html.parser')
    # Find all <strong> tags containing "Website:"
    website_strong_tags = n_soup.find_all('strong', text=lambda text: text and "Website:" in text)
    # Check if there are any matching <strong> tags
    if website_strong_tags:
        # Get the first matching <strong> tag
        first_website_strong_tag = website_strong_tags[0]
        # Extract the "Website" value
        website_value = first_website_strong_tag.next_sibling.strip()
        print("Website:", website_value)
        # Find the <a> tag within the same <td> containing the "Website:" text
        a_tag = first_website_strong_tag.find_next('a')
        # Check if the <a> tag exists before extracting the URL
        if a_tag:
            url = a_tag['href']
            print("URL:", url)
        else:
            print("No URL found for the 'Website:' link.")
    else:
        print("No 'Website:' value found on the page.")
else:
    print("Failed to retrieve data from the URL. Status code:", response.status_code)
Error: TypeError Traceback (most recent call last)
Cell In[240], line 20
17 first_website_strong_tag = website_strong_tags[0]
19 # Extract the "Website" value
---> 20 website_value = first_website_strong_tag.next_sibling.strip()
21 print("Website:", website_value)
23 # Find the <a> tag within the same <td> containing the "Website:" text
Expected: Website: http://www.abcusd.us
Solution
I thought XPath might be a nice approach, but I quickly read that BeautifulSoup does not support it natively:
Can we use XPath with BeautifulSoup? Technically, no. But we can use BeautifulSoup4 with lxml Python library to achieve that.
So, if you are willing to try, something like this might work. I have tested my XPath against your website, but not the rest of the code:
import requests
from bs4 import BeautifulSoup
from lxml import etree
c_url = \
    'https://nces.ed.gov/ccd/districtsearch/district_detail.asp?Search=1&details=1&State=06&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620'
response = requests.get(c_url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    body = soup.find('body')
    dom = etree.HTML(str(body))  # Parse the HTML content of the page
    xpath_str = \
        '//strong[text()="Website: "]/parent::font/following-sibling::font/a'  # The XPath which goes to the website URL
    matches = dom.xpath(xpath_str)  # xpath() takes the expression and returns a list of elements
    if matches:
        print('Website:', matches[0].text)
    else:
        print("No 'Website:' link found on the page.")
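For comparison, the same value can probably be reached with BeautifulSoup alone. The TypeError in the question most likely comes from the fact that the node right after <strong>Website: </strong> is a <br> tag rather than a string: BeautifulSoup treats .strip on a tag as a search for a child element named "strip", finds none, returns None, and calling None fails. The sketch below runs on a trimmed copy of the HTML from the question rather than the live page, so whether it matches every one of the 2115 pages is untested. The helper name extract_website is made up for illustration, and string= is used as the newer spelling of the text= argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the relevant table row from the question's HTML.
SAMPLE_HTML = """
<table><tbody>
<tr><td><font size="2"><strong>Website: </strong><br></font><font size="2"><a href="/transfer.asp?location=www.abcusd.us" target="_blank">http://www.abcusd.us</a></font></td>
<td valign="top" width="40%"><strong><font size="2">District Demographics:</font></strong></td></tr>
</tbody></table>
"""


def extract_website(html):
    """Return the text of the link in the 'Website:' cell, or None (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the <strong> label by its string content.
    label = soup.find("strong", string=lambda s: s and "Website:" in s)
    if label is None:
        return None
    # The link is not a sibling of <strong>; it sits elsewhere in the same <td>.
    cell = label.find_parent("td")
    link = cell.find("a") if cell else None
    return link.get_text(strip=True) if link else None


print("Website:", extract_website(SAMPLE_HTML))
```

Stepping up to the enclosing <td> avoids depending on the exact <font> nesting, which may differ from page to page.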
Answered By - BernardV
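Since the question mentions 2115 pages, and the link's href is a redirect path (/transfer.asp?location=www.abcusd.us) rather than the site address itself, two small standard-library helpers may be useful alongside either approach. This is only a sketch under assumptions: detail_url and website_from_href are hypothetical names, it is assumed (not verified against the NCES site) that swapping the ID2 parameter is enough to reach another district's page, the ID 0600001 is made up, and the http:// prefix is inferred from the link text in the question.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

# URL from the question; ID2 is assumed to be the parameter that selects the district.
BASE_URL = (
    "https://nces.ed.gov/ccd/districtsearch/district_detail.asp"
    "?Search=1&details=1&State=06"
    "&DistrictType=1&DistrictType=2&DistrictType=3&DistrictType=4&DistrictType=5"
    "&DistrictType=6&DistrictType=7&DistrictType=8&DistrictType=9"
    "&NumOfStudentsRange=more&NumOfSchoolsRange=more&DistrictPageNum=1&ID2=0601620"
)


def detail_url(district_id, base_url=BASE_URL):
    """Return base_url with its ID2 parameter replaced by district_id."""
    parts = urlsplit(base_url)
    # parse_qsl keeps repeated keys such as DistrictType, in order.
    params = [(k, district_id if k == "ID2" else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(params)))


def website_from_href(href):
    """Pull the target out of a '/transfer.asp?location=...' redirect link."""
    location = parse_qs(urlsplit(href).query).get("location", [""])[0]
    if not location:
        return None
    # Assumption: targets without a scheme are plain http, as in the link text.
    return location if "://" in location else "http://" + location


print(detail_url("0600001"))  # made-up district ID, for illustration only
print(website_from_href("/transfer.asp?location=www.abcusd.us"))
```

A loop over the real district IDs could call detail_url for each one, fetch the page, and pass the resulting HTML to whichever extraction method works best.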