Issue
For better context, the url I am scraping/parsing is: https://www.dreamflows.com/xlist-ca.php
I'm using SwiftSoup to parse the HTML, but from the documentation I'm not sure whether this is possible. The HTML has many rows like the ones below, and the text I need is unfortunately not inside an element (not even a p tag). It always directly follows an img tag.
Specifically,
-
'<img src="./Dreamflows California Cross-Listing_files/pixelshim.gif" width="12" border="0">'
Name of river section
-
Note that there is a similar case where the image is wrapped in a link:
-
<a href="https://www.dreamflows.com/xlist-ca.php#Special_Symbols"><img src="./Dreamflows California Cross-Listing_files/querySym.gif" border="0"></a>
Name of a river section
-
I don't want these.
So in my example, only North Fork Smith River - Low Divide Rd to Gasquet (14.6 miles, III to V-, H&S p69) should be parsed.
Solution
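You could use the selector img[src$="pixelshim.gif"] to match only the images whose src ends with pixelshim.gif, and then pick the text node that follows each match. The querySym.gif images do not match this selector, so the rows you don't want are skipped.
Based on your tags, this example uses BeautifulSoup (Python) just for demonstration. The selector itself is standard CSS, so it should carry over to SwiftSoup's select().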
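from bs4 import BeautifulSoup
import requests

# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
soup = BeautifulSoup(
    requests.get('https://www.dreamflows.com/xlist-ca.php').text,
    'html.parser'
)

# Select only the pixelshim.gif images, then print the node right after each one.
for e in soup.select('img[src$="pixelshim.gif"]'):
    print(e.next_element)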
Output
North Fork Smith River - Low Divide Rd to Gasquet (14.6 miles, III to V-, H&S p69)
Middle Fork Smith River - Siskiyou Gorge (0.8 mile, IV+ to V, AWetState)
Middle Fork Smith River - Patrick Creek Run (8.4 miles, III+ to IV, H&S p71)
North Fork Smith River - Low Divide Rd to Gasquet (14.6 miles, III to V-, H&S p69)
Middle Fork Smith River - Patrick Creek Run (8.4 miles, III+ to IV, H&S p71)
Smith River - Gasquet Run (4.5 miles, II+ to III, Creekin')
...
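As a rough, dependency-free sketch of the same idea, the snippet below swaps BeautifulSoup for Python's standard-library html.parser and runs on an inline sample instead of the live page. The sample markup, the shortened file paths, the "Some other river section" text, and the names ShimTextParser and extract_names are all assumptions made for illustration. Only the two row shapes come from the question.

```python
from html.parser import HTMLParser

# Hypothetical sample modeled on the two row shapes quoted in the question.
SAMPLE_HTML = """
<td><img src="./files/pixelshim.gif" width="12" border="0">
North Fork Smith River - Low Divide Rd to Gasquet (14.6 miles, III to V-, H&amp;S p69)</td>
<td><a href="#Special_Symbols"><img src="./files/querySym.gif" border="0"></a>
Some other river section</td>
"""


class ShimTextParser(HTMLParser):
    """Collect the text that directly follows each pixelshim.gif image."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.names = []
        self._capture = False  # True only right after a matching <img>

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src") or ""
        # Same condition as the CSS selector img[src$="pixelshim.gif"].
        self._capture = tag == "img" and src.endswith("pixelshim.gif")

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> exactly like <img ...>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # A closing tag (such as </a>) means the text is no longer adjacent.
        self._capture = False

    def handle_data(self, data):
        text = data.strip()
        if self._capture and text:
            self.names.append(text)
        self._capture = False


def extract_names(html):
    parser = ShimTextParser()
    parser.feed(html)
    parser.close()
    return parser.names


for name in extract_names(SAMPLE_HTML):
    print(name)
```

Because the capture flag is cleared by any closing tag, the text after the linked querySym.gif image is ignored, which matches the filtering the selector provides.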
Answered By - HedgeHog