Issue
I'm at the end of my rope with a frustrating program, and I'm posting here for help for the first time. Using beautifulsoup4, I'm attempting to scrape a website with no reliable HTML classes or IDs to work with. All I have is the anchor element, and for the example below I am trying to grab the phrase "Where the Red Fern Grows" using only the lowercase text "red fern". In short, I want to identify and collect/print the text of every anchor element (none of which carry a usable class or id) that contains the phrase "Where the Red Fern Grows", without typing out the entire string and while remaining case-insensitive.
I've tried a multitude of things so far, with my greatest success being only a half measure: I was able to collect the very first anchor element that contained the phrase, but despite my best efforts that's as far as I've gotten. I've used both find and find_all, tried re.search with a regex, and tried a number of other things I found in other Stack Overflow answers. No dice. Here's what I have right now:
import bs4
import requests
import re
import pretty_errors
url = "http://fake.site/search.php?req=where+the+red+fern+grows&lg_topic=fakesite&open=0&view=simple&res=25&phrase=1&column=def"
page = requests.get(url)
fernSoup = bs4.BeautifulSoup(page.content, "html.parser")
redFern = "red fern"
print(type(fernSoup))
print(type(redFern))
anchor = fernSoup.find_all("a", class_=False, text=lambda text: text and redFern in text.lower())
print(anchor)
Which outputs as:
<class 'bs4.BeautifulSoup'>
<class 'str'>
[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>]
# This is only the first of three different results, and usually the only one I can access. The other two have the exact same structure, apart from the href URL and id number.
Any advice would be greatly appreciated, and thank you for taking the time to read my post.
Edit: Here are the three anchors I am attempting to access, copy-pasted directly from the output of print(fernSoup):
<td width="500"><a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a></td>
<td width="500"><a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a></td>
<td width="500"><a href="book/index.php?md5=9DD3079644E043E530682DA95C95B999" id="2413155" title="">Where the Red Fern Grows: The Story of Two Dogs and a Boy<br/> <font color="green" face="Times"><i>978-0-307-78156-7, 0307781569, 0553274295, 9780440412670</i></
Solution
To select multiple <a> tags whose text contains "red fern" (case-insensitively), you can pass a function to find_all instead of the text= argument:
from bs4 import BeautifulSoup
html_doc = """
<td width="500"><a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a></td> <td width="500"><a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a></td>
"""
fernSoup = BeautifulSoup(html_doc, "html.parser")
redFern = "red fern"
anchor = fernSoup.find_all(
    # tag.text joins every descendant string, so anchors that wrap the
    # title in extra <br/>/<font> tags still match
    lambda tag: tag.name == "a" and redFern in tag.text.lower()
)
print(anchor)
Prints:
[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>, <a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a>]
Or use a CSS selector (but note that :-soup-contains() matches case-sensitively):
print(fernSoup.select('a:-soup-contains("Red Fern")'))
Prints:
[<a href="book/index.php?md5=82C10FF9DA122C4B1061F83555F3800E" id="796869" title="">Where The Red Fern Grows</a>, <a href="book/index.php?md5=3C96145628CC4759595FB3C1A767673A" id="1157998" title="">Where the Red Fern Grows<br/> <font color="green" face="Times"><i>0553274295</i></font></a>]
Answered By - Andrej Kesely