Sunday, March 20, 2022

[FIXED] Scrape values inside span class webpage with beautifulsoup python

Labels: beautifulsoup, pandas, python, web-scraping

Issue

Hello everyone. I'm trying to scrape a webpage that has a huge number of span elements, most of which hold useless information. Below is the section of span data that I need, but I can't simply call find_all on span because there are hundreds of other spans I don't need.

            <div class="col-md-4">
                <p>
                  <span class="text-muted">File Number</span><br>
                  A-21-897274
                </p>
            </div>
            <div class="col-md-4">
              <p>
                <span class="text-muted">Location</span><br>
                Ohio
              </p>
            </div>
              <div class="col-md-4">
                <p>
                  <span class="text-muted">Date</span><br>
                  07/01/2022
                </p>
              </div>
          </div>

I need the span titles:
File Number, Location, Date

and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"

I need this printed out so I can build a pandas DataFrame, but I can't seem to get only these specific spans printed with their values.

What I've tried:

import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings

    title = title_tag.text.strip()

    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()

    print(title, value)

Output:

File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"

This prints everything I need, BUT it also prints hundreds of other values I don't want or need.

Solution

You can pass a function to soup.find_all to select only the wanted elements, and then use .find_next_sibling() to get the value. For example:

from bs4 import BeautifulSoup


html_doc = """
<div class="col-md-4">
    <p>
      <span class="text-muted">File Number</span><br>
      A-21-897274
    </p>
</div>
<div class="col-md-4">
  <p>
    <span class="text-muted">Location</span><br>
    Ohio
  </p>
</div>
  <div class="col-md-4">
    <p>
      <span class="text-muted">Date</span><br>
      07/01/2022
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }


for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")

Prints:

File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
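
Since the goal in the question is a pandas DataFrame, the same idea can be sketched so that it collects the pairs into a dict instead of printing them. This is only a minimal sketch under a few assumptions: the helper name extract_record and the WANTED tuple are hypothetical, the markup is assumed to look like the fragment above (the value is the first text node after the span), and string=True is used as the newer spelling of text=True (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Same kind of fragment as in the answer above.
html_doc = """
<div class="col-md-4">
  <p>
    <span class="text-muted">File Number</span><br>
    A-21-897274
  </p>
</div>
<div class="col-md-4">
  <p>
    <span class="text-muted">Location</span><br>
    Ohio
  </p>
</div>
<div class="col-md-4">
  <p>
    <span class="text-muted">Date</span><br>
    07/01/2022
  </p>
</div>
"""

# Titles to keep; every other span is skipped (assumed list, adjust as needed).
WANTED = ("File Number", "Location", "Date")


def extract_record(html, wanted=WANTED):
    """Return {title: value} for the wanted span titles (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for span in soup.find_all("span", class_="text-muted"):
        title = span.get_text(strip=True)
        if title not in wanted:
            continue  # one of the many spans we do not need
        # First text node that follows the span (skips the <br> tag).
        value = span.find_next_sibling(string=True)
        record[title] = value.strip() if value is not None else None
    return record


record = extract_record(html_doc)
print(record)
# {'File Number': 'A-21-897274', 'Location': 'Ohio', 'Date': '07/01/2022'}
```

If pandas is installed, passing a list of such dicts to pandas.DataFrame, for example pandas.DataFrame([record]), should give one row per record with the titles as columns.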

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
