Friday, December 8, 2023

[FIXED] Why doesn't my code strip my list when web scraping?

beautifulsoup, python, web-scraping

Issue

I have code that scrapes a Zillow page for each listing's number of bedrooms, number of bathrooms, and size (in sqft). The list I get looks like this:

['1 bd2 ba982 sqft', '3 bds2 ba1,462 sqft', etc.]

but I want it to look like this:

['1bd 2ba 982sqft', '3bds 2ba 1,462sqft', etc.]

What should I change in my code?

import requests
from bs4 import BeautifulSoup
# import gspread

URL = "https://www.zillow.com/san-francisco-ca/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-122.52499667529297%2C%22east%22%3A-122.34166232470703%2C%22south%22%3A37.662044543503555%2C%22north%22%3A37.88836615784793%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A12%7D"

header = {
  "User-Agent": "YOUR AGENT",
  "Accept-Language": "YOUR LANGUAGE"
}

response = requests.get(URL, headers=header)

web_page = response.text

soup = BeautifulSoup(web_page, 'lxml')


# Bedrooms and bathrooms
quantity_dirty = soup.find_all("ul", class_="StyledPropertyCardHomeDetailsList-c11n-8-84-3__sc-1xvdaej-0 eYPFID")
quantity_list_clean = [quantity.getText() for quantity in quantity_dirty if not quantity.getText().startswith('--')]
print(quantity_list_clean)

Solution

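Calling getText() on the whole <ul> concatenates the text of every <li> with no separator, which is why the values run together. Instead, extract the text of each <li> separately with strip=True and join the items with a space: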

import requests
from bs4 import BeautifulSoup

# import gspread

URL = "https://www.zillow.com/san-francisco-ca/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-122.52499667529297%2C%22east%22%3A-122.34166232470703%2C%22south%22%3A37.662044543503555%2C%22north%22%3A37.88836615784793%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22days%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A12%7D"

header = {"User-Agent": "YOUR AGENT", "Accept-Language": "YOUR LANGUAGE"}

response = requests.get(URL, headers=header)

web_page = response.text

soup = BeautifulSoup(web_page, "lxml")


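# Bedrooms, bathrooms and size: one <ul> per property card
quantity_dirty = soup.select(
    "ul.StyledPropertyCardHomeDetailsList-c11n-8-84-3__sc-1xvdaej-0.eYPFID"
)
# Read each <li> on its own (strip=True drops the whitespace inside the item),
# skip placeholder items that start with "--", and join the rest with spaces.
quantity_list_clean = [
    " ".join(
        q.getText(strip=True)
        for q in quantity.select("li")
        if not q.getText().startswith("--")
    )
    for quantity in quantity_dirty
]
print(quantity_list_clean)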

Prints:

[
    "1bd 2ba 982sqft",
    "3bds 2ba 1,462sqft",
    "1bd 1ba 1,310sqft",
    "2bds 3ba 2,860sqft",
    "2bds 1ba 682sqft",
    "2bds 2ba 835sqft",
    "3bds 2ba 1,550sqft",
    "1bd 1ba 819sqft",
    "4bds 4ba 2,568sqft",
]
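
To see the difference without requesting the live page, here is a minimal sketch on a static snippet. The markup, the `details` class name, and the sample values are assumptions made up for illustration; the real Zillow HTML may be structured differently. It uses `get_text`, which is the same method as `getText`, and the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for two property cards; the real page
# uses different class names and may be structured differently.
HTML = (
    '<ul class="details">'
    "<li><b>1</b> <abbr>bd</abbr></li>"
    "<li><b>2</b> <abbr>ba</abbr></li>"
    "<li><b>982</b> <abbr>sqft</abbr></li>"
    "</ul>"
    '<ul class="details">'
    "<li><b>--</b> <abbr>bds</abbr></li>"
    "<li><b>1</b> <abbr>ba</abbr></li>"
    "<li><b>640</b> <abbr>sqft</abbr></li>"
    "</ul>"
)

soup = BeautifulSoup(HTML, "html.parser")
cards = soup.select("ul.details")

# Text of the whole <ul>: the <li> items are concatenated with no separator.
joined = [card.get_text() for card in cards]

# Text of each <li> on its own, joined with a space.
# Items whose text starts with "--" (missing values) are skipped.
cleaned = [
    " ".join(
        li.get_text(strip=True)
        for li in card.select("li")
        if not li.get_text().startswith("--")
    )
    for card in cards
]

print(joined)
print(cleaned)
```

On this snippet the first list reproduces the run-together strings from the question ('1 bd2 ba982 sqft'), while the second gives '1bd 2ba 982sqft'. Note that the "--" filter now drops a single missing item rather than the whole card, so the second card still appears as '1ba 640sqft'.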

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
