Issue
I have been unsuccessful in trying to gather data from Zillow.
Example:
url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
I want to pull information like addresses, prices, Zestimates, and locations for all homes in LA.
I have tried HTML scraping with packages like BeautifulSoup, and I have also tried working with the page's JSON. I'm almost certain Zillow's API won't help here; as I understand it, the API is best suited to looking up a specific property.
I have been able to scrape other sites, but Zillow appears to use dynamic IDs (they change on every refresh), which makes the information harder to target.
UPDATE: I tried the code below, but it still produces no results.
import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
page = requests.get(url)
data = page.content
soup = BeautifulSoup(data, 'html.parser')

for li in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # The results include sponsored listings; you might need to handle those.
        # Ideally, also check for None values, which is not done here.
        print(li.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except AttributeError:
        print('An error occurred')
Solution
It's probably because you're not passing headers.
If you take a look at Chrome's network tab in developer tools, these are the headers that are passed by the browser:
:authority:www.zillow.com
:method:GET
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
However, if you try sending all of them, it'll fail, because requests
doesn't let you send headers beginning with a colon ':'.
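If you prefer to copy the entire header list out of the network tab, you can drop the pseudo-headers programmatically before handing the rest to requests. A minimal sketch (the copied_headers dict and usable_headers name are just illustrations):

# Sketch: filter out HTTP/2 pseudo-headers (names starting with ':')
# from a dict copied out of the browser's network tab.
copied_headers = {
    ':authority': 'www.zillow.com',
    ':method': 'GET',
    'accept-language': 'en-US,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
usable_headers = {k: v for k, v in copied_headers.items() if not k.startswith(':')}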
I skipped just those four pseudo-headers and used the other five in this script. It worked. So try this:
from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
    r = s.get(url, headers=req_headers)
After that, you can use BeautifulSoup
to extract the information you need:
soup = BeautifulSoup(r.content, 'lxml')
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text
address = soup.find('span', {'itemprop': 'address'}).text
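Note that find() only returns the first match on the page. To collect every listing, loop over the photo cards instead. A minimal sketch reusing the selectors above (the zsg-photo-card-* class names may change whenever Zillow updates its markup):

# Collect (price, address) pairs for every listing card on the page.
results = []
for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    price = card.find('span', {'class': 'zsg-photo-card-price'})
    address = card.find('span', {'itemprop': 'address'})
    if price and address:  # sponsored cards may be missing these fields
        results.append((price.text, address.text))

for price, address in results:
    print(price, address)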
Here is a sample of data extracted from that page:
+--------------+---------------------------------------------------------+
| Price        | Address                                                 |
+--------------+---------------------------------------------------------+
| $615,000     | 121 S Hope St APT 435 Los Angeles CA 90012              |
| $330,000     | 4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423 |
| $3,495,000   | 13446 Valley Vista Blvd Sherman Oaks CA 91423           |
| $1,199,000   | 6241 Crescent Park W UNIT 410 Los Angeles CA 90094      |
| $771,472+    | Chase St. And Woodley Ave # HGS0YX North Hills CA 91343 |
| $369,000     | 8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293       |
| $595,000     | 6427 Klump Ave North Hollywood CA 91606                 |
+--------------+---------------------------------------------------------+
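If you want to keep the data rather than just print it, you could write the collected pairs to a CSV file with the standard library. A sketch that assumes the results list built in the loop above (the filename is arbitrary):

import csv

# Assumes `results` is the list of (price, address) tuples collected above.
with open('zillow_la_listings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['price', 'address'])
    writer.writerows(results)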
Answered By - user4066647