Issue
I am gathering housing data from Zillow's website. So far I have gathered data from the first page. As a next step, I am trying to find the link behind the Next button, which will take me to page 2, page 3, and so on. I used Chrome's Inspect feature to locate the Next button, which has the following structure:
<a href="/homes/recently_sold/house_type/47164_rid/0_singlestory/37.720288,-121.859322,37.601788,-121.918888_rect/12_zm/2_p/" class="on" onclick="SearchMain.changePage(2);return false;" id="yui_3_18_1_1_1525048531062_27962">Next</a>
I then used Beautiful Soup's find_all method, filtering on tag "a" and class "on". I used the following code to extract all the links:
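from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver is assumed to hold the path to the ChromeDriver executable (not shown in the question)
driver = webdriver.Chrome(chromedriver)
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
next_button = soup.find_all("a", class_="on")
print(next_button)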
I am not getting any output. Any input on where I am going wrong?
Solution
The class for the next button appears to be off, not on. As such, you could scrape the details of each property and advance through all the pages as follows. This uses the requests library to get the HTML, which should be faster than using a Chrome driver.
from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    # Print the text fragments of each property card on the page
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print(" {}".format(list(div.stripped_strings)))

    # Follow the next button if there is one, otherwise stop
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None
This continues requesting URLs until no next button is found. The output would be of the form:
https://www.zillow.com/homes/Bellevue-WA-98004_rb/
['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']
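The loop above builds the next URL by string concatenation, which works as long as every href is root-relative. As a minimal sketch of a more defensive variant, the step could be handled with urljoin from the standard library, with a set of visited URLs so that a next link pointing back to an earlier page cannot loop forever. The names next_page_url, crawl, get_next_href and max_pages are hypothetical and not part of the answer, and the link table in the usage lines is made up for illustration rather than taken from Zillow.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.zillow.com"


def next_page_url(current_url, href):
    """Return the absolute URL for a next link, or None if there is no link."""
    if not href:
        return None
    # urljoin resolves root-relative, relative and absolute hrefs alike
    return urljoin(current_url, href)


def crawl(start_url, get_next_href, max_pages=50):
    """Yield each page URL once, following next links from start_url.

    get_next_href is a hypothetical callable that takes a page URL and
    returns the href of that page's next button, or None on the last page.
    """
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = next_page_url(url, get_next_href(url))


# Usage with a made-up link table standing in for real HTTP requests
first = BASE_URL + "/homes/Bellevue-WA-98004_rb/"
fake_links = {first: "/homes/Bellevue-WA-98004_rb/2_p/"}
for page in crawl(first, fake_links.get):
    print(page)
```

In the real scraper, get_next_href would wrap the requests.get and soup.find calls from the loop above and return next_button['href'] or None.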
Answered By - Martin Evans