Issue
I manage to scrape all the data from the Airbnb landing page (price, name, ratings, etc.), and I also know how to use a loop with the pagination in order to scrape data from multiple pages.
What I would like to do is scrape data for each specific listing, i.e. the data inside the listing page (description, amenities, etc.).
My idea was to apply the same logic as for the pagination, since I already have a list
of links, but I'm struggling to understand how to do it.
Here is the code to scrape the links:
Imports
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
Getting the page
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(airbnb_url)
driver.maximize_window()
time.sleep(5)
Scraping links
links = []
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div[class="c4mnd7m dir dir-ltr"]'):
    links.append('https://www.airbnb.com' + card.select_one('a[class="ln2bl2p dir dir-ltr"]')['href'])
What I used to extract the "where to sleep" section is this, but I am probably using the wrong tag:
amenities = []
for url in links:
    driver.get(url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    for amenity in soup1.select('div[class="t2pjd0h dir dir-ltr"]'):
        amenities.append(amenity.select_one('div[class="_1r21qb98"]'))
That was my first question; the other one is whether anyone knows how I can scrape the availability of each listing.
Thanks a lot!
Solution
BeautifulSoup is convenient in some situations, but it is not always needed in your scraping process. Also avoid selecting elements by dynamic class names, and replace time.sleep() with Selenium waits (WebDriverWait).
To iterate over all listing pages, I would recommend a while loop to keep your script generic: check in every iteration whether a next page is available, and break out of the loop otherwise. This eliminates the need to manually count pages and entries, as well as the use of a static range().
try:
    next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button + a'))).get_attribute('href')
except Exception:
    next_page = None

#### process your data

if next_page:
    airbnb_url = next_page
else:
    break
To scrape all of the amenities you have to open the modal via button click:
[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))]
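Each row in the amenities modal can contain extra lines (e.g. a description or an "Unavailable" note), which is why split('\n')[0] keeps only the first line of each element's text. A minimal sketch with plain strings standing in for the elements' .text (the example row texts are made up):

```python
# Toy stand-ins for the .text of each amenity row element
row_texts = ['Wifi', 'Kitchen\nUnavailable: Kitchen', 'TV\n32" HDTV']

# Keep only the first line of each row's text
amenities = [t.split('\n')[0] for t in row_texts]
print(amenities)  # ['Wifi', 'Kitchen', 'TV']
```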
Note: To avoid errors caused by other elements intercepting the clicks, check whether you have to handle cookie banners first.
To extract the bedroom information, rely on more static markers such as ids or the HTML structure, and also check whether the element is available. These lines extract all the info in this section and build a dict from each heading and value:
if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div'):
    sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div').stripped_strings)
    d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
else:
    d.update({'Bedroom': None})
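To see how the zip of the even- and odd-indexed slices pairs headings with values, here is a toy run with a hand-written sleep_areas list (in the real script the list comes from stripped_strings):

```python
# Alternating heading / value strings, as stripped_strings would yield them
sleep_areas = ['Common space', '1 sofa bed', 'Bedroom', '1 double bed']

# Pair every even-indexed heading with the value that follows it
d = dict(zip(sleep_areas[0::2], sleep_areas[1::2]))
print(d)  # {'Common space': '1 sofa bed', 'Bedroom': '1 double bed'}
```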
Example
Just to point in a direction, and so that not everybody has to run a full scrape, I limited this example to urls[:1] objects per page; simply remove the [:1] to get all results.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--lang=en")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver.maximize_window()

data = []
while True:
    driver.get(airbnb_url)
    # collect the listing urls, deduplicated via set()
    urls = list(set(a.get_attribute('href') for a in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a')))))
    try:
        next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button + a'))).get_attribute('href')
    except Exception:
        next_page = None
    print('Scraping listings, next page: ' + str(next_page))

    for url in urls[:1]:
        driver.get(url)
        # open the amenities modal so all amenities are in the page source
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="AMENITIES_DEFAULT"] button'))).click()
        soup = BeautifulSoup(driver.page_source, 'lxml')
        d = {
            'title': soup.h1.text,
            'amenities': [i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-testid="modal-container"] [id$="-row-title"]')))]
        }
        if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div'):
            sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div').stripped_strings)
            d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
        else:
            d.update({'Bedroom': None})
        data.append(d)

    if next_page:
        airbnb_url = next_page
    else:
        break

pd.DataFrame(data)
Output
| title | amenities | Common space | Bedroom | Living room |
---|---|---|---|---|---|
0 | 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY | ['', 'Shampoo', 'Essentials', 'Hangers', 'Iron', 'TV', 'Air conditioning', 'Heating', 'Smoke alarm', 'Carbon monoxide alarm', 'Wifi', 'Dedicated workspace', 'Cooking basics', 'Long term stays allowed', 'Unavailable: Security cameras on property', 'Unavailable: Kitchen', 'Unavailable: Washer', 'Unavailable: Private entrance'] | 1 sofa bed | nan | nan |
4 | ASOPOO STUDIO | ['Hair dryer', 'Shampoo', 'Hot water', 'Essentials', '', 'Bed linens', 'Iron', 'TV', 'Heating', 'Wifi', 'Kitchen', 'Refrigerator', 'Dishes and silverware', 'Free street parking', 'Elevator', 'Paid parking off premises', 'Long term stays allowed', 'Host greets you', 'Unavailable: Washer', 'Unavailable: Air conditioning', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | 1 sofa bed | 1 double bed | nan |
14 | Aristotelous 8th floor 1bd apt with wonderful view | ['Hot water', 'Shower gel', 'Free washer – In unit', 'Essentials', 'Hangers', 'Bed linens', 'Iron', 'Drying rack for clothing', 'Clothing storage', 'TV', 'Pack ’n play/Travel crib - available upon request', 'Air conditioning', 'Heating', 'Wifi', 'Dedicated workspace', 'Kitchen', 'Refrigerator', 'Microwave', 'Cooking basics', 'Dishes and silverware', 'Stove', 'Hot water kettle', 'Coffee maker', 'Baking sheet', 'Coffee', 'Dining table', 'Private patio or balcony', 'Outdoor furniture', 'Paid parking off premises', 'Pets allowed', 'Luggage dropoff allowed', 'Long term stays allowed', 'Self check-in', 'Lockbox', 'Unavailable: Security cameras on property', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | nan | 2 double beds | 1 sofa bed |
Since there is no expected output specified, just some additional thoughts:
If you would like your amenities not as a list
but as a string, simply ','.join()
them:
'amenities':','.join([i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))])
If you would like a matrix of true / false
values instead, you could manipulate your DataFrame
...
df = pd.DataFrame(data)
df = df.explode('amenities')
pd.crosstab(df['title'],df['amenities']).ne(0).rename_axis(index='title',columns=None).reset_index()
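For reference, here is the same explode / crosstab step on toy data (the listing titles and amenities below are made up), independent of any scraping:

```python
import pandas as pd

# Hypothetical scrape results: one amenity list per listing
data = [
    {'title': 'Listing A', 'amenities': ['Wifi', 'TV']},
    {'title': 'Listing B', 'amenities': ['Wifi', 'Kitchen']},
]

df = pd.DataFrame(data).explode('amenities')          # one row per (title, amenity) pair
matrix = (pd.crosstab(df['title'], df['amenities'])   # counts per title/amenity
            .ne(0)                                    # counts -> True/False
            .rename_axis(index='title', columns=None)
            .reset_index())
print(matrix)
```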
Output:
title 32" HDTV 32" HDTV with standard cable 32" TV AC - split type ductless system Air conditioning Babysitter recommendations Backyard Baking sheet ... Unavailable: Kitchen Unavailable: Private entrance Unavailable: Security cameras on property Unavailable: Shampoo Unavailable: Smoke alarm Unavailable: TV Unavailable: Washer Washer Wifi Wine glasses
0 #SKGH Amaryllis luxury suite -NearHELEXPO False False False False False True False False False ... False True False False True False False True True False
1 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY True False False False False True False False False ... True True True False False False True False True False
2 ASOPOO STUDIO True False False False False False False False False ... False True False False True False True False True False
...
Answered By - HedgeHog