Issue
I manage to scrape all the data from the Airbnb landing page (price, name, ratings, etc.), and I also know how to use a loop with the pagination in order to scrape data from multiple pages.
What I would like to do is scrape data for each specific listing, i.e. the data inside the listing page (description, amenities, etc.).
My idea was to apply the same logic as for the pagination, since I already have a list
of links, but I'm struggling to understand how to do it.
Here is the code to scrape the links:
Imports
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
Getting the page
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(airbnb_url)
driver.maximize_window()
time.sleep(5)
Scraping links
links = []
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div[class="c4mnd7m dir dir-ltr"]'):
    links.append('https://www.airbnb.com' + card.select_one('a[class="ln2bl2p dir dir-ltr"]')['href'])
What I used to extract the "where to sleep" section is this, but I am probably using the wrong tag:
amenities = []
for url in links:
    driver.get(url)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    for amenity in soup1.select('div[class="t2pjd0h dir dir-ltr"]'):
        amenities.append(amenity.select_one('div[class="_1r21qb98"]'))
That was my first question; the other one is whether anyone knows how I can scrape the availability of each listing.
Thanks a lot!
Solution
BeautifulSoup is convenient in some situations, but it is not always needed in your scraping process. Also avoid selecting elements by dynamic class names, and replace time.sleep() with Selenium waits (WebDriverWait).
To iterate over all listing pages, I would recommend a while loop to keep your script generic: check in every iteration whether a next page is available, and break out of the loop otherwise. This eliminates the need to manually count pages and entries, as well as the use of a static range().
try:
    next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button + a'))).get_attribute('href')
except Exception:
    next_page = None

#### process your data

if next_page:
    airbnb_url = next_page
else:
    break
To scrape all of the amenities you have to open the modal via button click:
[i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))]
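Each row in the amenities modal can contain extra lines (e.g. a description or an "Unavailable" note), which is why split('\n')[0] keeps only the first line of each element's text. A minimal sketch with plain strings standing in for the elements' .text (the example row texts are made up):

```python
# Toy stand-ins for the .text of each amenity row element
row_texts = ['Wifi', 'Kitchen\nUnavailable: Kitchen', 'TV\n32" HDTV']

# Keep only the first line of each row's text
amenities = [t.split('\n')[0] for t in row_texts]
print(amenities)  # ['Wifi', 'Kitchen', 'TV']
```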
Note: To avoid errors caused by other elements intercepting the clicks, check whether you have to handle cookie banners first.
To extract the bedroom information, rely on more static markers such as ids or the HTML structure, and also check whether the element is available. These lines extract all the info in this section and build a dict from each heading and value:
if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div'):
    sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div').stripped_strings)
    d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
else:
    d.update({'Bedroom': None})
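To see how the zip of the even- and odd-indexed slices pairs headings with values, here is a toy run with a hand-written sleep_areas list (in the real script the list comes from stripped_strings):

```python
# Alternating heading / value strings, as stripped_strings would yield them
sleep_areas = ['Common space', '1 sofa bed', 'Bedroom', '1 double bed']

# Pair every even-indexed heading with the value that follows it
d = dict(zip(sleep_areas[0::2], sleep_areas[1::2]))
print(d)  # {'Common space': '1 sofa bed', 'Bedroom': '1 double bed'}
```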
Example
Just to point in a direction, and so that not everybody has to run a full scrape, I limited this example to urls[:1] objects per page; simply remove the [:1] to get all results.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--lang=en")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
airbnb_url = 'https://www.airbnb.com/s/Thessaloniki--Greece/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJ7eAoFPQ4qBQRqXTVuBXnugk&query=Thessaloniki%2C%20Greece&date_picker_type=calendar&search_type=unknown'
driver.maximize_window()

data = []
while True:
    driver.get(airbnb_url)
    # collect the listing urls, deduplicated via set()
    urls = list(set(a.get_attribute('href') for a in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[itemprop="itemListElement"] a')))))
    try:
        next_page = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="EXPLORE_NUMBERED_PAGINATION:TAB_ALL_HOMES"] button + a'))).get_attribute('href')
    except Exception:
        next_page = None
    print('Scraping listings, next page: ' + str(next_page))

    for url in urls[:1]:
        driver.get(url)
        # open the amenities modal so all amenities are in the page source
        WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-section-id="AMENITIES_DEFAULT"] button'))).click()
        soup = BeautifulSoup(driver.page_source, 'lxml')
        d = {
            'title': soup.h1.text,
            'amenities': [i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-testid="modal-container"] [id$="-row-title"]')))]
        }
        if soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div'):
            sleep_areas = list(soup.select_one('[data-section-id="SLEEPING_ARRANGEMENT_DEFAULT"] div+div').stripped_strings)
            d.update(dict(zip(sleep_areas[0::2], sleep_areas[1::2])))
        else:
            d.update({'Bedroom': None})
        data.append(d)

    if next_page:
        airbnb_url = next_page
    else:
        break

pd.DataFrame(data)
Output
| title | amenities | Common space | Bedroom | Living room |
---|---|---|---|---|---|
0 | 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY | ['', 'Shampoo', 'Essentials', 'Hangers', 'Iron', 'TV', 'Air conditioning', 'Heating', 'Smoke alarm', 'Carbon monoxide alarm', 'Wifi', 'Dedicated workspace', 'Cooking basics', 'Long term stays allowed', 'Unavailable: Security cameras on property', 'Unavailable: Kitchen', 'Unavailable: Washer', 'Unavailable: Private entrance'] | 1 sofa bed | nan | nan |
4 | ASOPOO STUDIO | ['Hair dryer', 'Shampoo', 'Hot water', 'Essentials', '', 'Bed linens', 'Iron', 'TV', 'Heating', 'Wifi', 'Kitchen', 'Refrigerator', 'Dishes and silverware', 'Free street parking', 'Elevator', 'Paid parking off premises', 'Long term stays allowed', 'Host greets you', 'Unavailable: Washer', 'Unavailable: Air conditioning', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | 1 sofa bed | 1 double bed | nan |
14 | Aristotelous 8th floor 1bd apt with wonderful view | ['Hot water', 'Shower gel', 'Free washer – In unit', 'Essentials', 'Hangers', 'Bed linens', 'Iron', 'Drying rack for clothing', 'Clothing storage', 'TV', 'Pack ’n play/Travel crib - available upon request', 'Air conditioning', 'Heating', 'Wifi', 'Dedicated workspace', 'Kitchen', 'Refrigerator', 'Microwave', 'Cooking basics', 'Dishes and silverware', 'Stove', 'Hot water kettle', 'Coffee maker', 'Baking sheet', 'Coffee', 'Dining table', 'Private patio or balcony', 'Outdoor furniture', 'Paid parking off premises', 'Pets allowed', 'Luggage dropoff allowed', 'Long term stays allowed', 'Self check-in', 'Lockbox', 'Unavailable: Security cameras on property', 'Unavailable: Smoke alarm', 'Unavailable: Carbon monoxide alarm', 'Unavailable: Private entrance'] | nan | 2 double beds | 1 sofa bed |
Since there is no expected output specified, just some additional thoughts:
If you would like your amenities not as a list
but as a string, simply ','.join()
them:
'amenities':','.join([i.text.split('\n')[0] for i in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'[data-testid="modal-container"] [id$="-row-title"]')))])
If you would like a matrix of true / false
values instead, you could manipulate your DataFrame
...
df = pd.DataFrame(data)
df = df.explode('amenities')
pd.crosstab(df['title'],df['amenities']).ne(0).rename_axis(index='title',columns=None).reset_index()
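For reference, here is the same explode / crosstab step on toy data (the listing titles and amenities below are made up), independent of any scraping:

```python
import pandas as pd

# Hypothetical scrape results: one amenity list per listing
data = [
    {'title': 'Listing A', 'amenities': ['Wifi', 'TV']},
    {'title': 'Listing B', 'amenities': ['Wifi', 'Kitchen']},
]

df = pd.DataFrame(data).explode('amenities')          # one row per (title, amenity) pair
matrix = (pd.crosstab(df['title'], df['amenities'])   # counts per title/amenity
            .ne(0)                                    # counts -> True/False
            .rename_axis(index='title', columns=None)
            .reset_index())
print(matrix)
```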
Output:
title 32" HDTV 32" HDTV with standard cable 32" TV AC - split type ductless system Air conditioning Babysitter recommendations Backyard Baking sheet ... Unavailable: Kitchen Unavailable: Private entrance Unavailable: Security cameras on property Unavailable: Shampoo Unavailable: Smoke alarm Unavailable: TV Unavailable: Washer Washer Wifi Wine glasses
0 #SKGH Amaryllis luxury suite -NearHELEXPO False False False False False True False False False ... False True False False True False False True True False
1 8 NETFLIX BELGIUM HELLEXPO UNIVERSITY True False False False False True False False False ... True True True False False False True False True False
2 ASOPOO STUDIO True False False False False False False False False ... False True False False True False True False True False
...
Answered By - HedgeHog