Sunday, March 6, 2022

[FIXED] Multiple Pages Web Scraping with Python and Beautiful Soup


Issue

I'm trying to write code to scrape some data from pages about hotels. The final information (the hotel's name and address) should be exported to CSV. The code works, but only on one page...

import requests
import pandas as pd
from bs4 import BeautifulSoup # HTML data structure

page_url = requests.get('https://e-turysta.pl/noclegi-krakow/')
soup = BeautifulSoup(page_url.content, 'html.parser')

list = soup.find(id='nav-lista-obiektow')
items = list.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')

nazwa_noclegu = [item.find(class_='h3 et-list__details__name').get_text() for item in items]
adres_noclegu = [item.find(class_='et-list__city').get_text() for item in items]

dane = pd.DataFrame(
    {
        'nazwa' : nazwa_noclegu,
        'adres' : adres_noclegu
    }
)

print(dane)

dane.to_csv('noclegi.csv')

I tried a loop, but it doesn't work:

for i in range(22):
    url = requests.get('https://e-turysta.pl/noclegi-krakow/'.format(i+1)).text
    soup = BeautifulSoup(url, 'html.parser')

Any ideas?

Solution

The page URLs are different from the one you use - you forgot ?page=.

And you have to put {} in the string as a placeholder, otherwise format() has nowhere to insert the value

url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(i+1)

or concatenate it

url = 'https://e-turysta.pl/noclegi-krakow/?page=' + str(i+1)

or use an f-string

url = f'https://e-turysta.pl/noclegi-krakow/?page={i+1}'
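This also explains why the loop in the question kept fetching the same page: when a string has no {} placeholder, format() ignores its argument and returns the string unchanged. A minimal sketch using only the standard library is below; build_url is a hypothetical helper name, not part of the answer, and urllib.parse.urlencode is just one more way to build the query string:

```python
from urllib.parse import urlencode

BASE_URL = 'https://e-turysta.pl/noclegi-krakow/'

# No {} placeholder: format() ignores its argument
# and returns the string unchanged.
unchanged = BASE_URL.format(5)
print(unchanged == BASE_URL)  # True


def build_url(page):
    """Hypothetical helper: append ?page=<page> to the listing URL."""
    return BASE_URL + '?' + urlencode({'page': page})


print(build_url(5))  # https://e-turysta.pl/noclegi-krakow/?page=5
```

urlencode also escapes special characters, which may help if more query parameters are added later.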

EDIT: working code

import requests
from bs4 import BeautifulSoup # HTML data structure
import pandas as pd

def get_page_data(number):
    print('number:', number)
    
    url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(number)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    container = soup.find(id='nav-lista-obiektow')
    items = container.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')

    # keep each name/address pair together in one row, so a default value
    # can be added if nazwa or adres is missing
    dane = []
    
    for item in items:
        nazwa = item.find(class_='h3 et-list__details__name').get_text(strip=True)
        adres = item.find(class_='et-list__city').get_text(strip=True)
        dane.append([nazwa, adres])
        
    return dane

# --- main ---

wszystkie_dane = []
for number in range(1, 23):
    dane_na_stronie = get_page_data(number)
    wszystkie_dane.extend(dane_na_stronie)

dane = pd.DataFrame(wszystkie_dane, columns=['nazwa', 'adres'])

dane.to_csv('noclegi.csv', index=False)
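
The code above mentions default values for missing fields and hard-codes 22 pages. A rough sketch of how both points might be handled is below. text_or_default and collect_pages are hypothetical helpers, not part of the answer. The sketch assumes that find() returns None when nothing matches (as BeautifulSoup does) and that a page past the last one yields an empty list; the second assumption has not been checked against this site.

```python
def text_or_default(element, default=''):
    """Return the stripped text of a find() result, or default if it is None."""
    if element is None:
        return default
    return element.get_text(strip=True)


def collect_pages(get_page, max_pages=22):
    """Call get_page(1), get_page(2), ... and stop at the first empty page."""
    rows = []
    for number in range(1, max_pages + 1):
        page_rows = get_page(number)
        if not page_rows:  # assumption: an empty list means no more results
            break
        rows.extend(page_rows)
    return rows
```

Inside get_page_data this could be used as nazwa = text_or_default(item.find(class_='h3 et-list__details__name')), and the main loop could become wszystkie_dane = collect_pages(get_page_data).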

Answered By - furas

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
