Issue
I'm trying to write a code to scrape some date from pages about hotels. The final information (name of the hotel and address) should be export to csv. The code works but only on one page...
import requests
import pandas as pd
from bs4 import BeautifulSoup # HTML data structure
page_url = requests.get('https://e-turysta.pl/noclegi-krakow/')
soup = BeautifulSoup(page_url.content, 'html.parser')
list = soup.find(id='nav-lista-obiektow')
items = list.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')
nazwa_noclegu = [item.find(class_='h3 et-list__details__name').get_text() for item in items]
adres_noclegu = [item.find(class_='et-list__city').get_text() for item in items]
dane = pd.DataFrame(
{
'nazwa' : nazwa_noclegu,
'adres' : adres_noclegu
}
)
print(dane)
dane.to_csv('noclegi.csv')
I tried a loop but doesn't work:
for i in range(22):
url = requests.get('https://e-turysta.pl/noclegi-krakow/'.format(i+1)).text
soup = BeautifulSoup(url, 'html.parser')
Any ideas?
Solution
Urls are different then you use - you forgot ?page=
.
And you have to use {}
to add value to string
url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(i+1)
or concatenate it
url = 'https://e-turysta.pl/noclegi-krakow/?page=' + str(i+1)
or use f-string
url = f'https://e-turysta.pl/noclegi-krakow/?page={i+1}'
EDIT: working code
import requests
from bs4 import BeautifulSoup # HTML data structure
import pandas as pd
def get_page_data(number):
print('number:', number)
url = 'https://e-turysta.pl/noclegi-krakow/?page={}'.format(number)
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
container = soup.find(id='nav-lista-obiektow')
items = container.find_all(class_='et-list__details flex-grow-1 d-flex d-md-block flex-column')
# better group them - so you could add default value if there is no nazwa or adres
dane = []
for item in items:
nazwa = item.find(class_='h3 et-list__details__name').get_text(strip=True)
adres = item.find(class_='et-list__city').get_text(strip=True)
dane.append([nazwa, adres])
return dane
# --- main ---
wszystkie_dane = []
for number in range(1, 23):
dane_na_stronie = get_page_data(number)
wszystkie_dane.extend(dane_na_stronie)
dane = pd.DataFrame(wszystkie_dane, columns=['nazwa', 'adres'])
dane.to_csv('noclegi.csv', index=False)
Answered By - furas
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.