Issue
I have been trying to write a data scraper for an online shop that sells cables and other goods, and I wrote some simple code that I expected to work. The shop's products are divided into categories, and I started with the first category, cables.
for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)
This works fine for the first two pages, i = 0 and i = 1 (I get status code 200), but whenever I request the other pages (2 and up) I get error 500, and I have no idea why, especially since the same links open normally in a browser. I even tried randomizing the delay between requests. Any idea what the problem might be? Should I try a different web scraping library? The full code is below:
import requests
from fake_useragent import UserAgent
import pandas as pd
from bs4 import BeautifulSoup
import time
import random
products = [] # List to store name of the product
MIN = [] # Manufacturer item number
prices = [] # List to store price of the product
df = pd.DataFrame()
user_agent = UserAgent()
i = 0
for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)
    #print(url)
    # getting the response from the page using get method of requests module
    page = requests.get(url, headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"})
    #print(page.status_code)
    # storing the content of the page in a variable
    html = page.content
    # creating BeautifulSoup object
    page_soup = BeautifulSoup(html, "html.parser")
    #print(page_soup.prettify())
    for containers in page_soup.findAll('div', {'class': 'styles__ProductsListItem-vrexg1-2 gkrzX'}):
        name = containers.find('label', attrs={'class': 'styles__Label-sc-1x6v2mz-2 gmFpMA label'})
        price = containers.find('span', attrs={'class': 'styles__PriceValue-sc-33rfvt-10 fVFAzY'})
        man_it_num = containers.find('div', attrs={'title': 'Indeks producenta'})
        formatted_name = name.text.replace('Dodaj do koszyka: ', '')
        products.append(formatted_name)
        prices.append(price.text)
        MIN.append(man_it_num.text)
    df = pd.DataFrame({'Product Name': products, 'Price': prices, 'MIN': MIN})
    time.sleep(random.randint(2, 11))
#df.to_excel('output.xlsx', sheet_name='Kable i przewody')
Solution
The pages beyond the first ones are loaded dynamically through an API, so to get all the data you have to query that API directly instead of the HTML pages.
Example:
import pandas as pd
import requests
api_url = 'https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{p}'
headers = {
    'user-agent': 'Mozilla/5.0',
    'referer': 'https://onninen.pl/produkty/Kable-i-przewody?query=/strona:2',
    'cookie': '_gid=GA1.2.1022119173.1663690794; _fuid=60a315c76d054fd5add850c7533f529e; _gcl_au=1.1.1522602410.1663690804; pollsvisible=[]; smuuid=1835bb31183-22686567c511-4116ddce-c55aa071-2639dbd6-ec19e64a550c; _smvs=DIRECT; poll_random_44=1; poll_visited_pages=2; _ga=GA1.2.1956280663.1663690794; smvr=eyJ2aXNpdHMiOjEsInZpZXdzIjo3LCJ0cyI6MTY2MzY5MjU2NTI0NiwibnVtYmVyT2ZSZWplY3Rpb25CdXR0b25DbGljayI6MCwiaXNOZXdTZXNzaW9uIjpmYWxzZX0=; _ga_JXR5QZ2XSJ=GS1.1.1663690794.1.1.1663692567.0.0.0'
}
dfs = []
for p in range(1, 28):
    d = requests.get(api_url.format(p=p), headers=headers).json()['items'][0]['items']
    df = pd.DataFrame(d)
    dfs.append(df)
df = pd.concat(dfs)
print(df)
Output:
id slug index catalogindex ... onntopcb isnew qc ads
0 147774 KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x... HES890 112271067D0500 ... 0 False None None
1 45315 KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x... HES893 112271068D0500 ... 0 False None None
2 169497 KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x... HES896 112271069D0500 ... 0 False None None
3 141820 KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x... HES900 112271056D0500 ... 0 False None None
4 47909 KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x... HES903 112271064D0500 ... 0 False None None
.. ... ... ... ... ... ... ... ... ...
37 111419 NVENT-RAYCHEM-Kabel-grzejny-EM2-XR-samoreguluj... HDZ938 449561-000 ... 0 True None None
38 176526 NVENT-RAYCHEM-Przewod-stalooporowy-GM-2CW-35m-... HEA099 SZ18300102 ... 0 False None None
39 38484 DEVI-Mata-grzewcza-DEVIheat-150S-150W-m2-375W-... HAJ162 140F0332 ... 1 False None None
40 60982 DEVI-Mata-grzewcza-DEVImat-150T-150W-m2-375W-0... HAJ157 140F0448 ... 1 False None None
41 145612 DEVI-Czujnik-Devireg-850-rynnowy-czujnik-140F1... HAJ212 140F1086 ... 0 False None None
[1292 rows x 27 columns]
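If a page fails or the page count changes, the loop above raises an error partway through. The following is only a sketch of a more defensive variant, not part of the original answer: it assumes the response keeps the `['items'][0]['items']` shape used above and that a page with no products marks the end of the listing. The names `extract_items`, `fetch_all_items` and `requests_get_json` are hypothetical helpers, and the site may still require the `referer` and `cookie` headers from the example above.

```python
API_URL = "https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{p}"  # URL from the answer above


def extract_items(payload):
    """Return the product list from one API response, or [] if the expected keys are missing."""
    groups = payload.get("items") or []
    if not groups:
        return []
    return groups[0].get("items") or []


def fetch_all_items(get_json, pages):
    """Collect products page by page; get_json is any callable mapping a URL to parsed JSON."""
    rows = []
    for p in pages:
        items = extract_items(get_json(API_URL.format(p=p)))
        if not items:
            break  # assumption: an empty page means we ran past the last one
        rows.extend(items)
    return rows


def requests_get_json(url):
    """Fetch one page with requests and fail loudly on HTTP errors such as 500."""
    import requests  # imported here so the helpers above can be used without it

    # Assumption: a plain user-agent is enough; add the referer/cookie headers if the site rejects this.
    response = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling `fetch_all_items(requests_get_json, range(1, 28))` returns one list of product dictionaries, which can be passed to `pd.DataFrame` in a single step instead of concatenating one frame per page.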
Answered By - Fazlul