Issue
I am trying to learn Scrapy by crawling a website, but I get a 403 response when I run the crawl.
This is my spider:
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags


def remove_currency(value):
    # Strip the pound sign and surrounding whitespace from a price string
    return value.replace('£', '').strip()


class WhiskyscraperItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(remove_tags, remove_currency), output_processor=TakeFirst())
    link = scrapy.Field()


class WhiskeySpider(scrapy.Spider):
    name = 'whisky'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock']

    def parse(self, response):
        for products in response.css('div.product-item-info'):
            l = ItemLoader(item=WhiskyscraperItem(), selector=products)
            l.add_css('name', 'a.product-item-link')
            l.add_css('price', 'span.price')
            l.add_css('link', 'a.product-item-link::attr(href)')
            yield l.load_item()

        # .attrib is an empty dict when nothing matches, so use .get() to avoid a KeyError on the last page
        next_page = response.css('a.action.next').attrib.get('href')
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
I don't know if I am doing something wrong. The code itself seems fine, but every request is denied with a 403 response. What can I do?
Solution
@Barry the Platipus has already pointed out that the website is behind Cloudflare protection, so plain HTTP requests won't work here. The usual rule of thumb in that case is to try either cloudscraper or Selenium. I tried both cloudscraper and Scrapy with Selenium (scrapy-SeleniumRequest), and neither worked: scrapy-SeleniumRequest returns a 200 response status but empty output, with only some Cloudflare messages in the page. Only the original Selenium engine combined with BeautifulSoup works like a charm!
Working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import pandas as pd

webdriver_service = Service("./chromedriver")  # Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

data = []
for page in range(0, 8):
    driver.get(f'https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock&p={page}')
    driver.maximize_window()
    time.sleep(8)  # give the Cloudflare check and the page time to finish loading

    # Parse the rendered HTML that the real browser ended up with
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.select('div[class="products wrapper grid products-grid"] > ol > li > div.product-item-info'):
        title = card.h3.get_text(strip=True)
        price = card.select_one('span.price').get_text(strip=True) if card.select_one('span.price') else None
        link = card.a.get('href')
        data.append({
            'title': title,
            'price': price,
            'link': link
        })

driver.quit()  # close the browser once all pages are collected

df = pd.DataFrame(data)
print(df)
Output:
title ... link
0 Bunnahabhain 12 Year Old Cask Strength 2022 ... https://www.whiskyshop.com/bunnahabhain-12-yea...
1 Lagavulin 12 Year Old Special Release 2022 ... https://www.whiskyshop.com/lagavulin-12-year-o...
2 Johnnie Walker Ghost & Rare Port Dundas ... https://www.whiskyshop.com/johnnie-walker-ghos...
3 Cardhu 16 Year Old Special Release 2022 ... https://www.whiskyshop.com/cardhu-16-year-old-...
4 Speyside #1 50 Year Old Batch 5 That Boutique-... ... https://www.whiskyshop.com/speyside-1-50-year-...
.. ... ... ...
795 Glenkeir Treasure Carribean Blended Rum ... https://www.whiskyshop.com/glenkeir-carribean-...
796 The Loch Fyne Caol Ila 10 Year Old 2010 ... https://www.whiskyshop.com/loch-fyne-caol-ila-...
797 Glen Moray Warehouse 1 1998 Barolo Finish ... https://www.whiskyshop.com/glen-moray-warehous...
798 Cardhu 14 Year Old Diageo Special Release 2021 ... https://www.whiskyshop.com/cardhu-14-year-old-...
799 Lagavulin 26 Year Old Diageo Special Release 2021 ... https://www.whiskyshop.com/lagavulin-26yo-diag...
[800 rows x 3 columns]
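The rows collected in data keep the price as raw text, and the question's spider already had the idea of stripping the £ sign. As a rough sketch of that clean-up step in plain Python: the helper names clean_price and dedupe_by_link and the sample rows are made up for illustration, and the de-duplication is only useful on the assumption that two page numbers (for example p=0 and p=1) can return the same listing, which I have not verified for this site.

```python
from decimal import Decimal, InvalidOperation


def clean_price(raw):
    """Turn a scraped price string such as '£1,250.00' into a Decimal, or None."""
    if raw is None:
        return None
    text = raw.replace('£', '').replace(',', '').strip()
    try:
        return Decimal(text)
    except InvalidOperation:
        # Not a number after clean-up (e.g. empty text), so treat it as missing
        return None


def dedupe_by_link(rows):
    """Keep the first row seen for each link, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row['link'] in seen:
            continue
        seen.add(row['link'])
        unique.append(row)
    return unique


# Hypothetical sample rows in the same shape as the `data` list above
sample = [
    {'title': 'Example Malt 12', 'price': '£45.00', 'link': 'https://example.com/a'},
    {'title': 'Example Malt 12', 'price': '£45.00', 'link': 'https://example.com/a'},
    {'title': 'Example Rare 50', 'price': '£1,250.00', 'link': 'https://example.com/b'},
    {'title': 'Example Sold Out', 'price': None, 'link': 'https://example.com/c'},
]

# Drop repeated links first, then replace each price string with a Decimal
cleaned = [dict(row, price=clean_price(row['price'])) for row in dedupe_by_link(sample)]
for row in cleaned:
    print(row['title'], row['price'])
```

With the real scrape, the same two helpers could be applied to data before building the DataFrame. Decimal is used here instead of float so that prices are not subject to binary rounding.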
Answered By - Fazlul