Friday, October 28, 2022

[FIXED] 403 response when using scrapy python

Tags: python, scrapy, web-crawler

Issue

I am trying to learn Scrapy by crawling a website, but I get a 403 response whenever I run the crawl.

This is my spider:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags

def remove_currency(value):
    return value.replace('£','').strip()

class WhiskyscraperItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field(input_processor = MapCompose(remove_tags), output_processor = TakeFirst())
    price = scrapy.Field(input_processor = MapCompose(remove_tags, remove_currency), output_processor = TakeFirst())
    link = scrapy.Field()

class WhiskeySpider(scrapy.Spider):
    name = 'whisky'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock']

    def parse(self, response):
        for products in response.css('div.product-item-info'):
            l = ItemLoader(item = WhiskyscraperItem(), selector=products)

            l.add_css('name', 'a.product-item-link')
            l.add_css('price', 'span.price')
            l.add_css('link', 'a.product-item-link::attr(href)')

            yield l.load_item()

        # .attrib is an empty dict when no next-page link matches, so use .get() to avoid a KeyError
        next_page = response.css('a.action.next').attrib.get('href')
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

I don't know if I am doing something wrong. The code itself seems fine, but the requests are denied with a 403 response. What can I do?

Solution

As @Barry the Platipus has already pointed out, the website is behind Cloudflare protection, so sending plain requests won't work here. The general rule of thumb in that case is to try either cloudscraper or Selenium. I tried cloudscraper, Scrapy with Selenium, and scrapy-SeleniumRequest, and none of them worked. scrapy-SeleniumRequest returns a 200 response status, but the output is empty apart from some Cloudflare text. Only plain Selenium combined with BeautifulSoup worked.
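To tell a Cloudflare block apart from a genuine page (including the "200 but empty" case described above), one option is a small check on the status code, headers, and body. This is only a rough sketch: the function name looks_like_cloudflare_block and the marker strings are assumptions made for illustration, not an official API, and Cloudflare can change its pages at any time.

```python
# Marker strings are assumptions for illustration; Cloudflare may change its pages.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "attention required")


def looks_like_cloudflare_block(status, headers, body):
    """Return True if a response looks like a Cloudflare block or challenge page."""
    # Header names are case-insensitive, so normalise names and values first.
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}
    served_by_cloudflare = "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered
    if not served_by_cloudflare:
        return False
    # A 403 or 503 from a Cloudflare-fronted site is treated as a likely block.
    if status in (403, 503):
        return True
    # A 200 status can still carry only the challenge page instead of product HTML.
    text = body.lower()
    return any(marker in text for marker in CHALLENGE_MARKERS)


# Hypothetical responses, made up for illustration.
print(looks_like_cloudflare_block(403, {"Server": "cloudflare"}, ""))
print(looks_like_cloudflare_block(200, {"Server": "nginx"}, "<html>ok</html>"))
```

The helper takes plain values rather than a response object, so it can be fed from whichever HTTP client is in use.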

Working code as an example:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.maximize_window()

data = []
for page in range(0, 8):
    driver.get(f'https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock&p={page}')
    time.sleep(8)  # Wait for the page to finish loading in the browser

    # Parse the rendered HTML with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.select('div[class="products wrapper grid products-grid"] > ol > li > div.product-item-info'):
        title = card.h3.get_text(strip=True)
        price_tag = card.select_one('span.price')  # Some cards may have no price
        price = price_tag.get_text(strip=True) if price_tag else None
        link = card.a.get('href')

        data.append({
            'title': title,
            'price': price,
            'link': link
        })

driver.quit()  # Close the browser once all pages are scraped

df = pd.DataFrame(data)
print(df)

Output:

                                                 title  ...                                               link
0          Bunnahabhain 12 Year Old Cask Strength 2022  ...  https://www.whiskyshop.com/bunnahabhain-12-yea...      
1           Lagavulin 12 Year Old Special Release 2022  ...  https://www.whiskyshop.com/lagavulin-12-year-o...      
2              Johnnie Walker Ghost & Rare Port Dundas  ...  https://www.whiskyshop.com/johnnie-walker-ghos...      
3              Cardhu 16 Year Old Special Release 2022  ...  https://www.whiskyshop.com/cardhu-16-year-old-...      
4    Speyside #1 50 Year Old Batch 5 That Boutique-...  ...  https://www.whiskyshop.com/speyside-1-50-year-...      
..                                                 ...  ...                                                ...      
795            Glenkeir Treasure Carribean Blended Rum  ...  https://www.whiskyshop.com/glenkeir-carribean-...      
796            The Loch Fyne Caol Ila 10 Year Old 2010  ...  https://www.whiskyshop.com/loch-fyne-caol-ila-...      
797          Glen Moray Warehouse 1 1998 Barolo Finish  ...  https://www.whiskyshop.com/glen-moray-warehous...      
798     Cardhu 14 Year Old Diageo Special Release 2021  ...  https://www.whiskyshop.com/cardhu-14-year-old-...      
799  Lagavulin 26 Year Old Diageo Special Release 2021  ...  https://www.whiskyshop.com/lagavulin-26yo-diag...      

[800 rows x 3 columns]
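
The question's spider stripped the currency sign with remove_currency, while the code above keeps the price as raw text. A minimal sketch of how the collected rows could be cleaned afterwards is shown here. It assumes prices look like '£45.00' or '£1,250.00' and that the same link may appear more than once; clean_rows and the sample rows are hypothetical and made up for illustration.

```python
def remove_currency(value):
    # Same helper as in the question's spider.
    return value.replace('£', '').strip()


def clean_rows(rows):
    """Drop rows with a repeated link and convert price strings to floats."""
    seen_links = set()
    cleaned = []
    for row in rows:
        if row['link'] in seen_links:
            continue  # Skip a product that was already collected
        seen_links.add(row['link'])
        price = row['price']
        if price is not None:
            # Assumes a comma is only used as a thousands separator.
            price = float(remove_currency(price).replace(',', ''))
        cleaned.append({**row, 'price': price})
    return cleaned


# Hypothetical rows, made up for illustration (not real scraped data).
sample = [
    {'title': 'Example Whisky A', 'price': '£45.00', 'link': 'https://example.com/a'},
    {'title': 'Example Whisky A', 'price': '£45.00', 'link': 'https://example.com/a'},
    {'title': 'Example Whisky B', 'price': None, 'link': 'https://example.com/b'},
    {'title': 'Example Whisky C', 'price': '£1,250.00', 'link': 'https://example.com/c'},
]
print(clean_rows(sample))
```

The cleaned list can be passed to pd.DataFrame in place of data.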

Answered By - Fazlul

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
