Issue
I am trying to scrape a website using a crawl spider. When I run the crawl from the command line, I get a TypeError: start_requests() takes 1 positional argument but 3 were given. I checked the middleware settings, where def process_start_requests(self, start_requests, spider) does take 3 arguments. I referred to this question - scrapy project middleware - TypeError: process_start_requests() takes 2 positional arguments but 3 were given - but have not been able to solve the issue.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy import Request


class FpSpider(CrawlSpider):
    name = 'fp'
    allowed_domains = 'foodpanda.com.bd'
    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
                  callback='parse_items', follow=True, process_request='start_requests'),)

    def start_requests(self):
        yield Request(url='https://www.foodpanda.com.bd/darkstore/vbpl/pandamart-gulshan-2',
                      meta=dict(playwright=True),
                      headers={
                          'sec-ch-ua': '"Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"',
                          'Accept': 'application/json, text/plain, */*',
                          'Referer': 'https://www.foodpanda.com.bd/',
                          'sec-ch-ua-mobile': '?0',
                          'X-FP-API-KEY': 'volo',
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
                          'sec-ch-ua-platform': '"macOS"'
                      })

    def parse_items(self, response):
        item = {}
        item['name'] = response.css('h1.name::text').get()
        item['price'] = response.css('div.price::text').get()
        item['original_price'] = response.css('div.original-price::text').get()
        yield item
The error looks like this:

TypeError: start_requests() takes 1 positional argument but 3 were given
Solution
The problem is this statement: process_request='start_requests'.

start_requests is a reserved method name that Scrapy uses to generate the very first requests of a crawl, and it accepts only self. When you pass its name to the rule's process_request, the CrawlSpider calls that bound method with (request, response), so it receives three positional arguments in total - hence the TypeError. If you want to enable Playwright for the subsequent requests, which I assume is what you are trying to do with process_request, you need to use a different name for that function.
See the following code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def enable_playwright(request, response):
    # Enable Playwright for every request extracted by the rule
    request.meta["playwright"] = True
    return request


class FpSpider(CrawlSpider):
    name = "fp"
    allowed_domains = ["foodpanda.com.bd"]
    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
                  callback='parse_items',
                  follow=True,
                  process_request=enable_playwright  # Note a different function name
                  # process_request='start_requests'  # THIS was the problem
                  ),)

    # Rest of the code (start_requests, parse_items) here
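Also keep in mind that meta={"playwright": True} only has an effect if the scrapy-playwright download handlers are wired into the project settings. A minimal sketch, assuming the scrapy-playwright package is installed:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"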
Also note that allowed_domains is a list, not a string.
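This matters because Scrapy iterates over allowed_domains when building its offsite filter, and iterating a string yields single characters rather than whole domains. A quick illustration in a Python shell:

>>> list('foodpanda.com.bd')[:4]   # a string iterates character by character
['f', 'o', 'o', 'd']
>>> list(['foodpanda.com.bd'])     # a list iterates as intended
['foodpanda.com.bd']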
Answered By - Upendra