Issue
I'm working on a Scrapy spider to scrape data from multiple pages of a website. The goal is to crawl through the pages of each start URL, but I want the spider to stop after a maximum number of pages per start URL. However, the spider is not working as expected and does not crawl through all the pages.
I've tried to implement the count management using a dictionary to keep track of the number of pages crawled for each URL. Here is my current implementation:
def parse(self, response):
    # Get the current count from the request's
    self.counts = {url: 1 for url in self.start_urls}  # Initialize count for each start URL
    # Check if the count has reached 100
    count = self.counts[response.url]
    if count > 100:
        return  # Stop crawling further
    # Increment the count for the next page
    self.counts[response.url] += 1
    # Parse the items on the current page
    for result in response.xpath(".//h2/a"):
        yield scrapy.Request(url=result.xpath("@href").extract_first(), callback=self.parse_item)
    # Generate the URL for the next page and request it
    next_page_url = response.url + f"?page={count}"
    yield scrapy.Request(next_page_url, callback=self.parse)
The spider seems to start and crawl some pages, but it stops before reaching all pages for each start URL. I'm not sure where I am going wrong. How can I modify the spider to ensure it crawls through all pages up to page 100 for each start URL? Any help or insights would be greatly appreciated.
Thank you in advance!
Solution
There are a number of issues with your approach.
First, you are initialising the counting dictionary inside the parse method. Every time parse is called, self.counts is rebuilt from scratch, discarding whatever counts were accumulated by previous calls.
On the very next line you look up the count for the current URL, so the result is either 1 or a KeyError in the case where response.url is not among the start_urls.
You then check whether the count is greater than 100, which will always be False: if the lookup survived the previous step, the URL is one of the start_urls, and its count was set to 1 just two instructions earlier. Finally, you build the URL for the next page and create a request handled by the same parse method. That next-page URL is almost certainly not among the start_urls, which all but guarantees the KeyError mentioned above.
So what you have really done is create a setup that all but guarantees you never get past the first page of any URL in start_urls, and the counter never has a chance to reach 3, let alone 100.
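To make the first problem concrete, here is a stripped-down, non-Scrapy illustration (the URLs are just placeholders): because the dictionary is rebuilt on every call, the stored count can never advance past 1, and any URL outside start_urls raises a KeyError.
class Demo:
    start_urls = ["https://example.com/list"]

    def parse(self, url):
        # rebuilt on every call, so previous counts are thrown away
        self.counts = {u: 1 for u in self.start_urls}
        count = self.counts[url]   # 1, or KeyError for any other URL
        self.counts[url] += 1      # incremented to 2, then lost on the next call
        return count

demo = Demo()
print(demo.parse("https://example.com/list"))           # 1
print(demo.parse("https://example.com/list"))           # still 1
print(demo.parse("https://example.com/list?page=1"))    # raises KeyError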
A better alternative would be to initialise the counter dictionary outside of the parse method, as a class attribute of the spider rather than something rebuilt per call. Even that would not achieve your goal on its own, though: each next-page URL generated from a start URL is unique, so a dictionary keyed by response.url never accumulates a count in the way you are seeking. You could patch that up by keying the dictionary on the originating start URL and passing that key along with every request, as sketched below, but it is more bookkeeping than the problem needs.
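A rough sketch of that keyed-dictionary variant might look like the following (the class name and start URL are placeholders, and it assumes the site paginates with a ?page=N query parameter):
import scrapy

class CountingSpider(scrapy.Spider):
    # Hypothetical sketch: one shared dictionary, keyed by the originating
    # start URL, with that key carried along in request.meta.
    name = "countingspider"
    start_urls = ["https://example.com/list"]  # placeholder
    page_counts = {}  # class attribute shared by every callback

    def parse(self, response):
        # the first response for each start URL has no "origin" yet
        origin = response.meta.get("origin", response.url)
        count = self.page_counts.setdefault(origin, 1)
        if count > 100:
            return
        self.page_counts[origin] = count + 1
        # (item links would be followed here, as in your parse method)
        # assumes a ?page=N query parameter for pagination
        next_page_url = response.urljoin(f"?page={count}")
        yield scrapy.Request(next_page_url, callback=self.parse,
                             meta={"origin": origin})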
A better and simpler alternative is to override the start_requests method and include a counter in the cb_kwargs parameter of each initial request; you then increment that counter manually and pass it along to each next page until it reaches 100.
For example:
import scrapy

class MySpider(scrapy.Spider):
    name = "spidername"
    start_urls = [...]

    def start_requests(self):
        for url in self.start_urls:
            # start every chain of pages with a counter of 1
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={"count": 1})

    def parse(self, response, count=None):
        # stop once 100 pages have been processed for this start URL
        if count > 100:
            return
        # parse the items on the current page
        for result in response.xpath(".//h2/a"):
            yield scrapy.Request(url=result.xpath("@href").extract_first(), callback=self.parse_item)
        # build the next page URL relative to the current one so the ?page
        # query string is replaced rather than appended to
        next_page_url = response.urljoin(f"?page={count}")
        yield scrapy.Request(next_page_url, callback=self.parse, cb_kwargs={"count": count + 1})
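If you want to try this from a plain Python script rather than with the scrapy crawl command, a minimal runner could look like this (the LOG_LEVEL setting is just an example):
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes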
Answered By - Alexander